Feature Selection & Maximum Entropy: Advanced Statistical Methods in NLP (Ling 572), January 26, 2012
TRANSCRIPT
Feature Selection & Maximum Entropy
Advanced Statistical Methods in NLP, Ling 572
January 26, 2012
Roadmap
- Feature selection and weighting
  - Feature weighting
  - Chi-square feature selection
  - Chi-square feature selection example
- HW #4
- Maximum Entropy
  - Introduction: the Maximum Entropy Principle
  - Maximum Entropy NLP examples
Feature Selection Recap
Problem: the curse of dimensionality
- Data sparseness, computational cost, overfitting
Solution: dimensionality reduction
- New feature set r' such that |r'| < |r|
- Approaches (global & local):
  - Feature extraction: new features in r' are transformations of features in r
  - Feature selection: wrapper techniques, feature scoring
Feature Weighting
For text classification, typical weights include:
- Binary: weights in {0, 1}
- Term frequency (tf): # of occurrences of term tk in document di
- Inverse document frequency (idf): idf = log(N / (1 + dfk)), where dfk = # of docs in which tk appears and N = # of docs
- tf-idf = tf * idf
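These weights can be sketched directly from the definitions above. A minimal illustration with a hypothetical three-document corpus; note that with the 1 + dfk smoothing in the denominator, a term occurring in every document gets a slightly negative idf:

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

def tf(term, doc):
    """Term frequency: # of occurrences of the term in the document."""
    return doc.count(term)

def idf(term, docs):
    """Inverse document frequency, using the slide's idf = log(N / (1 + df_k))."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(n / (1 + df))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
```

For example, "sat" appears in one document, so idf("sat") = log(3/2), while "the" appears in all three and gets idf = log(3/4) < 0.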
Chi Square
Tests for the presence/absence of a relation between random variables
- Bivariate analysis: tests 2 random variables
- Can test the strength of the relationship
- (Strictly speaking) doesn't test the direction
Chi Square Example
Can gender predict shoe choice? (Due to F. Xia)
- A: male/female (the features)
- B: shoe choice (the classes: {sandal, sneaker, ...})

|        | sandal | sneaker | leather shoe | boot | other |
|--------|--------|---------|--------------|------|-------|
| Male   | 6      | 17      | 13           | 9    | 5     |
| Female | 13     | 5       | 7            | 16   | 9     |
Comparing Distributions
Observed distribution (O):

|        | sandal | sneaker | leather shoe | boot | other |
|--------|--------|---------|--------------|------|-------|
| Male   | 6      | 17      | 13           | 9    | 5     |
| Female | 13     | 5       | 7            | 16   | 9     |

Expected distribution (E), filled in from the row and column totals:

|        | sandal | sneaker | leather shoe | boot | other | Total |
|--------|--------|---------|--------------|------|-------|-------|
| Male   | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Female | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Total  | 19     | 22      | 20           | 25   | 14    | 100   |

(Due to F. Xia)
Computing Chi Square
Expected value for a cell = row_total * column_total / table_total

X² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + ... ≈ 14.03
Calculating X²
1. Tabulate the contingency table of observed values: O
2. Compute the row and column totals
3. Compute the table of expected values, given the row/column totals (assuming no association)
4. Compute X²
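The four steps above can be sketched as follows, using the observed shoe-choice counts from the earlier slides:

```python
# Step 1: the contingency table of observed values (shoe example, due to F. Xia).
observed = [
    [6, 17, 13, 9, 5],   # Male
    [13, 5, 7, 16, 9],   # Female
]

def chi_square(observed):
    # Step 2: row, column, and table totals.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Step 3: expected counts, assuming no association.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Step 4: X^2 = sum over cells of (O - E)^2 / E.
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

print(round(chi_square(observed), 2))  # 14.03, as on the slides
```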
For a 2x2 Table
O:

|     | !ci | ci |
|-----|-----|----|
| !tk | a   | b  |
| tk  | c   | d  |

E:

|       | !ci          | ci           | Total |
|-------|--------------|--------------|-------|
| !tk   | (a+b)(a+c)/N | (a+b)(b+d)/N | a+b   |
| tk    | (c+d)(a+c)/N | (c+d)(b+d)/N | c+d   |
| Total | a+c          | b+d          | N     |
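A small sketch of the 2x2 case: the expected counts follow the E table above, and for 2x2 tables the cell-by-cell sum is algebraically equivalent to the closed form X² = N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). Cell labels are as in the O table; the sample counts are made up:

```python
def expected_2x2(a, b, c, d):
    """Expected counts, laid out as in the E table above."""
    n = a + b + c + d
    return [
        [(a + b) * (a + c) / n, (a + b) * (b + d) / n],  # row !tk
        [(c + d) * (a + c) / n, (c + d) * (b + d) / n],  # row tk
    ]

def chi_square_2x2(a, b, c, d):
    """Closed form: X^2 = N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts; the two computations agree.
a, b, c, d = 30, 10, 5, 15
e = expected_2x2(a, b, c, d)
cellwise = sum(
    (o - x) ** 2 / x
    for o, x in zip([a, b, c, d], [e[0][0], e[0][1], e[1][0], e[1][1]])
)
print(round(cellwise, 4), round(chi_square_2x2(a, b, c, d), 4))
```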
X² Test
Tests whether two random variables are independent.
- Null hypothesis: the two R.V.s are independent
- Compute the X² statistic
- Compute the degrees of freedom: df = (# rows − 1)(# cols − 1)
  - Shoe example: df = (2 − 1)(5 − 1) = 4
- Look up the probability of the X² statistic value in a X² table
- If the probability is low (below some significance level), we can reject the null hypothesis
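A sketch of the full test on the shoe data. Computing the p-value itself requires the X² distribution; here the statistic is simply compared against the tabulated critical value for df = 4 at the 0.05 significance level (9.488):

```python
observed = [
    [6, 17, 13, 9, 5],   # Male
    [13, 5, 7, 16, 9],   # Female
]

def chi_square_test(observed, critical_value):
    """Return (statistic, df, reject?) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = sum(
        (o - r * c / total) ** 2 / (r * c / total)
        for r, row in zip(row_totals, observed)
        for c, o in zip(col_totals, row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, stat > critical_value  # True: reject independence

# From a X^2 table: critical value for df = 4 at alpha = 0.05 is 9.488.
stat, df, reject = chi_square_test(observed, critical_value=9.488)
print(df, reject)  # 4 True
```

Since 14.03 > 9.488, the null hypothesis of independence is rejected at the 0.05 level: gender and shoe choice are associated in this sample.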
Requirements for the X² Test
- Events are assumed independent and drawn from the same distribution
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient values per cell: > 5
X² Example
Shared task evaluation: Topic Detection and Tracking (aka TDT)
Sub-task: the Topic Tracking Task
- Given a small number of exemplar documents (1-4), define a topic
- Create a model that allows tracking of the topic, i.e., find all subsequent documents on this topic
- Exemplars: 1-4 newswire articles, 300-600 words each
Challenges
Many news articles look alike.
- Create a profile (feature representation) that highlights terms strongly associated with the current topic and differentiates it from all other topics
- Not all documents are labeled; only a small subset belong to topics of interest
- Must differentiate from other topics AND from the 'background'
Approach
X² feature selection:
- Assume terms have a binary representation
- Positive class: term occurrences from the exemplar docs
- Negative class: term occurrences from other classes' exemplars and 'earlier' uncategorized docs
- Compute X² for the terms; retain the terms with the highest X² scores (keep the top N)
- Create one feature set per topic to be tracked
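A toy sketch of this selection step (not the actual TDT system): each term is scored with the 2x2 X² formula over term presence vs. class, and the top-N terms are kept. The example documents are hypothetical:

```python
def chi_square_2x2(a, b, c, d):
    """X^2 = N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_terms(pos_docs, neg_docs, top_n):
    """Score each term by X^2 over its 2x2 (term presence x class) table."""
    vocab = set(t for doc in pos_docs + neg_docs for t in doc)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for term in vocab:
        pos_with = sum(1 for doc in pos_docs if term in doc)
        neg_with = sum(1 for doc in neg_docs if term in doc)
        # Cells: a = (no term, neg), b = (no term, pos),
        #        c = (term, neg),    d = (term, pos)
        scores[term] = chi_square_2x2(
            n_neg - neg_with, n_pos - pos_with, neg_with, pos_with
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical exemplar docs (positive) vs. background docs (negative):
pos = [{"quake", "rescue", "aid"}, {"quake", "aid", "toll"}]
neg = [{"election", "vote"}, {"vote", "poll"}, {"quake", "poll"}]
print(select_terms(pos, neg, 2))
```

Here "aid" scores highest (it appears in every positive document and no negative one), illustrating how the score rewards terms whose presence is strongly associated with the topic.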
Tracking Approach
- Build a vector space model
- Feature weighting: tf-idf (with some modifications)
- Distance measure: cosine similarity
- For each topic, select documents scoring above a threshold
- Result: improved retrieval
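A minimal sketch of the scoring step, assuming sparse term-to-weight vectors; the profile weights and threshold here are made up, not the system's actual values:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term->weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def track(profile, doc_vectors, threshold):
    """Indices of documents whose similarity to the topic profile
    exceeds the threshold."""
    return [i for i, d in enumerate(doc_vectors)
            if cosine(profile, d) > threshold]

# Hypothetical tf-idf-style weights for one topic profile:
profile = {"quake": 2.0, "rescue": 1.5, "aid": 1.0}
docs = [
    {"quake": 1.0, "aid": 0.5},      # on topic
    {"election": 1.2, "vote": 0.8},  # off topic
]
print(track(profile, docs, threshold=0.5))  # [0]
```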
HW #4
Topic: Feature Selection for kNN
- Build a kNN classifier using Euclidean distance and cosine similarity
- Write a program to compute X² on a data set
- Use X² at different significance levels to filter features
- Compare the effects of the different feature filterings on kNN classification
Maximum Entropy
"MaxEnt":
- A popular machine learning technique for NLP
- First uses in NLP circa 1996 (Rosenfeld, Berger)
- Applied to a wide range of tasks: sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
Readings & Comments
Several readings:
- (Berger, 1996), (Ratnaparkhi, 1997), (Klein & Manning, 2003): tutorial
- Note: some of these are very 'dense'. Don't spend huge amounts of time on every detail; take a first pass before class and review after the lecture.
- Going forward, the techniques get more complex. The goal is to understand the basic model and concepts. Training is especially complex; we'll discuss it, but not implement it.
Notation Note
The notation in the literature is not entirely consistent. We'll use: input = x, output = y, pair = (x, y), consistent with Berger, 1996.
- Ratnaparkhi, 1996: input = h, output = t, pair = (h, t)
- Klein & Manning, 2003: input = d, output = c, pair = (c, d)
Joint vs. Conditional Models
Given some training data {(x, y)}, we need to learn a model Θ such that, given a new x, we can predict the label y.
Different types of models:
- Joint models (aka generative models) estimate P(x, y) by maximizing P(X, Y | Θ)
  - Most models so far: n-gram, Naïve Bayes, HMM, etc.
  - Conceptually easy to compute the weights: relative frequency
- Conditional (aka discriminative) models estimate P(y | x) by maximizing P(Y | X, Θ)
  - Models going forward: MaxEnt, SVM, CRF, ...
![Page 86: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/86.jpg)
86
Joint vs Conditional Models
Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict the label y.
Different types of models:
Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
Most models so far: n-gram, Naïve Bayes, HMM, etc.
Conceptually easy to compute the weights: relative frequency
Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
Models going forward: MaxEnt, SVM, CRF, …
Computing the weights is more complex
![Page 87: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/87.jpg)
Naïve Bayes Model
The Naïve Bayes model assumes the features f are independent of each other, given the class C
(Graphical model: class node c with arrows to feature nodes f1, f2, f3, …, fk)
![Page 92: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/92.jpg)
92
Naïve Bayes Model
Makes the assumption of conditional independence of the features given the class
However, this is generally unrealistic
P(“cuts”|politics) = pcuts
But is P(“cuts”|politics,“budget”) still = pcuts?
We would like a model that doesn’t make this assumption
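The failure of the independence assumption can be seen on a tiny made-up corpus (all counts below are hypothetical, purely for illustration): among ‘politics’ documents, observing “budget” sharply raises the probability of “cuts”.

```python
# Toy illustration (hypothetical counts) of why the Naive Bayes
# independence assumption is unrealistic: within the 'politics' class,
# seeing "budget" makes "cuts" far more likely than its marginal rate.
politics_docs = [
    {"budget", "cuts", "vote"},
    {"budget", "cuts", "tax"},
    {"budget", "cuts"},
    {"vote", "election"},
    {"election", "tax"},
]

def p(word, docs):
    # relative frequency of docs containing the word
    return sum(word in d for d in docs) / len(docs)

def p_given(word, context, docs):
    # P(word | context word present), by relative frequency
    with_ctx = [d for d in docs if context in d]
    return sum(word in d for d in with_ctx) / len(with_ctx)

print(p("cuts", politics_docs))                  # P("cuts"|politics) = 0.6
print(p_given("cuts", "budget", politics_docs))  # P("cuts"|politics,"budget") = 1.0
```

Naïve Bayes would use 0.6 in both situations; a model without the independence assumption can distinguish them.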
![Page 93: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/93.jpg)
Model Parameters
Our model: c* = argmaxc P(c) Πj P(fj|c)
Types of parameters: two:
P(c): class priors
P(fj|c): class-conditional feature probabilities
Parameters in total: |C| + |V||C|, if the features are words in vocabulary V
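As a minimal sketch (on a hypothetical three-document corpus), both parameter types can be estimated by relative frequency, and the parameter count matches |C| + |V||C|:

```python
# Sketch: estimating the two Naive Bayes parameter types by
# relative frequency on a tiny hypothetical training set.
from collections import Counter, defaultdict

training = [  # (document words, class label) -- made-up data
    (["budget", "cuts", "vote"], "politics"),
    (["budget", "tax"], "politics"),
    (["game", "score"], "sports"),
]

class_counts = Counter(label for _, label in training)
prior = {c: n / len(training) for c, n in class_counts.items()}   # P(c)

word_counts = defaultdict(Counter)
for words, label in training:
    word_counts[label].update(words)

cond = {c: {w: n / sum(wc.values()) for w, n in wc.items()}       # P(fj|c)
        for c, wc in word_counts.items()}

vocab = {w for words, _ in training for w in words}
# total parameters: |C| priors + |V|*|C| class-conditional probabilities
n_params = len(class_counts) + len(vocab) * len(class_counts)
print(prior["politics"], n_params)   # 2/3 of docs are politics; 2 + 6*2 = 14
```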
![Page 94: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/94.jpg)
94
Weights in Naïve Bayes

       c1          c2          c3          …   ck
f1     P(f1|c1)    P(f1|c2)    P(f1|c3)    …   P(f1|ck)
f2     P(f2|c1)    P(f2|c2)    …
…      …
f|V|   P(f|V||c1)              …
![Page 100: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/100.jpg)
Weights in Naïve Bayes and Maximum Entropy
Naïve Bayes: the weights P(f|y) are probabilities in [0,1]
P(y|x) = P(y) Πj P(fj|y) / Z(x)
MaxEnt: the weights are real numbers; any magnitude, any sign
P(y|x) = exp(Σj λj fj(x,y)) / Z(x)
100
![Page 104: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/104.jpg)
MaxEnt Overview
Prediction: P(y|x) = exp(Σj λj fj(x,y)) / Z(x)
fj(x,y): binary feature function, indicating the presence of feature j in instance x with class y
λj: feature weights, learned in training
Prediction: compute P(y|x) for each class, pick the highest-scoring y
104
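The prediction step above can be sketched directly. The λ weights here are made-up numbers standing in for learned values; each binary feature fires for one (word, class) pair:

```python
import math

# Minimal sketch of MaxEnt prediction with hypothetical weights:
#   P(y|x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x)
weights = {  # hypothetical learned lambdas: (word, class) -> weight
    ("budget", "politics"): 1.5,
    ("cuts", "politics"): 0.8,
    ("game", "sports"): 2.0,
}
classes = ["politics", "sports"]

def p_y_given_x(words, y):
    # unnormalized score exp(sum of firing feature weights) for each class
    scores = {c: math.exp(sum(weights.get((w, c), 0.0) for w in words))
              for c in classes}
    z = sum(scores.values())  # normalizer Z(x)
    return scores[y] / z

x = ["budget", "cuts"]
print(max(classes, key=lambda c: p_y_given_x(x, c)))  # -> politics
```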
![Page 105: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/105.jpg)
Weights in MaxEnt

       c1    c2    c3    …   ck
f1     λ1    λ8    …
f2     λ2    …
…      …
f|V|   λ6
105
![Page 110: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/110.jpg)
Maximum Entropy Principle
Intuitively: model all that is known, and assume as little as possible about what is unknown
Maximum entropy = minimum commitment
Related to concepts like Occam’s razor
Laplace’s “Principle of Insufficient Reason”: when one has no information to distinguish between the probabilities of two events, the best strategy is to consider them equally likely
110
![Page 115: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/115.jpg)
Example I: (K&M 2003)
Consider a coin flip: H(X) = −P(X=H) log P(X=H) − P(X=T) log P(X=T)
What values of P(X=H), P(X=T) maximize H(X)? P(X=H) = P(X=T) = 1/2
If there is no prior information, the best guess is a fair coin
What if you know P(X=H) = 0.3? Then P(X=T) = 0.7
115
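The coin-flip example can be checked numerically: a grid search over P(X=H) finds the entropy peak at the fair coin.

```python
import math

# Sketch: entropy of a coin, H(X) = -p*log2(p) - (1-p)*log2(1-p),
# is maximized at p = 1/2 (the fair coin).
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# grid search over P(X=H) in steps of 0.01
best = max((i / 100 for i in range(101)), key=entropy)
print(best, entropy(best))  # -> 0.5 1.0
print(entropy(0.3))         # if P(X=H)=0.3 is fixed, H is lower (~0.881 bits)
```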
![Page 118: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/118.jpg)
Example II: MT (Berger, 1996)
Task: English → French machine translation
Specifically, translating ‘in’
Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
If there is no other constraint, what is the maxent model? p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
118
![Page 124: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/124.jpg)
Example II: MT (Berger, 1996)
What if we find out the translator uses dans or en 30% of the time?
Constraint: p(dans) + p(en) = 3/10
Now what is the maxent model? p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30
What if we also know the translator picks à or dans 50% of the time? Add a new constraint: p(à) + p(dans) = 0.5. Now what is the maxent model?
Not intuitively obvious…
124
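For the first constraint set, the maxent solution (split each group uniformly) can be checked numerically: any other distribution satisfying the same constraints has lower entropy. The alternative distribution below is an arbitrary choice for comparison.

```python
import math

# Sketch checking the maxent solution for Berger's 'in' example:
# constraints are sum p = 1 and p(dans) + p(en) = 3/10.
def H(ps):
    # entropy in nats; terms with p = 0 contribute nothing
    return -sum(p * math.log(p) for p in ps if p > 0)

# maxent answer: equal split within each constrained group
maxent = [3/20, 3/20, 7/30, 7/30, 7/30]   # dans, en, a, au cours de, pendant
print(round(sum(maxent), 10))             # sums to 1.0, so constraints hold

# an arbitrary alternative that also satisfies both constraints
other = [0.1, 0.2, 0.3, 0.2, 0.2]
assert abs(other[0] + other[1] - 0.3) < 1e-9
print(H(maxent) > H(other))               # -> True: uniform split wins
```

With the extra constraint p(à) + p(dans) = 0.5 the optimum is no longer obvious by symmetry; that is exactly why iterative training algorithms are needed.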
![Page 125: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/125.jpg)
125
Example III: POS (K&M, 2003)
![Page 130: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/130.jpg)
130
Example III
Problem: too uniform
What else do we know? Nouns are more common than verbs
So fN = {NN, NNS, NNP, NNPS}, and E[fN] = 32/36
Also, proper nouns are more frequent than common nouns, so E[fNNP,NNPS] = 24/36
Etc.