Feature selection
LING 572
Fei Xia
Week 4: 1/29/08
Creating attribute-value table
• Choose features:
  – Define feature templates
  – Instantiate the feature templates
  – Dimensionality reduction: feature selection
• Feature weighting:
  – The weight for f_k: the whole column
  – The weight for f_k in d_i: a cell

        f1   f2   …   fK   y
  x1
  x2
  …
An example: text classification task
• Define feature templates:
  – One template only: word
• Instantiate the feature templates:
  – All the words that appear in the training (and test) data
• Dimensionality reduction: feature selection
  – Remove stop words
• Feature weighting:
  – Feature value: term frequency (tf) or tf-idf
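To make these steps concrete, here is a minimal Python sketch; the toy documents and stop-word list are made up for illustration:

    from collections import Counter

    # Toy corpus and stop-word list, made up for illustration
    docs = [("the movie was great", "pos"),
            ("the plot was dull", "neg")]
    stop_words = {"the", "was"}

    # Instantiate the "word" feature template, then drop stop words
    vocab = sorted({w for text, _ in docs for w in text.split()} - stop_words)

    # One row of the attribute-value table per document: tf values plus the label
    for text, label in docs:
        tf = Counter(text.split())
        print([tf[w] for w in vocab], label)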
Outline
• Dimensionality reduction
• Some scoring functions **
• Chi-square score and Chi-square test
• Hw4
In this lecture, we will use “term” and “feature” interchangeably.
Dimensionality reduction (DR)
Dimensionality reduction (DR)
• What is DR?
  – Given a feature set r, create a new set r’, s.t.
    • r’ is much smaller than r, and
    • the classification performance does not suffer too much.
• Why DR?
  – ML algorithms do not scale well.
  – DR can reduce overfitting.
Types of DR
• r is the original feature set, r’ is the one after DR.
• Local DR vs. global DR
  – Global DR: r’ is the same for every category
  – Local DR: a different r’ for each category
• Term extraction vs. term selection
Term selection vs. extraction
• Term selection: r’ is a subset of r
  – Wrapping methods: score terms by training and evaluating classifiers; this is expensive and classifier-dependent
  – Filtering methods
• Term extraction: terms in r’ are obtained by combinations or transformations of r terms.
  – Term clustering
  – Latent semantic indexing (LSI)
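For the LSI flavor of term extraction, a minimal sketch of the underlying idea (assuming numpy; the rank k is a free parameter):

    import numpy as np

    def lsi(term_doc_matrix, k):
        """Project documents into a k-dimensional latent space via truncated SVD."""
        U, S, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
        return S[:k] * Vt[:k].T   # one k-dimensional vector per document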
Term selection by filtering
• Main idea: scoring terms according to predetermined numerical functions that measure the “importance” of the terms.
• It is fast and classifier-independent.
• Scoring functions:
  – Information gain
  – Mutual information
  – Chi square
  – …
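A filtering method is essentially "score, sort, truncate"; a minimal sketch, where the scoring function and the ig_score lookup are placeholders for whatever function you plug in:

    def filter_terms(terms, score, k):
        """Keep the k terms with the highest scores under the given function."""
        return sorted(terms, key=score, reverse=True)[:k]

    # e.g., filter_terms(vocab, lambda t: ig_score[t], 1000)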
Quick summary so far
• DR: to reduce the number of features
  – Local DR vs. global DR
  – Term extraction vs. term selection
• Term extraction
  – Term clustering
  – Latent semantic indexing (LSI)
• Term selection
  – Wrapping method
  – Filtering method: different functions
Some scoring functions
Basic distributions (treating features as binary)

Probability distributions on the event space of documents:
  P(t_k, c_i): the probability that a document contains term t_k and belongs to category c_i
  (and analogously for the combinations involving ¬t_k and ¬c_i)

Calculating basic distributions

These are estimated from document counts, e.g., P(t_k, c_i) ≈ (number of documents in c_i that contain t_k) / N, where N is the total number of documents.
Term selection functions
• Intuition: for a category c_i, the most valuable terms are those that are distributed most differently in the sets of positive and negative examples of c_i.
Term selection functions

For example, mutual information: MI(t_k, c_i) = log [ P(t_k, c_i) / (P(t_k) P(c_i)) ]
Information gain
• IG(Y, X): We must transmit Y. How many bits on average would it save us if both ends of the line knew X?
• Definition:
  IG(Y, X) = H(Y) − H(Y|X)
Information gain
  H(Y) = −Σ_y P(y) log P(y)
  H(Y|X) = Σ_x P(x) H(Y|X = x)
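A small sketch of this computation over discrete samples (term presence as X, class label as Y; the function names are my own):

    import math
    from collections import Counter

    def entropy(labels):
        """H(Y) = -sum_y P(y) log2 P(y), estimated from a sample."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(xs, ys):
        """IG(Y, X) = H(Y) - H(Y|X) over paired discrete samples."""
        n = len(ys)
        h_cond = 0.0
        for x in set(xs):
            sub = [y for xi, y in zip(xs, ys) if xi == x]
            h_cond += len(sub) / n * entropy(sub)   # P(x) * H(Y | X = x)
        return entropy(ys) - h_cond

    # Term presence (0/1) against class labels; IG = 1 bit here
    print(info_gain([1, 1, 0, 0], ["pos", "pos", "neg", "neg"]))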
More term selection functions**
Global DR
• For local DR, calculate f(t_k, c_i).
• For global DR, calculate one of the following:
  – f_sum(t_k) = Σ_{i=1..|C|} f(t_k, c_i)
  – f_avg(t_k) = Σ_{i=1..|C|} P(c_i) f(t_k, c_i)
  – f_max(t_k) = max_i f(t_k, c_i)
  (|C| is the number of classes)
Which function works the best?
• It depends on
  – Classifiers
  – Data
  – …
• According to (Yang and Pedersen 1997), information gain and chi-square were among the most effective, document frequency performed comparably, and mutual information performed poorly.
Feature weighting
Alternative feature values
• Binary features: 0 or 1.
• Term frequency (TF): the number of times that t_k appears in d_i.
• Inverse document frequency (IDF): log(|D| / d_k), where d_k is the number of documents that contain t_k.
• TF-IDF = TF × IDF
• Normalized TF-IDF
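A minimal sketch of these weights over tokenized documents (length normalization for the last bullet is one common choice, not necessarily the slide's):

    import math
    from collections import Counter

    def tfidf(docs):
        """Length-normalized tf-idf vectors for a list of tokenized documents."""
        N = len(docs)
        df = Counter()                       # document frequency d_k per term
        for doc in docs:
            df.update(set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)                # term frequency in this document
            w = {t: tf[t] * math.log(N / df[t]) for t in tf}
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            weights.append({t: v / norm for t, v in w.items()})
        return weights

    print(tfidf([["a", "b", "a"], ["b", "c"]]))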
Feature weights
• Feature weight ∈ {0, 1}: same as DR
• Feature weight ∈ R: iterative approach
  – Ex: MaxEnt

Feature selection is a special case of feature weighting.
Summary so far
• Curse of dimensionality → dimensionality reduction (DR)
• DR:
  – Term extraction
  – Term selection
    • Wrapping method
    • Filtering method: different functions
Summary (cont)
• Functions:
  – Document frequency
  – Mutual information
  – Information gain
  – Gain ratio
  – Chi square
  – …
Chi square
Chi square
• An example: is gender a good feature for predicting footwear preference?
  – A: gender
  – B: footwear preference
• Bivariate tabular analysis:
  – Is there a relationship between two random variables A and B in the data?
  – How strong is the relationship?
  – What is the direction of the relationship?
Raw frequencies
            sandal  sneaker  leather shoe  boots  others
  male         6       17         13         9      5
  female      13        5          7        16      9

Feature: male/female
Classes: {sandal, sneaker, …}
Two distributions

Observed distribution (O):

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male        6       17       13      9      5   |   50
  Female     13        5        7     16      9   |   50
  Total      19       22       20     25     14   |  100

Expected distribution (E), to be filled in from the marginal totals:

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male        ?        ?        ?      ?      ?   |   50
  Female      ?        ?        ?      ?      ?   |   50
  Total      19       22       20     25     14   |  100
Two distributions

Observed distribution (O):

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male        6       17       13      9      5   |   50
  Female     13        5        7     16      9   |   50
  Total      19       22       20     25     14   |  100

Expected distribution (E):

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male      9.5      11       10    12.5      7   |   50
  Female    9.5      11       10    12.5      7   |   50
  Total      19      22       20      25     14   |  100
Chi square
• Expected value = row total × column total / table total
• χ² = Σ_ij (O_ij − E_ij)² / E_ij
• χ² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … = 14.026
Calculating χ²
• Fill out a contingency table of the observed values O
• Compute the row totals and column totals
• Calculate the expected value E for each cell, assuming no association
• Compute chi square: χ² = Σ (O − E)² / E
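These steps on the footwear table, as a minimal numpy sketch (the printed value agrees with the slide's 14.026 up to rounding):

    import numpy as np

    # Observed counts: rows = {male, female}, columns = footwear categories
    O = np.array([[ 6., 17., 13.,  9., 5.],
                  [13.,  5.,  7., 16., 9.]])

    row = O.sum(axis=1, keepdims=True)      # row totals
    col = O.sum(axis=0, keepdims=True)      # column totals
    E = row @ col / O.sum()                 # expected counts under no association
    chi2 = ((O - E) ** 2 / E).sum()
    print(chi2)                             # ~14.03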
When r = 2 and c = 2

O = | a  b |        E = | (a+b)(a+c)/N   (a+b)(b+d)/N |
    | c  d |            | (c+d)(a+c)/N   (c+d)(b+d)/N |,   where N = a + b + c + d

In this case χ² simplifies to N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)).
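A one-function sketch of that shortcut; it is algebraically identical to summing (O − E)²/E over the four cells:

    def chi2_2x2(a, b, c, d):
        """Chi-square for the 2x2 contingency table [[a, b], [c, d]]."""
        n = a + b + c + d
        return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))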
χ² test
Basic idea
• Null hypothesis (the hypothesis being tested): no relation exists between the two random variables.
• Calculate the probability of observing a χ² value at least this large, assuming the hypothesis is true.
• If the probability is too small, reject the hypothesis.
Requirements
• The events are assumed to be independent and have the same distribution.
• The outcomes of each event must be mutually exclusive.
• At least 5 observations per cell.
• Collect raw frequencies, not percentages.
Degrees of freedom
• df = (r − 1)(c − 1), where r = # of rows and c = # of columns
• In this example: df = (2 − 1)(5 − 1) = 4
χ² distribution table

  df   p=0.10   p=0.05   p=0.025   p=0.01   p=0.001
   1    2.706    3.841     5.024    6.635    10.828
   2    4.605    5.991     7.378    9.210    13.816
   3    6.251    7.815     9.348   11.345    16.266
   4    7.779    9.488    11.143   13.277    18.467
   5    9.236   11.070    12.833   15.086    20.515
   6   10.645   12.592    14.449   16.812    22.458
   …

df = 4 and 14.026 > 13.277 ⇒ p < 0.01 ⇒ there is a significant relation
χ² to p calculator
http://faculty.vassar.edu/lowry/tabs.html#csq
Steps of the χ² test
• Select a significance level p₀
• Calculate χ²
• Compute the degrees of freedom: df = (r − 1)(c − 1)
• Calculate p given the χ² value (or look up the critical value χ²₀ for p₀)
• If p < p₀ (or if χ² > χ²₀), then reject the null hypothesis.
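These steps are also available off the shelf; a sketch using scipy on the footwear table (assuming scipy is installed):

    from scipy.stats import chi2_contingency

    observed = [[6, 17, 13, 9, 5],
                [13, 5, 7, 16, 9]]
    chi2, p, dof, expected = chi2_contingency(observed)
    print(chi2, dof, p)   # chi2 ~14.03, df = 4, p < 0.01 -> reject the null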
Summary of the χ² test
• A very common method for significance testing
• Many good tutorials online, e.g.:
  http://en.wikipedia.org/wiki/Chi-square_distribution
Hw4
Hw4
• Q1-Q3: kNN
• Q4: chi-square for feature selection
• Q5-Q6: The effect of feature selection on kNN
• Q7: Conclusion
Q1-Q3: kNN
• The choice of k
• The choice of similarity function:
  – Euclidean distance: choose the smallest ones
  – Cosine function: choose the largest ones
• Binary vs. real-valued features
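A minimal kNN sketch covering both similarity options (assuming dense numpy vectors; the majority vote breaks ties arbitrarily):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5, metric="cosine"):
        """Classify x by majority vote among its k nearest training examples."""
        if metric == "euclidean":
            dist = np.linalg.norm(X_train - x, axis=1)
            idx = np.argsort(dist)[:k]          # smallest distances
        else:
            sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1)
                                    * np.linalg.norm(x) + 1e-12)
            idx = np.argsort(-sims)[:k]         # largest cosine similarities
        votes = [y_train[i] for i in idx]
        return max(set(votes), key=votes.count)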
Q4-Q6
• Rank features by chi-square scores
• Remove non-relevant features from the vector files
• Run kNN using the newly processed data
• Compare the results with and without feature selection.
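Putting Q4-Q6 together, a sketch of the selection step (the dict-based vector format and the chi2_score function are placeholders, not the Hw4 file spec):

    def select_top_features(chi2_score, vectors, n_kept):
        """Rank features by chi-square score, then drop the rest from each vector.
        vectors: list of {feature_name: value} dicts (a hypothetical format)."""
        all_feats = {f for vec in vectors for f in vec}
        kept = set(sorted(all_feats, key=chi2_score, reverse=True)[:n_kept])
        return [{f: v for f, v in vec.items() if f in kept} for vec in vectors]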