Feature selection
LING 572
Fei Xia
Week 4: 1/29/08
Creating attribute-value table
• Choose features:
  – Define feature templates
  – Instantiate the feature templates
  – Dimensionality reduction: feature selection
• Feature weighting:
  – The weight for f_k: the whole column
  – The weight for f_k in d_i: a cell

        f1   f2   …   fK   y
  x1
  x2
  …
An example: text classification task
• Define feature templates:
  – One template only: word
• Instantiate the feature templates:
  – All the words that appear in the training (and test) data
• Dimensionality reduction: feature selection
  – Remove stop words
• Feature weighting:
  – Feature value: term frequency (tf) or tf-idf
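To make these steps concrete, here is a minimal Python sketch; the toy documents and stop-word list are made up for illustration:

    from collections import Counter

    # Toy corpus and stop-word list, made up for illustration
    docs = [("the movie was great", "pos"),
            ("the plot was dull", "neg")]
    stop_words = {"the", "was"}

    # Instantiate the "word" feature template, then drop stop words
    vocab = sorted({w for text, _ in docs for w in text.split()} - stop_words)

    # One row of the attribute-value table per document: tf values plus the label
    for text, label in docs:
        tf = Counter(text.split())
        print([tf[w] for w in vocab], label)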
Outline
• Dimensionality reduction
• Some scoring functions **
• Chi-square score and Chi-square test
• Hw4
In this lecture, we will use “term” and “feature” interchangeably.
Dimensionality reduction (DR)
Dimensionality reduction (DR)
• What is DR?
  – Given a feature set r, create a new set r’, s.t.
    • r’ is much smaller than r, and
    • the classification performance does not suffer too much.
• Why DR?
  – ML algorithms do not scale well.
  – DR can reduce overfitting.
Types of DR
• r is the original feature set, r’ is the one after DR.
• Local DR vs. global DR
  – Global DR: r’ is the same for every category
  – Local DR: a different r’ for each category
• Term extraction vs. term selection
Term selection vs. extraction
• Term selection: r’ is a subset of r
  – Wrapping methods: score terms by training and evaluating classifiers; this is expensive and classifier-dependent
  – Filtering methods
• Term extraction: terms in r’ are obtained by combinations or transformations of r terms.
  – Term clustering
  – Latent semantic indexing (LSI)
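For the LSI flavor of term extraction, a minimal sketch of the underlying idea (assuming numpy; the rank k is a free parameter):

    import numpy as np

    def lsi(term_doc_matrix, k):
        """Project documents into a k-dimensional latent space via truncated SVD."""
        U, S, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
        return S[:k] * Vt[:k].T   # one k-dimensional vector per document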
Term selection by filtering
• Main idea: scoring terms according to predetermined numerical functions that measure the “importance” of the terms.
• It is fast and classifier-independent.
• Scoring functions:
  – Information gain
  – Mutual information
  – Chi square
  – …
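A filtering method is essentially "score, sort, truncate"; a minimal sketch, where the scoring function and the ig_score lookup are placeholders for whatever function you plug in:

    def filter_terms(terms, score, k):
        """Keep the k terms with the highest scores under the given function."""
        return sorted(terms, key=score, reverse=True)[:k]

    # e.g., filter_terms(vocab, lambda t: ig_score[t], 1000)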
Quick summary so far
• DR: to reduce the number of features
  – Local DR vs. global DR
  – Term extraction vs. term selection
• Term extraction
  – Term clustering
  – Latent semantic indexing (LSI)
• Term selection
  – Wrapping method
  – Filtering method: different functions
Some scoring functions
Basic distributions (treating features as binary)

Probability distributions on the event space of documents:
  P(t_k, c_i): the probability that a document contains term t_k and belongs to category c_i
  (and analogously for the combinations involving ¬t_k and ¬c_i)

Calculating basic distributions

These are estimated from document counts, e.g., P(t_k, c_i) ≈ (number of documents in c_i that contain t_k) / N, where N is the total number of documents.
Term selection functions
• Intuition: for a category c_i, the most valuable terms are those that are distributed most differently in the sets of positive and negative examples of c_i.
Term selection functions

For example, mutual information: MI(t_k, c_i) = log [ P(t_k, c_i) / (P(t_k) P(c_i)) ]
Information gain
• IG(Y, X): We must transmit Y. How many bits on average would it save us if both ends of the line knew X?
• Definition:
  IG(Y, X) = H(Y) − H(Y|X)
Information gain
  H(Y) = −Σ_y P(y) log P(y)
  H(Y|X) = Σ_x P(x) H(Y|X = x)
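A small sketch of this computation over discrete samples (term presence as X, class label as Y; the function names are my own):

    import math
    from collections import Counter

    def entropy(labels):
        """H(Y) = -sum_y P(y) log2 P(y), estimated from a sample."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(xs, ys):
        """IG(Y, X) = H(Y) - H(Y|X) over paired discrete samples."""
        n = len(ys)
        h_cond = 0.0
        for x in set(xs):
            sub = [y for xi, y in zip(xs, ys) if xi == x]
            h_cond += len(sub) / n * entropy(sub)   # P(x) * H(Y | X = x)
        return entropy(ys) - h_cond

    # Term presence (0/1) against class labels; IG = 1 bit here
    print(info_gain([1, 1, 0, 0], ["pos", "pos", "neg", "neg"]))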
More term selection functions**
Global DR
• For local DR, calculate f(t_k, c_i).
• For global DR, calculate one of the following:
  – f_sum(t_k) = Σ_{i=1..|C|} f(t_k, c_i)
  – f_avg(t_k) = Σ_{i=1..|C|} P(c_i) f(t_k, c_i)
  – f_max(t_k) = max_i f(t_k, c_i)
  (|C| is the number of classes)
Which function works the best?
• It depends on
  – Classifiers
  – Data
  – …
• According to (Yang and Pedersen 1997), information gain and chi-square were among the most effective, document frequency performed comparably, and mutual information performed poorly.
Feature weighting
Alternative feature values
• Binary features: 0 or 1.
• Term frequency (TF): the number of times that t_k appears in d_i.
• Inverse document frequency (IDF): log(|D| / d_k), where d_k is the number of documents that contain t_k.
• TF-IDF = TF × IDF
• Normalized TF-IDF
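A minimal sketch of these weights over tokenized documents (length normalization for the last bullet is one common choice, not necessarily the slide's):

    import math
    from collections import Counter

    def tfidf(docs):
        """Length-normalized tf-idf vectors for a list of tokenized documents."""
        N = len(docs)
        df = Counter()                       # document frequency d_k per term
        for doc in docs:
            df.update(set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)                # term frequency in this document
            w = {t: tf[t] * math.log(N / df[t]) for t in tf}
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            weights.append({t: v / norm for t, v in w.items()})
        return weights

    print(tfidf([["a", "b", "a"], ["b", "c"]]))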
Feature weights
• Feature weight ∈ {0, 1}: same as DR
• Feature weight ∈ R: iterative approach
  – Ex: MaxEnt

Feature selection is a special case of feature weighting.
Summary so far
• Curse of dimensionality → dimensionality reduction (DR)
• DR:
  – Term extraction
  – Term selection
    • Wrapping method
    • Filtering method: different functions
Summary (cont)
• Functions:
  – Document frequency
  – Mutual information
  – Information gain
  – Gain ratio
  – Chi square
  – …
Chi square
Chi square
• An example: is gender a good feature for predicting footwear preference?
  – A: gender
  – B: footwear preference
• Bivariate tabular analysis:
  – Is there a relationship between two random variables A and B in the data?
  – How strong is the relationship?
  – What is the direction of the relationship?
Raw frequencies
            sandal  sneaker  leather shoe  boots  others
  male         6       17         13         9      5
  female      13        5          7        16      9

Feature: male/female
Classes: {sandal, sneaker, …}
Two distributions

Observed distribution (O):

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male        6       17       13      9      5   |   50
  Female     13        5        7     16      9   |   50
  Total      19       22       20     25     14   |  100

Expected distribution (E), to be filled in from the marginal totals:

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male        ?        ?        ?      ?      ?   |   50
  Female      ?        ?        ?      ?      ?   |   50
  Total      19       22       20     25     14   |  100
Two distributions

Observed distribution (O):

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male        6       17       13      9      5   |   50
  Female     13        5        7     16      9   |   50
  Total      19       22       20     25     14   |  100

Expected distribution (E):

           Sandal  Sneaker  Leather  Boot  Others | Total
  Male      9.5      11       10    12.5      7   |   50
  Female    9.5      11       10    12.5      7   |   50
  Total      19      22       20      25     14   |  100
Chi square
• Expected value = row total × column total / table total
• χ² = Σ_ij (O_ij − E_ij)² / E_ij
• χ² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … = 14.026
Calculating χ²
• Fill out a contingency table of the observed values O
• Compute the row totals and column totals
• Calculate the expected value E for each cell, assuming no association
• Compute chi square: χ² = Σ (O − E)² / E
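These steps on the footwear table, as a minimal numpy sketch (the printed value agrees with the slide's 14.026 up to rounding):

    import numpy as np

    # Observed counts: rows = {male, female}, columns = footwear categories
    O = np.array([[ 6., 17., 13.,  9., 5.],
                  [13.,  5.,  7., 16., 9.]])

    row = O.sum(axis=1, keepdims=True)      # row totals
    col = O.sum(axis=0, keepdims=True)      # column totals
    E = row @ col / O.sum()                 # expected counts under no association
    chi2 = ((O - E) ** 2 / E).sum()
    print(chi2)                             # ~14.03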
When r = 2 and c = 2

O = | a  b |        E = | (a+b)(a+c)/N   (a+b)(b+d)/N |
    | c  d |            | (c+d)(a+c)/N   (c+d)(b+d)/N |,   where N = a + b + c + d

In this case χ² simplifies to N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)).
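A one-function sketch of that shortcut; it is algebraically identical to summing (O − E)²/E over the four cells:

    def chi2_2x2(a, b, c, d):
        """Chi-square for the 2x2 contingency table [[a, b], [c, d]]."""
        n = a + b + c + d
        return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))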
χ² test
Basic idea
• Null hypothesis (the hypothesis being tested): no relation exists between the two random variables.
• Calculate the probability of observing a χ² value at least this large, assuming the hypothesis is true.
• If the probability is too small, reject the hypothesis.
Requirements
• The events are assumed to be independent and have the same distribution.
• The outcomes of each event must be mutually exclusive.
• At least 5 observations per cell.
• Collect raw frequencies, not percentages.
Degrees of freedom
• df = (r − 1)(c − 1), where r = # of rows and c = # of columns
• In this example: df = (2 − 1)(5 − 1) = 4
χ² distribution table

  df   p=0.10   p=0.05   p=0.025   p=0.01   p=0.001
   1    2.706    3.841     5.024    6.635    10.828
   2    4.605    5.991     7.378    9.210    13.816
   3    6.251    7.815     9.348   11.345    16.266
   4    7.779    9.488    11.143   13.277    18.467
   5    9.236   11.070    12.833   15.086    20.515
   6   10.645   12.592    14.449   16.812    22.458
   …

df = 4 and 14.026 > 13.277 ⇒ p < 0.01 ⇒ there is a significant relation
χ² to p calculator
http://faculty.vassar.edu/lowry/tabs.html#csq
Steps of the χ² test
• Select a significance level p₀
• Calculate χ²
• Compute the degrees of freedom: df = (r − 1)(c − 1)
• Calculate p given the χ² value (or look up the critical value χ²₀ for p₀)
• If p < p₀ (or if χ² > χ²₀), then reject the null hypothesis.
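These steps are also available off the shelf; a sketch using scipy on the footwear table (assuming scipy is installed):

    from scipy.stats import chi2_contingency

    observed = [[6, 17, 13, 9, 5],
                [13, 5, 7, 16, 9]]
    chi2, p, dof, expected = chi2_contingency(observed)
    print(chi2, dof, p)   # chi2 ~14.03, df = 4, p < 0.01 -> reject the null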
Summary of the χ² test
• A very common method for significance testing
• Many good tutorials online, e.g.:
  http://en.wikipedia.org/wiki/Chi-square_distribution
Hw4
Hw4
• Q1-Q3: kNN
• Q4: chi-square for feature selection
• Q5-Q6: The effect of feature selection on kNN
• Q7: Conclusion
Q1-Q3: kNN
• The choice of k
• The choice of similarity function:
  – Euclidean distance: choose the smallest ones
  – Cosine function: choose the largest ones
• Binary vs. real-valued features
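A minimal kNN sketch covering both similarity options (assuming dense numpy vectors; the majority vote breaks ties arbitrarily):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5, metric="cosine"):
        """Classify x by majority vote among its k nearest training examples."""
        if metric == "euclidean":
            dist = np.linalg.norm(X_train - x, axis=1)
            idx = np.argsort(dist)[:k]          # smallest distances
        else:
            sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1)
                                    * np.linalg.norm(x) + 1e-12)
            idx = np.argsort(-sims)[:k]         # largest cosine similarities
        votes = [y_train[i] for i in idx]
        return max(set(votes), key=votes.count)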
Q4-Q6
• Rank features by chi-square scores
• Remove non-relevant features from the vector files
• Run kNN using the newly processed data
• Compare the results with and without feature selection.
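Putting Q4-Q6 together, a sketch of the selection step (the dict-based vector format and the chi2_score function are placeholders, not the Hw4 file spec):

    def select_top_features(chi2_score, vectors, n_kept):
        """Rank features by chi-square score, then drop the rest from each vector.
        vectors: list of {feature_name: value} dicts (a hypothetical format)."""
        all_feats = {f for vec in vectors for f in vec}
        kept = set(sorted(all_feats, key=chi2_score, reverse=True)[:n_kept])
        return [{f: v for f, v in vec.items() if f in kept} for vec in vectors]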