Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution
Presented by Jingting Zeng, 11/26/2007
Outline
Introduction to Feature Selection
Feature Selection Models
Fast Correlation-Based Filter (FCBF) Algorithm
Experiment
Discussion
Reference
Introduction to Feature Selection

Definition: a process that chooses an optimal subset of features according to an objective function.

Objectives:
To reduce dimensionality and remove noise
To improve mining performance: speed of learning, predictive accuracy, and simplicity and comprehensibility of the mined results
An Example of an Optimal Subset

Data set (whole set): five Boolean features, with
C = F1 ∨ F2, F3 = ¬F2, F5 = ¬F4
Optimal subset: {F1, F2} or {F1, F3}
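The optimality claim can be checked by brute force. The sketch below (not from the slides; all names are illustrative) enumerates the data set and tests whether a candidate subset uniquely determines C:

```python
from itertools import product

# The slide's data set: five Boolean features with C = F1 or F2,
# F3 = not F2, and F5 = not F4. Enumerate all assignments of the
# free features F1, F2, F4; F3 and F5 are derived.
rows = []
for f1, f2, f4 in product([0, 1], repeat=3):
    f3, f5 = 1 - f2, 1 - f4
    rows.append(((f1, f2, f3, f4, f5), f1 | f2))

def determines(idx, rows):
    """True if projecting onto the features in idx uniquely determines C."""
    seen = {}
    for feats, c in rows:
        key = tuple(feats[i] for i in idx)
        if seen.setdefault(key, c) != c:
            return False
    return True

print(determines((0, 1), rows))  # True: {F1, F2} determines C
print(determines((0, 2), rows))  # True: {F1, F3} determines C, since F3 = not F2
print(determines((0,), rows))    # False: F1 alone does not
```

F3 and F5 add no information beyond F2 and F4, and F4, F5 are irrelevant to C, which is why a two-feature subset suffices.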
Models of Feature Selection
Filter model
Separates feature selection from classifier learning
Relies on general characteristics of the data (information, distance, dependence, consistency)
No bias toward any learning algorithm; fast

Wrapper model
Relies on a predetermined classification algorithm
Uses predictive accuracy as the goodness measure
High accuracy, but computationally expensive
Filter Model
Wrapper Model
Two Aspects of Feature Selection

How to decide whether a feature is relevant to the class
How to decide whether a relevant feature is redundant with respect to the other features
Linear Correlation Coefficient

For a pair of variables (x, y):

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

However, it may not be able to capture non-linear correlations.
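A small numerical illustration (a sketch, not from the slides): Pearson's r detects a linear relation but scores a purely quadratic one as zero.

```python
import math

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [-2, -1, 0, 1, 2]
print(pearson(x, [2 * a + 1 for a in x]))  # ~1.0: perfect linear relation
print(pearson(x, [a * a for a in x]))      # 0.0: the quadratic dependence is missed
```

The second case is exactly the failure mode mentioned above: y = x² is fully determined by x, yet the linear correlation is zero on a symmetric sample.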
Information Measures

Entropy of variable X:
H(X) = -\sum_i P(x_i) \log_2 P(x_i)

Entropy of X after observing Y:
H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)

Information gain:
IG(X|Y) = H(X) - H(X|Y)

Symmetrical uncertainty:
SU(X, Y) = 2 \left[ \frac{IG(X|Y)}{H(X) + H(Y)} \right]
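These measures are straightforward to compute for discrete variables. A minimal sketch (illustrative, not from the slides) implements H, H(X|Y), IG, and SU from the definitions above:

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum_i P(x_i) log2 P(x_i), estimated from a sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    """H(X|Y) = sum_j P(y_j) * H(X | Y = y_j)."""
    n = len(xs)
    return sum((len(sub) / n) * entropy(sub)
               for y in set(ys)
               for sub in [[x for x, yy in zip(xs, ys) if yy == y]])

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0  # both variables constant; a convention for this sketch
    gain = hx - cond_entropy(xs, ys)  # information gain IG(X|Y)
    return 2.0 * gain / (hx + hy)

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, x))             # 1.0: identical variables
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # 0.0: no shared information
```

Unlike information gain alone, SU compensates for the bias toward features with more values and normalizes the score to [0, 1].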
Fast Correlation-Based Filter (FCBF) Algorithm

How to decide whether a feature is relevant to the class C: find a subset S' \subseteq S such that

\forall f_i \in S', \; 1 \le i \le N, \quad SU_{i,c} \ge \delta

where \delta is a predefined relevance threshold.

How to decide whether such a relevant feature is redundant: use the correlation of features with the class as a reference.
Definitions

Predominant correlation: the correlation between a feature f_i and the class C is predominant iff SU_{i,c} \ge \delta and, for every f_j \in S' (j \ne i), there exists no f_j such that SU_{j,i} \ge SU_{i,c}.

Redundant peer (RP): if SU_{j,i} \ge SU_{i,c}, then f_j is a RP of f_i. Use S_{P_i} to denote the set of RPs of f_i.
[Figure: a feature f_i, the class C, and the redundant-peer set S_{P_i} split into S_{P_i}^+ and S_{P_i}^-]
Three Heuristics

1. If S_{P_i}^+ = \emptyset, treat f_i as a predominant feature, remove all features in S_{P_i}^-, and skip identifying redundant peers for them.
2. If S_{P_i}^+ \ne \emptyset, process all the features in S_{P_i}^+ first. If none of them becomes predominant, follow the first heuristic.
3. The feature with the largest SU_{i,c} value is always a predominant feature and can be a starting point for removing other features.

(Here S_{P_i}^+ and S_{P_i}^- are the redundant peers of f_i whose correlation with the class is, respectively, higher and lower than SU_{i,c}.)
FCBF Algorithm

Part 1 (relevance analysis) has time complexity O(N) in the number of features N.
FCBF Algorithm (cont.)

Part 2 (redundancy analysis) has time complexity O(N log N) in the number of features N.
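Putting the two parts together, the following is a compact sketch of the algorithm: relevance filtering by SU against the class, then redundancy elimination walking the surviving features in decreasing SU order. This is an illustrative reimplementation, not the authors' code; the feature names and toy data are invented.

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetrical uncertainty SU(X, Y) = 2*IG(X|Y) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0
    n = len(xs)
    h_x_given_y = sum((len(sub) / n) * entropy(sub)
                      for y in set(ys)
                      for sub in [[x for x, yy in zip(xs, ys) if yy == y]])
    return 2.0 * (hx - h_x_given_y) / (hx + hy)

def fcbf(features, labels, delta=0.0):
    """features: dict name -> list of discrete values; labels: class values."""
    # Part 1: keep features relevant to the class (SU above delta),
    # ranked by decreasing SU with the class.
    relevance = {f: su(v, labels) for f, v in features.items()}
    ranked = [f for f in sorted(relevance, key=relevance.get, reverse=True)
              if relevance[f] > delta]
    # Part 2: each kept feature f_i removes the later features f_j whose
    # correlation with f_i is at least their correlation with the class.
    selected = []
    while ranked:
        fi = ranked.pop(0)
        selected.append(fi)
        ranked = [fj for fj in ranked
                  if su(features[fj], features[fi]) < relevance[fj]]
    return selected

c  = [0, 0, 1, 1, 1, 0]
f1 = [0, 1, 1, 1, 0, 0]   # weakly relevant
f2 = [0, 0, 1, 1, 1, 0]   # identical to the class
f3 = [1, 1, 0, 0, 0, 1]   # = not F2, redundant
# F2 predicts the class perfectly, so both F3 (its complement) and the
# weaker F1 are pruned as redundant peers of F2.
print(fcbf({"F1": f1, "F2": f2, "F3": f3}, c))  # ['F2']
```

Because each selected feature can eliminate many peers in one pass, the redundancy stage typically examines far fewer than N² pairs, which is the source of FCBF's speed.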
Experiments
FCBF is compared with ReliefF, CorrSF, and ConsSF.

[Table: summary of the 10 data sets]
Results
Results (cont.)
Pros and Cons

Advantages:
Very fast
Selects fewer features with higher accuracy

Disadvantage:
Cannot detect some relevant features. For example, on data with 4 features generated by 4 Gaussian functions, with 4 additional redundant features added, FCBF selected only 3 features.
Discussion

FCBF compares only individual features with each other.
A possible extension: first use PCA to capture groups of correlated features, then apply FCBF to the result.
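The suggested combination could look like the following sketch (hypothetical; the data, noise level, and number of retained components are invented, and PCA is done via numpy's SVD). PCA collapses each group of correlated features into one component, and the component scores would then be discretized and fed to FCBF:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples, 6 features; columns 3-5 are near-copies of
# columns 0-2, so the correlated features form three groups.
base = rng.normal(size=(100, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(100, 3))])

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)

# Each group collapses onto one of the first three components; these
# component scores are what would be passed on to FCBF.
scores = Xc @ Vt[:3].T
print(scores.shape)                # (100, 3)
print(explained[:3].sum() > 0.99)  # True: three components capture the groups
```

The design idea is that PCA handles groups of linearly correlated features, which FCBF's pairwise comparisons cannot see, before FCBF removes redundancy among the resulting components.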
Reference
L. Yu and H. Liu. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.
J. Biesiada and W. Duch. Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter Solution. In Proc. CORES'05, Advances in Soft Computing, Springer, pages 95-104, 2005.
www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf
www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt
Thank you!
Q and A