Feature selection methods: from correlation to causality
Isabelle Guyon [email protected]
NIPS 2008 workshop on kernel learning
Acknowledgements and references

1) Feature Extraction: Foundations and Applications. I. Guyon, S. Gunn, et al. Springer, 2006. http://clopinet.com/fextract-book
2) Causal feature selection. I. Guyon, C. Aliferis, A. Elisseeff. To appear in "Computational Methods of Feature Selection", Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press, 2007. http://clopinet.com/causality
http://clopinet.com/causality
Constantin Aliferis Alexander Statnikov
André Elisseeff Jean-Philippe Pellet
Gregory F. Cooper Peter Spirtes
Introduction
Feature Selection
• Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.

[Figure: m × n data matrix X reduced to a subset of n' selected features.]
Applications

[Figure: application domains plotted by number of examples versus number of variables/features, each axis spanning roughly 10 to 10^6: OCR/HWR, market analysis, text categorization, system diagnosis, bioinformatics, quality control, machine vision, customer knowledge.]
Nomenclature
• Univariate method: considers one variable (feature) at a time.
• Multivariate method: considers subsets of variables (features) together.
• Filter method: ranks features or feature subsets independently of the predictor (classifier).
• Wrapper method: uses a classifier to assess features or feature subsets.
Univariate Filter Methods
Univariate feature ranking

• Normally distributed classes, equal variance σ² unknown; estimated from data as σ²within.
• Null hypothesis H0: μ+ = μ−
• T statistic: if H0 is true,

  t = (μ+ − μ−) / (σwithin √(1/m+ + 1/m−))  ~  Student with m+ + m− − 2 d.f.

[Figure: class-conditional densities P(Xi|Y=−1) and P(Xi|Y=1) along feature xi, centered at μ− and μ+.]
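The per-feature t-statistic above can be computed directly; a minimal NumPy sketch (variable names are mine, not from the slides) ranks a relevant feature above an irrelevant one:

```python
import numpy as np

def t_statistic(x, y):
    """Two-sample t-statistic for one feature x against binary labels y in {+1, -1}."""
    xp, xm = x[y == 1], x[y == -1]
    mp, mm = len(xp), len(xm)
    # pooled within-class standard deviation (sigma_within on the slide)
    s_within = np.sqrt(((mp - 1) * xp.var(ddof=1) + (mm - 1) * xm.var(ddof=1))
                       / (mp + mm - 2))
    return (xp.mean() - xm.mean()) / (s_within * np.sqrt(1 / mp + 1 / mm))

rng = np.random.default_rng(0)
y = np.repeat([1, -1], 50)
relevant = y + 0.5 * rng.standard_normal(100)   # class means differ
irrelevant = rng.standard_normal(100)           # same distribution in both classes
assert abs(t_statistic(relevant, y)) > abs(t_statistic(irrelevant, y))
```

Ranking features by |t| gives the univariate filter of this slide.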
Statistical tests (chap. 2) (Guyon, Dreyfus, 2006)

• H0: X and Y are independent.
• Relevance index → test statistic.
• P-value → false positive rate FPR = nfp / nirr
• Multiple testing problem: use the Bonferroni correction, pval → n · pval
• False discovery rate: FDR = nfp / nsc ≤ FPR · n / nsc
• Probe method: FPR ≈ nsp / np

[Figure: null distribution of the relevance index; the p-value is the tail area beyond the observed value r0.]
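A small simulation shows why the Bonferroni correction matters when testing many features; this sketch assumes SciPy is available (the setup and names are illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, n = 100, 1000                      # 100 examples, 1000 candidate features
y = np.repeat([1, -1], 50)
X = rng.standard_normal((m, n))       # all features irrelevant under H0
X[:, 0] += y                          # make feature 0 genuinely relevant

# one t-test p-value per feature
pvals = np.array([stats.ttest_ind(X[y == 1, j], X[y == -1, j]).pvalue
                  for j in range(n)])

alpha = 0.05
naive = np.flatnonzero(pvals < alpha)           # ~5% of the 999 nulls slip through
bonferroni = np.flatnonzero(pvals < alpha / n)  # corrected threshold: pval -> n * pval
assert 0 in bonferroni and len(bonferroni) < len(naive)
```

The uncorrected threshold admits dozens of false positives; the Bonferroni threshold keeps the truly relevant feature while rejecting nearly all of them.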
Univariate Dependence

• Independence: P(X, Y) = P(X) P(Y)
• Measure of dependence (mutual information):

  MI(X, Y) = ∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY
           = KL( P(X, Y) || P(X) P(Y) )
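For discrete variables the integral becomes a sum, and MI can be estimated from empirical frequencies; a minimal sketch (names are mine):

```python
import numpy as np

def mutual_information(x, y):
    """MI of two discrete variables, i.e. KL(P(X,Y) || P(X)P(Y)), in nats."""
    xs, ys = np.unique(x), np.unique(y)
    joint = np.array([[np.mean((x == a) & (y == b)) for b in ys] for a in xs])
    px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)
    nz = joint > 0                          # 0 log 0 = 0 convention
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 10000)
dependent = y ^ (rng.random(10000) < 0.1)   # copy of y with 10% of bits flipped
independent = rng.integers(0, 2, 10000)
assert mutual_information(dependent, y) > mutual_information(independent, y)
```

MI is zero exactly when the empirical joint factorizes, matching the independence criterion above.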
Other criteria (chap. 3) (Wlodzislaw Duch, 2006)

The choice of a feature selection ranking method depends on the nature of:
• the variables and the target (binary, categorical, continuous)
• the problem (dependencies between variables, linear/non-linear relationships between variables and target)
• the available data (number of examples and number of variables, noise in data)
• the available tabulated statistics.
Multivariate Methods
Univariate selection may fail
Guyon-Elisseeff, JMLR 2004; Springer 2006
Filters, Wrappers, and Embedded methods

• Filter: all features → filter → feature subset → predictor.
• Wrapper: all features → search over multiple feature subsets, each assessed by training the predictor.
• Embedded method: all features → predictor whose training itself selects the feature subset.
Relief (Kira and Rendell, 1992)

For each example, find its nearest hit (closest example of the same class) and its nearest miss (closest example of the other class). A feature's relevance is the average ratio of miss distance to hit distance:

  Relief = < Dmiss / Dhit >
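A compact sketch of this idea, accumulating per-feature hit and miss distances over all examples and taking their ratio (an approximation of the slide's <Dmiss/Dhit>; names are mine):

```python
import numpy as np

def relief_scores(X, y):
    """Relief-style relevance per feature: summed distance to nearest misses
    over summed distance to nearest hits (larger = more relevant)."""
    m, n = X.shape
    d_hit = np.zeros(n)
    d_miss = np.zeros(n)
    for i in range(m):
        dists = np.abs(X - X[i]).sum(axis=1)   # L1 distances to every example
        dists[i] = np.inf                      # exclude the example itself
        hit = np.argmin(np.where(y == y[i], dists, np.inf))   # nearest same class
        miss = np.argmin(np.where(y != y[i], dists, np.inf))  # nearest other class
        d_hit += np.abs(X[i] - X[hit])
        d_miss += np.abs(X[i] - X[miss])
    return d_miss / (d_hit + 1e-12)

rng = np.random.default_rng(3)
y = np.repeat([0, 1], 100)
X = rng.standard_normal((200, 3))
X[:, 0] += 3 * y                       # only feature 0 separates the classes
scores = relief_scores(X, y)
assert np.argmax(scores) == 0
```

Unlike univariate filters, the neighbor search uses all features at once, which is what lets Relief detect features that matter jointly.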
Wrappers for feature selection
N features, 2^N possible feature subsets!
Kohavi-John, 1997
Search Strategies (chap. 4) (Juha Reunanen, 2006)

• Exhaustive search.
• Stochastic search (simulated annealing, genetic algorithms).
• Beam search: keep the k best paths at each step.
• Greedy search: forward selection or backward elimination.
• Floating search: alternate forward and backward strategies.
Forward Selection (wrapper)

[Figure: starting from the empty set, features are added one at a time; n candidates are evaluated at the first step, then n−1, n−2, …, 1.]

Also referred to as SFS: Sequential Forward Selection.
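A minimal sketch of SFS wrapped around a least-squares linear classifier (the wrapped learner and all names are illustrative choices, not from the slides): at each step it adds whichever remaining feature most lowers the validation error.

```python
import numpy as np

def val_error(X_tr, y_tr, X_va, y_va):
    """Validation 0/1 error of a least-squares linear predictor (the wrapped learner)."""
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return np.mean(np.sign(X_va @ w) != y_va)

def forward_selection(X, y, X_va, y_va, k):
    """SFS: greedily add the feature whose inclusion yields the lowest
    validation error, until k features are selected."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: val_error(
            X[:, selected + [j]], y, X_va[:, selected + [j]], y_va))
        selected.append(best)
    return selected

rng = np.random.default_rng(4)
y = np.tile([1.0, -1.0], 100)
X = rng.standard_normal((200, 10))
X[:, 3] += y                           # only feature 3 is informative
sel = forward_selection(X[:150], y[:150], X[150:], y[150:], k=2)
assert sel[0] == 3                     # the informative feature is picked first
```

Each step trains one model per remaining feature, so the wrapper evaluates at most n + (n−1) + … subsets, as the search diagram shows.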
Forward Selection (embedded)

[Figure: the same nested path from the empty set (n, n−1, n−2, …, 1 candidates), followed without backtracking.]

Guided search: we do not consider alternative paths. Typical examples: Gram-Schmidt orthogonalization and tree classifiers.
Backward Elimination (wrapper)

[Figure: starting from all n features, features are removed one at a time: n, n−1, n−2, …, 1 remain.]

Also referred to as SBS: Sequential Backward Selection.
Backward Elimination (embedded)

[Figure: the same nested path from the full feature set, followed without backtracking.]

Guided search: we do not consider alternative paths. Typical example: "recursive feature elimination" RFE-SVM.
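The RFE recipe can be sketched with a least-squares model standing in for the SVM (RFE-SVM ranks features by the SVM weight magnitudes; everything here is an illustrative stand-in): train, drop the feature with the smallest absolute weight, retrain on the rest.

```python
import numpy as np

def rfe(X, y, k):
    """Recursive feature elimination sketch: fit a linear least-squares model,
    remove the feature with the smallest |weight|, repeat until k remain.
    (RFE-SVM uses an SVM's weight vector in place of least squares.)"""
    active = list(range(X.shape[1]))
    while len(active) > k:
        w, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        active.pop(int(np.argmin(np.abs(w))))   # eliminate the weakest feature
    return active

rng = np.random.default_rng(5)
y = np.tile([1.0, -1.0], 100)
X = rng.standard_normal((200, 8))
X[:, 2] += y
X[:, 5] += y                           # features 2 and 5 carry the signal
assert sorted(rfe(X, y, 2)) == [2, 5]
```

Because the model is retrained after every elimination, the weights are re-estimated on the surviving subset, which is what distinguishes RFE from a one-shot ranking.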
Scaling Factors

Idea: transform a discrete space into a continuous space.

• Discrete indicators of feature presence: σi ∈ {0, 1}
• Continuous scaling factors: σi ∈ ℝ

  σ = [σ1, σ2, σ3, σ4]

Now we can do gradient descent!
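A toy illustration of that gradient descent, assuming a squared loss on a linear model with inputs rescaled by σ and an L1 penalty pushing useless scaling factors toward zero (the loss, penalty, and hyperparameters are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
y = np.tile([1.0, -1.0], 100)
X = rng.standard_normal((200, 4))
X[:, 0] += y                          # only feature 0 matters

w = np.zeros(4)                       # model weights
sigma = np.full(4, 0.5)               # continuous scaling factors, one per feature
lr = 0.05
for _ in range(500):
    pred = (X * sigma) @ w
    g = pred - y                                          # d(0.5 * squared error)/d(pred)
    grad_w = (X * sigma).T @ g / len(y)
    grad_sigma = (X * w).T @ g / len(y) + 0.1 * np.sign(sigma)  # L1 shrinks useless sigmas
    w -= lr * grad_w
    sigma = np.clip(sigma - lr * grad_sigma, 0.0, None)   # keep scaling factors >= 0

# the surviving scaled weight belongs to the informative feature
assert np.argmax(np.abs(sigma * w)) == 0
```

The scaling factors of the irrelevant features are driven to zero, recovering a discrete selection from a purely continuous optimization.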
Formalism (chap. 5) (Lal, Chapelle, Weston, Elisseeff, 2006)

• Many learning algorithms are cast as the minimization of a regularized functional: an empirical error term plus a regularization term (capacity control).

This formalism justifies RFE and many other embedded methods.
Embedded method

• Embedded methods are a good inspiration for designing new feature selection techniques for your own algorithms:
  – Find a functional that represents your prior knowledge about what a good model is.
  – Add the scaling factors σ into the functional, and make sure it is either differentiable or that you can perform a sensitivity analysis efficiently.
  – Optimize alternately with respect to the model parameters α and the scaling factors σ.
  – Use early stopping (validation set) or your own stopping criterion to stop, and select the subset of features.
• Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass problems, regression, etc.
Causality
What can go wrong?

Guyon-Aliferis-Elisseeff, 2007

[Figure: scatter plot of X2 versus X1.]
What can go wrong?

[Figure: scatter plots of X2 versus X1 together with the individual distributions of X1 and X2.]
What can go wrong?

Guyon-Aliferis-Elisseeff, 2007

[Figure: causal graph relating X1, X2, and the target Y.]
Local causal graph

[Figure: causal graph around Lung Cancer, with nodes Smoking, Genetics, Coughing, Attention Disorder, Allergy, Anxiety, Peer Pressure, Yellow Fingers, Car Accident, Born an Even Day, and Fatigue.]
What works and why?
Bilevel optimization

Split the data (M samples × N variables/features) into three sets: training (m1 examples), validation (m2), and test (m3).

1) For each feature subset, train the predictor on the training data.
2) Select the feature subset that performs best on the validation data. Repeat and average if you want to reduce variance (cross-validation).
3) Test on the test data.
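The three-way split can be sketched end to end; the candidate subsets, data, and helper names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.tile([1.0, -1.0], 150)                  # M = 300 samples
X = rng.standard_normal((300, 6))              # N = 6 variables/features
X[:, 1] += y
X[:, 4] += y                                   # the informative subset is {1, 4}

# Split into training (m1), validation (m2), and test (m3) sets.
tr, va, te = slice(0, 100), slice(100, 200), slice(200, 300)

def subset_error(subset, fit, hold):
    """Train a least-squares predictor on one split, report 0/1 error on another."""
    w, *_ = np.linalg.lstsq(X[fit][:, subset], y[fit], rcond=None)
    return np.mean(np.sign(X[hold][:, subset] @ w) != y[hold])

# 1) train on the training set for each candidate subset;
# 2) keep the subset with the lowest validation error;
candidates = [[0], [1], [1, 4], [0, 2, 3], [1, 4, 5]]
best = min(candidates, key=lambda s: subset_error(s, tr, va))
# 3) quote the final performance on the untouched test set.
test_error = subset_error(best, tr, te)
assert 1 in best and 4 in best
```

The key discipline is that the test split never influences either level of the optimization, so test_error is an unbiased estimate of generalization.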
Complexity of Feature Selection

| Method | Number of subsets tried | Complexity C |
| --- | --- | --- |
| Exhaustive search (wrapper) | 2^N | N |
| Nested subsets / feature ranking | N(N+1)/2 or N | log N |

With high probability:

  Generalization_error ≤ Validation_error + ε(C/m2)

m2: number of validation examples, N: total number of features, n: feature subset size.

[Figure: error as a function of feature subset size n.]

Try to keep C of the order of m2.
Insensitivity to irrelevant features

Simple univariate predictive model, binary target and features; all relevant features correlate perfectly with the target, all irrelevant features randomly drawn. With 98% confidence, abs(feat_weight) < w and |Σi wi xi| < v, where:

• ng: number of "good" (relevant) features
• nb: number of "bad" (irrelevant) features
• m: number of training examples.
Conclusion
• Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.
• Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice.
• Taking a closer look at the type of dependencies, in terms of causal relationships, may help refine the notion of variable relevance.
• Feature selection without causal discovery may be more harmful than useful.
• Causality can help machine learning, but machine learning can also help causality.