a framework to adjust dependency measure estimates for chance
TRANSCRIPT
Motivation Adjustment for Quantification Adjustment for Ranking Conclusions
SDM 2016 – May 6th 2016
A Framework to Adjust Dependency Measure Estimates for Chance
Simone Romano
@ialuronico
Nguyen Xuan Vinh James Bailey Karin Verspoor
(We won the Best Paper Award!)
Department of Computing and Information Systems, The University of Melbourne, Victoria, Australia
I will soon start working as an applied scientist in London, UK
Simone Romano University of Melbourne
A Framework to Adjust Dependency Measure Estimates for Chance
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Dependency Measures
A dependency measure D is used to assess the amount of dependency between variables:
Example 1: after collecting weight and height for many people, we can compute D(weight, height).
Example 2: assess the amount of dependency between search queries in Google (https://www.google.com/trends/correlate/).
Dependency measures are fundamental for a number of applications in machine learning and data mining.
Applications of Dependency Measures
Supervised learning
- Feature selection [Guyon and Elisseeff, 2003];
- Decision tree induction [Criminisi et al., 2012];
- Evaluation of classification accuracy [Witten et al., 2011].
Unsupervised learning
- External clustering validation [Strehl and Ghosh, 2003];
- Generation of alternative or multi-view clusterings [Muller et al., 2013, Dang and Bailey, 2015];
- Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].
Exploratory analysis
- Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
- Analysis of neural time-series data [Cohen, 2014].
Motivation for Adjustment for Quantification
Pearson's correlation between two variables X and Y estimated on a data sample Sn = {(xk, yk)} of n data points:
$$r(S_n \mid X, Y) \triangleq \frac{\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n}(x_k - \bar{x})^2 \, \sum_{k=1}^{n}(y_k - \bar{y})^2}} \qquad (1)$$
Figure: Example scatter plots and their Pearson correlation values. From https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
r2(Sn|X, Y) can be used as a proxy for the amount of noise in linear relationships:
- 1 if noiseless;
- 0 if complete noise.
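As a quick numerical illustration (a minimal NumPy sketch of Eq. (1), not code from the paper), r2 is near 1 for a noiseless line, but under pure noise it is near 0 without being exactly 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(x, y):
    # Squared Pearson correlation, i.e. Eq. (1) squared.
    xc, yc = x - x.mean(), y - y.mean()
    return ((xc @ yc) ** 2) / ((xc @ xc) * (yc @ yc))

n = 1000
x = rng.normal(size=n)
print(r_squared(x, 2 * x + 1))           # noiseless linear: ~1
print(r_squared(x, rng.normal(size=n)))  # complete noise: small, but not 0
```

That small-but-nonzero value under complete noise is exactly the chance effect the adjustment targets.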
The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has ≈ 570 citations to date according to Google Scholar.
MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:
Figure: From the supplementary material of [Reshef et al., 2011]
MIC should be equal to:
- 1 if the relationship between X and Y is functional and noiseless;
- 0 if there is complete noise.
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:
Figure: Distributions of MIC(S20|X, Y) and MIC(S80|X, Y) under complete noise.
Values can be high just because of chance! The user expects values close to 0 in both cases.
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Chance
- We define a framework for adjustment:
Adjustment for Quantification
$$A_D \triangleq \frac{D - E[D_0]}{\max D - E[D_0]}$$
- It uses the distribution D0 of the measure under independent variables:
  - for r2, the null distribution is a Beta distribution;
  - for MIC, the null distribution MIC0 can be computed using Monte Carlo permutations.
This adjustment is already used in κ-statistics. Its application is beneficial to other dependency measures as well:
- Adjusted r2 ⇒ Ar2
- Adjusted MIC ⇒ AMIC
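A minimal sketch of the adjustment (our own illustrative code, not from the paper, using r2 as the plug-in measure D): E[D0] is estimated by Monte Carlo permutations that destroy any dependence between X and Y.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

def adjusted_for_quantification(D, x, y, n_perm=500, d_max=1.0):
    # A_D = (D - E[D0]) / (max D - E[D0]); E[D0] is estimated from
    # permutations of y, which simulate independent variables.
    null = [D(x, rng.permutation(y)) for _ in range(n_perm)]
    e_d0 = float(np.mean(null))
    return (D(x, y) - e_d0) / (d_max - e_d0)

x = rng.normal(size=20)
print(adjusted_for_quantification(r2, x, 2 * x))                # ~1: noiseless
print(adjusted_for_quantification(r2, x, rng.normal(size=20)))  # ~0 on average
```

The same recipe yields AMIC by plugging in a MIC estimator for D.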
Adjusted measures enable better interpretability.
Task: obtain 1 for a noiseless relationship, and 0 for complete noise (on average).
Noise 0%: r2 = 1, Ar2 = 1
Noise 20%: r2 = 0.66, Ar2 = 0.65
Noise 40%: r2 = 0.39, Ar2 = 0.37
Noise 60%: r2 = 0.2, Ar2 = 0.17
Noise 80%: r2 = 0.073, Ar2 = 0.044
Noise 100%: r2 = 0.035, Ar2 = 0.00046
Figure: Ar2 becomes zero on average at 100% noise: r2 = 0.035 vs Ar2 = 0.00046.
Noise 0%: MIC = 1, AMIC = 1
Noise 20%: MIC = 0.7, AMIC = 0.6
Noise 40%: MIC = 0.47, AMIC = 0.29
Noise 60%: MIC = 0.34, AMIC = 0.11
Noise 80%: MIC = 0.27, AMIC = 0.021
Noise 100%: MIC = 0.26, AMIC = 0.0014
Figure: AMIC becomes zero on average at 100% noise: MIC = 0.26 vs AMIC = 0.0014.
Not biased towards small sample size n.
Average value of D for different noise percentages ⇒ estimates can be high because of chance at small n (e.g. because of missing values).
Figure: Average value vs. noise level (0-100%) of Raw r2 and Ar2 (Adjusted) for n = 10, 20, 30, 40, 100, 200, and of Raw MIC and AMIC (Adjusted) for n = 20, 40, 60, 80. The raw measures are inflated at small n; the adjusted measures are close to zero at 100% noise for all n.
Motivation for Adjustment for Ranking
Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2, defined as follows:
- X1 ≡ patient had breakfast today, X1 = {yes, no};
- X2 ≡ patient eye color, X2 = {green, blue, brown}.
Figure: category proportions for X1 = {yes, no} and X2 = {green, blue, brown}.
Problem: when ranking variables, dependency measures are biased towards the selection of variables with many categories.
This still happens because of finite samples!
Selection bias experiment
Experiment: n = 100 data points, class C with 2 categories:
- Generate a variable X1 with 2 categories (independently from C);
- Generate a variable X2 with 3 categories (independently from C);
- Compute Gini(X1, C) and Gini(X2, C);
- Give a win to the variable that gets the highest value.
REPEAT 10,000 times.
Figure: probability of selection of X1 vs X2 (bar chart).
Result: X2 gets selected 70% of the time. (Bad)
Given that they are equally unpredictive, we expected 50%.
Challenge: adjust the estimated Gini gain to obtain an unbiased ranking.
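The experiment above can be reproduced in a few lines (a sketch with our own helper names and fewer repetitions; the exact setup in the paper may differ slightly):

```python
import numpy as np

rng = np.random.default_rng(0)

def gini_gain(x, c):
    # Gini gain of class labels c given a categorical split variable x.
    def impurity(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)
    gain = impurity(c)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * impurity(c[mask])
    return gain

n, trials, wins_x2 = 100, 2000, 0
for _ in range(trials):
    c = rng.integers(0, 2, n)    # binary class
    x1 = rng.integers(0, 2, n)   # 2 categories, independent of c
    x2 = rng.integers(0, 3, n)   # 3 categories, independent of c
    wins_x2 += gini_gain(x2, c) > gini_gain(x1, c)
print(wins_x2 / trials)  # well above 0.5: X2 is favoured purely by chance
```

The extra category gives X2 more opportunities to fit noise, so its Gini gain tends to be higher even though both variables are useless.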
Adjustment for Ranking
We propose two adjustments for ranking:
Standardization
$$S_D \triangleq \frac{D - E[D_0]}{\sqrt{\operatorname{Var}(D_0)}}$$
Quantifies statistical significance like a p-value
Adjustment for Ranking
$$A_D(\alpha) \triangleq D - q_0(1 - \alpha)$$
Penalizes based on statistical significance, according to α
where q0(1 − α) is the (1 − α)-quantile of the null distribution D0
(smaller α means more penalization)
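Both ranking adjustments can be sketched with the same permutation machinery (an illustrative sketch with our own names, again using r2 as the plug-in measure):

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

def null_distribution(D, x, y, n_perm=500):
    # Monte Carlo estimate of D0: the measure under independence.
    return np.array([D(x, rng.permutation(y)) for _ in range(n_perm)])

def standardized(D, x, y, n_perm=500):
    # S_D = (D - E[D0]) / sqrt(Var(D0)): significance, like a z-score.
    null = null_distribution(D, x, y, n_perm)
    return (D(x, y) - null.mean()) / null.std()

def adjusted_for_ranking(D, x, y, alpha=0.05, n_perm=500):
    # A_D(alpha) = D - q0(1 - alpha): subtract the (1 - alpha)-quantile
    # of D0; smaller alpha penalizes more.
    null = null_distribution(D, x, y, n_perm)
    return D(x, y) - np.quantile(null, 1 - alpha)

x = rng.normal(size=100)
y = x + 0.3 * rng.normal(size=100)     # strong relationship
print(standardized(r2, x, y))          # large: clearly significant
print(adjusted_for_ranking(r2, x, y))  # stays close to the raw r2
```

S_D ranks purely by significance, while A_D(α) keeps the scale of the raw measure and only subtracts a significance-based penalty.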
Standardized Gini (SGini) corrects for Selection bias
Select unpredictive features X1 with 2 categories and X2 with 3 categories.
Figure: probability of selection of X1 vs X2 under SGini (bar chart, both near 50%).
Experiment: X1 and X2 each get selected on average almost 50% of the time.
( Good )
Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007].
Nonetheless: we found that this is a simplistic scenario
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to a constant ≠ 0.
Figure: probability of selection of X1 vs X2 under SGini when both are equally predictive (bar chart, X1 favoured).
Experiment: SGini becomes biased towards X1 because it is more statistically significant.
( Bad )
This behavior has been overlooked in the decision tree community
Use AD(α) to penalize less or even tune the bias!
⇒ AGini(α)
Application to random forest
Why random forest? It is a good classifier to try first when there are "meaningful" features [Fernandez-Delgado et al., 2014].
Plug-in different splitting criteria
Experiment: 19 data sets with categorical variables
Figure: Mean AUC (roughly 90-91.5) of random forests using AGini(α) as a function of α, compared against SGini and Gini, using the same α for all data sets.
And α can be tuned for each data set with cross-validation.
Conclusion - Message
Dependency estimates are high because of chance under finite samples.
Adjustments can help for:
Quantification, to obtain an interpretable value in [0, 1]
Ranking, to avoid biases towards:
- missing values;
- categorical variables with more categories.
Future Work: adjust dependency measures between multiple variables D(X1, ..., Xd), because of bias towards large d.
Thank you.
Questions?
Simone Romano
@ialuronico
Code available online:
https://github.com/ialuronico
References I
Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Proceedings of the Sixth International Conference on Data Mining (ICDM'06), pages 107-118. IEEE.
Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81-227.
Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7-30.
Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90-97.
Fernandez-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133-3181.
References II
Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152-160.
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157-1182.
Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). FILTA: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145-160. Springer.
Muller, E., Gunnemann, S., Farber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518-1524.
References III
Strehl, A. and Ghosh, J. (2003). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583-617.
Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483-501.
Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306-329.
Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.