a framework to adjust dependency measure estimates for chance
TRANSCRIPT
Motivation Adjustment for Quantification Adjustment for Ranking Conclusions
SDM 2016 – May 6th 2016
A Framework to Adjust Dependency Measure Estimates for Chance
Simone Romano
@ialuronico
Nguyen Xuan Vinh James Bailey Karin Verspoor
(We won the Best Paper Award!)
Department of Computing and Information Systems, The University of Melbourne, Victoria, Australia
I will soon start working as an applied scientist in London, UK
Simone Romano University of Melbourne
A Framework to Adjust Dependency Measure Estimates for Chance
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Dependency Measures
A dependency measure D is used to assess the amount of dependency between variables:
Example 1: after collecting weight and height for many people, we can compute D(weight, height).
Example 2: assess the amount of dependency between search queries in Google (https://www.google.com/trends/correlate/).
Dependency measures are fundamental for a number of applications in machine learning and data mining.
Applications of Dependency Measures
Supervised learning
- Feature selection [Guyon and Elisseeff, 2003];
- Decision tree induction [Criminisi et al., 2012];
- Evaluation of classification accuracy [Witten et al., 2011].
Unsupervised learning
- External clustering validation [Strehl and Ghosh, 2003];
- Generation of alternative or multi-view clusterings [Muller et al., 2013, Dang and Bailey, 2015];
- Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].
Exploratory analysis
- Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
- Analysis of neural time-series data [Cohen, 2014].
Motivation for Adjustment for Quantification
Pearson's correlation between two variables X and Y estimated on a data sample Sn = {(xk, yk)} of n data points:
$$r(S_n \mid X, Y) \triangleq \frac{\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n}(x_k - \bar{x})^2 \, \sum_{k=1}^{n}(y_k - \bar{y})^2}} \qquad (1)$$
Figure: Example scatter plots and their Pearson correlation values. From https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
r2(Sn|X, Y) can be used as a proxy for the amount of noise in linear relationships:
- 1 if noiseless;
- 0 if complete noise.
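As a quick numerical illustration (a minimal NumPy sketch of Eq. (1), not code from the paper), r2 is near 1 for a noiseless line, but under pure noise it is near 0 without being exactly 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(x, y):
    # Squared Pearson correlation, i.e. Eq. (1) squared.
    xc, yc = x - x.mean(), y - y.mean()
    return ((xc @ yc) ** 2) / ((xc @ xc) * (yc @ yc))

n = 1000
x = rng.normal(size=n)
print(r_squared(x, 2 * x + 1))           # noiseless linear: ~1
print(r_squared(x, rng.normal(size=n)))  # complete noise: small, but not 0
```

That small-but-nonzero value under complete noise is exactly the chance effect the adjustment targets.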
The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has ≈ 570 citations to date according to Google Scholar.
MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:
Figure: From the supplementary material of [Reshef et al., 2011]
MIC should be equal to:
- 1 if the relationship between X and Y is functional and noiseless;
- 0 if there is complete noise.
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:
Figure: Distributions of MIC(S20|X, Y) and MIC(S80|X, Y) under complete noise.
Values can be high just because of chance! The user expects values close to 0 in both cases.
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Chance
- We define a framework for adjustment:
Adjustment for Quantification
$$A_D \triangleq \frac{D - E[D_0]}{\max D - E[D_0]}$$
- It uses the distribution D0 of the measure under independent variables:
  - for r2, the null distribution is a Beta distribution;
  - for MIC, the null distribution MIC0 can be computed using Monte Carlo permutations.
This adjustment is already used in κ-statistics. Its application is beneficial to other dependency measures as well:
- Adjusted r2 ⇒ Ar2
- Adjusted MIC ⇒ AMIC
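A minimal sketch of the adjustment (our own illustrative code, not from the paper, using r2 as the plug-in measure D): E[D0] is estimated by Monte Carlo permutations that destroy any dependence between X and Y.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

def adjusted_for_quantification(D, x, y, n_perm=500, d_max=1.0):
    # A_D = (D - E[D0]) / (max D - E[D0]); E[D0] is estimated from
    # permutations of y, which simulate independent variables.
    null = [D(x, rng.permutation(y)) for _ in range(n_perm)]
    e_d0 = float(np.mean(null))
    return (D(x, y) - e_d0) / (d_max - e_d0)

x = rng.normal(size=20)
print(adjusted_for_quantification(r2, x, 2 * x))                # ~1: noiseless
print(adjusted_for_quantification(r2, x, rng.normal(size=20)))  # ~0 on average
```

The same recipe yields AMIC by plugging in a MIC estimator for D.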
Adjusted measures enable better interpretability.
Task: obtain 1 for a noiseless relationship, and 0 for complete noise (on average).
Noise 0%: r2 = 1, Ar2 = 1
Noise 20%: r2 = 0.66, Ar2 = 0.65
Noise 40%: r2 = 0.39, Ar2 = 0.37
Noise 60%: r2 = 0.2, Ar2 = 0.17
Noise 80%: r2 = 0.073, Ar2 = 0.044
Noise 100%: r2 = 0.035, Ar2 = 0.00046
Figure: Ar2 becomes zero on average at 100% noise: r2 = 0.035 vs Ar2 = 0.00046.
Noise 0%: MIC = 1, AMIC = 1
Noise 20%: MIC = 0.7, AMIC = 0.6
Noise 40%: MIC = 0.47, AMIC = 0.29
Noise 60%: MIC = 0.34, AMIC = 0.11
Noise 80%: MIC = 0.27, AMIC = 0.021
Noise 100%: MIC = 0.26, AMIC = 0.0014
Figure: AMIC becomes zero on average at 100% noise: MIC = 0.26 vs AMIC = 0.0014.
Not biased towards small sample size n.
Average value of D for different noise percentages ⇒ estimates can be high because of chance at small n (e.g. because of missing values).
Figure: Average value vs. noise level (0-100%) of Raw r2 and Ar2 (Adjusted) for n = 10, 20, 30, 40, 100, 200, and of Raw MIC and AMIC (Adjusted) for n = 20, 40, 60, 80. The raw measures are inflated at small n; the adjusted measures are close to zero at 100% noise for all n.
Motivation for Adjustment for Ranking
Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2, defined as follows:
- X1 ≡ patient had breakfast today, X1 = {yes, no};
- X2 ≡ patient eye color, X2 = {green, blue, brown}.
Figure: category proportions for X1 = {yes, no} and X2 = {green, blue, brown}.
Problem: when ranking variables, dependency measures are biased towards the selection of variables with many categories.
This still happens because of finite samples!
Selection bias experiment
Experiment: n = 100 data points, class C with 2 categories:
- Generate a variable X1 with 2 categories (independently from C);
- Generate a variable X2 with 3 categories (independently from C);
- Compute Gini(X1, C) and Gini(X2, C);
- Give a win to the variable that gets the highest value.
REPEAT 10,000 times.
Figure: probability of selection of X1 vs X2 (bar chart).
Result: X2 gets selected 70% of the time. (Bad)
Given that they are equally unpredictive, we expected 50%.
Challenge: adjust the estimated Gini gain to obtain an unbiased ranking.
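The experiment above can be reproduced in a few lines (a sketch with our own helper names and fewer repetitions; the exact setup in the paper may differ slightly):

```python
import numpy as np

rng = np.random.default_rng(0)

def gini_gain(x, c):
    # Gini gain of class labels c given a categorical split variable x.
    def impurity(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)
    gain = impurity(c)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * impurity(c[mask])
    return gain

n, trials, wins_x2 = 100, 2000, 0
for _ in range(trials):
    c = rng.integers(0, 2, n)    # binary class
    x1 = rng.integers(0, 2, n)   # 2 categories, independent of c
    x2 = rng.integers(0, 3, n)   # 3 categories, independent of c
    wins_x2 += gini_gain(x2, c) > gini_gain(x1, c)
print(wins_x2 / trials)  # well above 0.5: X2 is favoured purely by chance
```

The extra category gives X2 more opportunities to fit noise, so its Gini gain tends to be higher even though both variables are useless.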
Adjustment for Ranking
We propose two adjustments for ranking:
Standardization
$$S_D \triangleq \frac{D - E[D_0]}{\sqrt{\operatorname{Var}(D_0)}}$$
Quantifies statistical significance like a p-value
Adjustment for Ranking
$$A_D(\alpha) \triangleq D - q_0(1 - \alpha)$$
Penalizes based on statistical significance, according to α
where q0(1 − α) is the (1 − α)-quantile of the null distribution D0
(smaller α means more penalization)
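Both ranking adjustments can be sketched with the same permutation machinery (an illustrative sketch with our own names, again using r2 as the plug-in measure):

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

def null_distribution(D, x, y, n_perm=500):
    # Monte Carlo estimate of D0: the measure under independence.
    return np.array([D(x, rng.permutation(y)) for _ in range(n_perm)])

def standardized(D, x, y, n_perm=500):
    # S_D = (D - E[D0]) / sqrt(Var(D0)): significance, like a z-score.
    null = null_distribution(D, x, y, n_perm)
    return (D(x, y) - null.mean()) / null.std()

def adjusted_for_ranking(D, x, y, alpha=0.05, n_perm=500):
    # A_D(alpha) = D - q0(1 - alpha): subtract the (1 - alpha)-quantile
    # of D0; smaller alpha penalizes more.
    null = null_distribution(D, x, y, n_perm)
    return D(x, y) - np.quantile(null, 1 - alpha)

x = rng.normal(size=100)
y = x + 0.3 * rng.normal(size=100)     # strong relationship
print(standardized(r2, x, y))          # large: clearly significant
print(adjusted_for_ranking(r2, x, y))  # stays close to the raw r2
```

S_D ranks purely by significance, while A_D(α) keeps the scale of the raw measure and only subtracts a significance-based penalty.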
Standardized Gini (SGini) corrects for Selection bias
Select unpredictive features X1 with 2 categories and X2 with 3 categories.
Figure: probability of selection of X1 vs X2 under SGini (bar chart, both near 50%).
Experiment: X1 and X2 each get selected on average almost 50% of the time.
( Good )
Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007].
Nonetheless: we found that this is a simplistic scenario
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to a constant ≠ 0.
Figure: probability of selection of X1 vs X2 under SGini when both are equally predictive (bar chart, X1 favoured).
Experiment: SGini becomes biased towards X1 because it is more statistically significant.
( Bad )
This behavior has been overlooked in the decision tree community
Use AD(α) to penalize less or even tune the bias!
⇒ AGini(α)
Application to random forest
Why random forest? It is a good classifier to try first when there are "meaningful" features [Fernandez-Delgado et al., 2014].
Plug-in different splitting criteria
Experiment: 19 data sets with categorical variables
Figure: Mean AUC (roughly 90-91.5) of random forests using AGini(α) as a function of α, compared against SGini and Gini, using the same α for all data sets.
And α can be tuned for each data set with cross-validation.
Conclusion - Message
Dependency estimates are high because of chance under finite samples.
Adjustments can help for:
Quantification, to obtain an interpretable value in [0, 1]
Ranking, to avoid biases towards:
- missing values;
- categorical variables with more categories.
Future Work: adjust dependency measures between multiple variables D(X1, ..., Xd), because of bias towards large d.
Thank you.
Questions?
Simone Romano
@ialuronico
Code available online:
https://github.com/ialuronico
References I
Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Proceedings of the Sixth International Conference on Data Mining (ICDM'06), pages 107-118. IEEE.
Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81-227.
Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7-30.
Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90-97.
Fernandez-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133-3181.
References II
Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152-160.
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157-1182.
Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). FILTA: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145-160. Springer.
Muller, E., Gunnemann, S., Farber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518-1524.
References III
Strehl, A. and Ghosh, J. (2003). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583-617.
Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483-501.
Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306-329.
Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.