Computing variable importance in collinear regression models using Game Theory Giovanni Rabitti Department of Decision Sciences - Bocconi University, Milan Joint work with E. Borgonovo (Bocconi University) and E. Plischke (TU Clausthal)


Post on 05-Jul-2020


Page 1: Computing variable importance in collinear regression ... · The Shapley-Owen values I It has been studied by Owen (1972) and Grabisch and Roubens (1999) and proposed as a sensitivity

Computing variable importance in collinear regression models using Game Theory

Giovanni Rabitti

Department of Decision Sciences - Bocconi University, Milan

Joint work with E. Borgonovo (Bocconi University) and E. Plischke (TU Clausthal)


Motivation

Finding the most important explanatory variables in regression models is not an easy task.

To rank them, could one compare

• the absolute values of the coefficients? NO! (They depend on the units in which each variable is measured.)

• the lowest p-values? NO! (They measure statistical significance, not importance.)

This task is even more complicated in the presence of collinear explanatory variables.

What to do?


The Shapley values

• The Shapley values come from cooperative game theory [Shapley, 1953].

• A coalition of players generates a value, which must be divided so that each member receives a fair share of the value he or she helped to generate.

• Consider a value function $\nu$ defined on every possible coalition $T$. The Shapley value of the $i$-th player is then

\[
\phi^{\nu}_i = \sum_{T \subseteq \{1,2,\dots,n\} \setminus \{i\}} \frac{|T|!\,(n-|T|-1)!}{n!} \bigl(\nu(T \cup \{i\}) - \nu(T)\bigr).
\]

• Very important property (efficiency, assuming $\nu(\emptyset) = 0$): $\sum_{i=1}^{n} \phi^{\nu}_i = \nu(\{1,\dots,n\})$.
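The Shapley formula above can be evaluated directly by enumerating coalitions. The following is a minimal sketch, not the speaker's implementation: the function name `shapley_values` and the toy game `toy_value` (coalition worth equal to the squared coalition size) are assumptions made up for illustration. The cost grows as $2^n$, so this is only practical for a small number of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values of an n-player game by enumerating coalitions.

    `value` maps a frozenset of 0-based player indices to a number.
    """
    players = range(n)
    phi = []
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |T|! (n - |T| - 1)! / n! depends only on the size of T.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                T = frozenset(combo)
                # Marginal contribution of player i to coalition T.
                total += weight * (value(T | {i}) - value(T))
        phi.append(total)
    return phi


# Hypothetical toy game (not from the talk): worth = (coalition size)^2.
def toy_value(coalition):
    return float(len(coalition) ** 2)


phi = shapley_values(3, toy_value)
print(phi, sum(phi))
```

Because the three players of the toy game are interchangeable, each receives the same share, and the shares add up to the worth of the full coalition, which illustrates the efficiency property.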


The Shapley-Owen values

• The Shapley-Owen value has been studied by Owen (1972) and Grabisch and Roubens (1999), and was proposed as a sensitivity analysis tool by Rabitti and Borgonovo (2019).

• It is a method for attributing value to the interactions within a coalition of players in a game.

• The Shapley-Owen value for the players of a coalition $S \subseteq N = \{1,\dots,n\}$ with value function $\nu$ is denoted $\phi^{\nu}_S$ and is defined by

\[
\phi^{\nu}_S = \sum_{T \subseteq N \setminus S} \frac{(n-|T|-|S|)!\,|T|!}{(n-|S|+1)!} \sum_{L \subseteq S} (-1)^{|S|-|L|}\, \nu(L \cup T).
\]

• This index represents the residual interaction value of a coalition of players $\{i_1, i_2, \dots, i_{|S|}\}$.
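The Shapley-Owen definition above translates almost literally into code. This is a brute-force sketch under stated assumptions, not the algorithm proposed in the talk: `shapley_owen`, `subsets` and the toy game `toy_value` are hypothetical names introduced for illustration.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def shapley_owen(n, value, S):
    """Shapley-Owen value of coalition S in an n-player game (0-based players)."""
    S = frozenset(S)
    s = len(S)
    rest = [p for p in range(n) if p not in S]
    total = 0.0
    for T in subsets(rest):
        t = len(T)
        weight = factorial(n - t - s) * factorial(t) / factorial(n - s + 1)
        # Inner alternating sum over all subsets L of S.
        inner = sum((-1) ** (s - len(L)) * value(L | T) for L in subsets(S))
        total += weight * inner
    return total


# Hypothetical toy game (not from the talk): worth = (coalition size)^2.
def toy_value(coalition):
    return float(len(coalition) ** 2)


pair_value = shapley_owen(3, toy_value, {0, 1})
single_value = shapley_owen(3, toy_value, {0})
print(pair_value, single_value)
```

For a singleton $S = \{i\}$ the weights and the inner sum reduce to those of the ordinary Shapley value, which gives a convenient sanity check for any implementation.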


Interpretation: the link between value and interaction

Consider the case $S = \{i, j\}$ without loss of generality. Then

\[
\phi^{\nu}_{i,j} = \sum_{T \subseteq N \setminus \{i,j\}} \frac{(n-|T|-2)!\,|T|!}{(n-1)!}\, I_T(\{i,j\}) \tag{1}
\]

where

\[
I_T(\{i,j\}) = \nu(T \cup \{i,j\}) - \nu(T \cup \{i\}) - \nu(T \cup \{j\}) + \nu(T).
\]

• The sign and the magnitude of $I_T(\{i,j\})$, averaged over all coalitions $T$ of the other players, should give information about the interaction between the two players.

• If the interaction is positive, joining the coalition is profitable: the value of the coalition is greater than the sum of the individual values (synergism).

• If the interaction is negative, the value of the coalition is less than the sum of the individual values (antagonism).
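As a small worked illustration of equation (1), consider a hypothetical two-player game. The numbers are invented for this example and do not come from the talk. With $n = 2$ the only coalition of other players is $T = \emptyset$, so the weight equals 1 and the Shapley-Owen value coincides with $I_{\emptyset}(\{1,2\})$:

```latex
\[
\begin{aligned}
&n = 2,\quad \nu(\emptyset) = 0,\quad \nu(\{1\}) = \nu(\{2\}) = 1,\quad \nu(\{1,2\}) = 3,\\
&\phi^{\nu}_{1,2}
  = \frac{(2-0-2)!\,0!}{(2-1)!}\, I_{\emptyset}(\{1,2\})
  = \nu(\{1,2\}) - \nu(\{1\}) - \nu(\{2\}) + \nu(\emptyset)
  = 3 - 1 - 1 + 0 = 1 > 0 .
\end{aligned}
\]
```

The pair is worth more together (3) than the sum of the individual values (2), so the interaction is positive (synergism). Setting $\nu(\{1,2\}) = 1.5$ instead would give $\phi^{\nu}_{1,2} = -0.5$ (antagonism).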


Shapley values for regression models

Lindeman, Merenda and Gold (1980), Lipovetsky and Conklin (2001) and other authors consider the regression problem

\[
y = \beta_1 x_1 + \dots + \beta_d x_d + \varepsilon
\]

where the explanatory variables are correlated.

These authors consider the Shapley values $\phi^{\mathrm{LMG}}$ with

\[
\nu(T) = R^2_T
\]

where $R^2_T$ is the goodness-of-fit measure of the submodel

\[
y = \beta_{i_1} x_{i_1} + \dots + \beta_{i_{|T|}} x_{i_{|T|}} + \varepsilon.
\]
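The LMG construction above can be sketched by fitting every submodel and feeding its $R^2$ into the Shapley formula. This is an illustrative sketch only: the function names, the simulated data and the inclusion of an intercept in every submodel are assumptions, not details taken from the talk or from any LMG package.

```python
from itertools import combinations
from math import factorial

import numpy as np


def r_squared(X, y, cols):
    """R^2 of the least-squares fit of y on the columns `cols` plus an intercept."""
    if not cols:
        return 0.0  # intercept-only model explains no variance
    A = np.column_stack([np.ones(len(y)), X[:, sorted(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def lmg_shares(X, y):
    """Shapley decomposition of R^2 over the d columns of X."""
    d = X.shape[1]
    # Cache the value function nu(T) = R^2_T for all 2^d submodels.
    r2 = {frozenset(T): r_squared(X, y, T)
          for size in range(d + 1)
          for T in combinations(range(d), size)}
    shares = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            w = factorial(size) * factorial(d - size - 1) / factorial(d)
            for combo in combinations(others, size):
                T = frozenset(combo)
                shares[i] += w * (r2[T | {i}] - r2[T])
    return shares, r2[frozenset(range(d))]


# Simulated, deliberately collinear data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(size=200)

shares, full_r2 = lmg_shares(X, y)
print(shares, shares.sum(), full_r2)
```

By the efficiency property, the shares add up to the $R^2$ of the full model, so they can be read as a decomposition of the explained variance among the regressors.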


A new algorithm

We consider the representation of the Shapley value as

\[
\phi^{\mathrm{MOB}}_i = \sum_{T \ni i} \frac{m(T)}{|T|}
\]

where $m(T) = \sum_{S \subseteq T} (-1)^{|T \setminus S|}\, \nu(S)$ is the Möbius inverse of the value function.

The computational aspects of this representation are investigated in Plischke, Rabitti and Borgonovo (2020).

We apply this approach to the problem

\[
y = \beta_1 x_1 + \dots + \beta_d x_d + \beta_{12} x_1 x_2 + \dots + \beta_{d-1,d}\, x_{d-1} x_d + \varepsilon
\]

with value function $\nu(T) = \sum_{j \in T} \beta_j x_j + \sum_{\{i,j\} \subseteq T} \beta_{ij} x_i x_j$.
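The Möbius representation of the Shapley value given on this slide can be checked numerically on a small game. The sketch below is a naive enumeration, not the algorithm of Plischke, Rabitti and Borgonovo (2020). The helper names and the toy game (worth equal to the squared coalition size) are assumptions for illustration.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def moebius_inverse(n, value):
    """m(T) = sum over S subset of T of (-1)^{|T| - |S|} * value(S)."""
    return {
        T: sum((-1) ** (len(T) - len(S)) * value(S) for S in subsets(T))
        for T in subsets(range(n))
    }


def shapley_from_moebius(n, value):
    """Shapley value of player i as the sum of m(T) / |T| over coalitions T containing i."""
    m = moebius_inverse(n, value)
    return [sum(m[T] / len(T) for T in m if i in T) for i in range(n)]


# Hypothetical toy game (not from the talk): worth = (coalition size)^2.
def toy_value(coalition):
    return float(len(coalition) ** 2)


m = moebius_inverse(3, toy_value)
phi_mob = shapley_from_moebius(3, toy_value)
print(m[frozenset({0, 1})], phi_mob)
```

In this toy game each singleton has Möbius mass 1, each pair has mass 2 and the triple has mass 0, so every player receives $1 + 2 \cdot (2/2) = 3$, the same result as direct enumeration of marginal contributions.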


Test Case 1: Comparison of the algorithms

We consider the function from Grömping (2007, 2009):

\[
y = 4x_1 + x_2 + x_3 + 0.3x_4 + \varepsilon
\]

with $\varepsilon \sim (0, 2)$ and $x \sim N(0, \Sigma)$, where the variances are 1 and the covariances are $\rho^{|i-j|}$ for $i, j = 1, 2, 3, 4$.

The parameter $\rho$ ranges from $-0.9$ to $0.9$ in steps of $0.1$.
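The simulation design of this test case could be set up as follows. This is a sketch under assumptions: the slide does not state the distribution of the noise, so the code assumes it is normal with mean 0 and variance 2. The names `ar1_covariance` and `simulate` are hypothetical, as is the sample size.

```python
import numpy as np


def ar1_covariance(d, rho):
    """Covariance matrix with unit variances and entries rho^{|i-j|}."""
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate(n_samples, rho, rng, noise_var=2.0):
    """Draw (X, y) from y = 4 x1 + x2 + x3 + 0.3 x4 + eps.

    ASSUMPTION: eps is normal with mean 0 and variance `noise_var`.
    """
    sigma = ar1_covariance(4, rho)
    X = rng.multivariate_normal(np.zeros(4), sigma, size=n_samples)
    beta = np.array([4.0, 1.0, 1.0, 0.3])
    eps = rng.normal(0.0, np.sqrt(noise_var), size=n_samples)
    return X, X @ beta + eps


# Grid of correlation parameters: -0.9, -0.8, ..., 0.9 (19 values).
rhos = np.linspace(-0.9, 0.9, 19)

rng = np.random.default_rng(0)
sigma = ar1_covariance(4, -0.9)
X, y = simulate(500, -0.9, rng)
print(sigma[0], X.shape, y.shape)
```

Any importance measure can then be evaluated on `(X, y)` for each value in `rhos` to trace how the ranking changes with the strength of the collinearity.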


Shapley values using the two algorithms


Accuracy of the two algorithms

N = 100 replicates at ρ = −0.9. Difference in the sum of the variances: ∼ 10%.


Shapley-Owen values


Test Case 2: Insights on interactions

We consider the function of Saltelli and Tarantola (2002)

\[
y = x_1 + x_2 + x_3 + a x_2 x_3
\]

to which we have added a small noise $\varepsilon \sim N(0, 0.5)$. The variables are normal with mean 0 and variance 1; $x_1$ and $x_2$ are correlated with correlation $\rho$.

Saltelli and Tarantola (2002) write that when $a \neq 0$ "interaction may be carried over by correlation".

Using Shapley-Owen values, we can quantify this effect.
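Data for this second test case could be drawn as in the sketch below. It rests on two assumptions that the slide leaves open: the second parameter of the noise distribution is read as a variance, and $x_3$ is taken to be independent of $x_1$ and $x_2$. The function name and the chosen values of `rho` and `a` are hypothetical.

```python
import numpy as np


def saltelli_tarantola_sample(n_samples, rho, a, rng, noise_var=0.5):
    """Draw (X, y) from y = x1 + x2 + x3 + a * x2 * x3 + eps.

    ASSUMPTIONS: corr(x1, x2) = rho, x3 independent of x1 and x2,
    eps normal with mean 0 and variance `noise_var`.
    """
    cov = np.array([[1.0, rho, 0.0],
                    [rho, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), cov, size=n_samples)
    eps = rng.normal(0.0, np.sqrt(noise_var), size=n_samples)
    y = X[:, 0] + X[:, 1] + X[:, 2] + a * X[:, 1] * X[:, 2] + eps
    return X, y


rng = np.random.default_rng(1)
X_st, y_st = saltelli_tarantola_sample(20000, rho=0.7, a=1.0, rng=rng)
sample_corr = np.corrcoef(X_st, rowvar=False)
print(np.round(sample_corr, 2))
```

Only $x_2$ and $x_3$ interact in the model itself, so an interaction index between $x_1$ and $x_3$ that departs from zero when `rho` is nonzero would reflect the carry-over effect described by Saltelli and Tarantola.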


Shapley-Owen values for Saltelli-Tarantola function


Application: NAIC expenses dataset

We consider the dataset of the expenses of insurance companies in the US in 2005 (Frees, 2010).

This dataset contains a sample of 384 reports of US insurance companies from the National Association of Insurance Commissioners (NAIC).

The aim is to study the insurance expenses in terms of 13 variables representing business, financial and actuarial indicators of the companies.

The linear model plus interactions fits very well ($R^2 = 0.976$). There are strong correlations among the variables.


Correlations among explanatory variables


Shapley values (with bootstrap confidence intervals)


Shapley-Owen values


Conclusions

• Shapley and Shapley-Owen values are promising tools for analyzing variable importance in the presence of correlations.

• We have proposed a new algorithm to calculate them and compared it with the existing algorithm (which can only deal with Shapley values).

• Further research is needed in this field.