Supervised multivariate discretization and levels merging for logistic regression

Adrien Ehrhardt (1,2), Christophe Biernacki (2), Philippe Heinrich (3,4), Vincent Vandewalle (2,3)

(1) Crédit Agricole Consumer Finance, (2) Inria Lille - Nord-Europe, (3) Université de Lille, (4) CNRS

31/08/2018
Table of Contents
Context and basic notations
Supervised multivariate discretization and factor levels grouping
Interactions in logistic regression
Conclusion and future work
Context and basic notations
Current practice

| Job | Home | Time in job | Family status | Wages | Score | Repayment |
|---|---|---|---|---|---|---|
| Craftsman | Owner | 20 | Widower | 2000 | 225 | 0 |
| ? | Renter | 10 | Common-law | 1700 | 190 | 0 |
| Licensed professional | Starter | 5 | Divorced | 4000 | 218 | 1 |
| Executive | By work | 8 | Single | 2700 | 202 | 1 |
| Office employee | Renter | 12 | Married | 1400 | 205 | 0 |
| Worker | By family | 2 | ? | 1200 | 192 | 0 |

Table: Dataset with outliers and missing values.

The current practice proceeds in four steps:
1. Feature selection
2. Discretization / grouping
3. Interaction screening
4. Logistic regression fitting
Current practice: after step 1 (feature selection), only Job, Family status and Wages are kept.

| Job | Family status | Wages | Score | Repayment |
|---|---|---|---|---|
| Craftsman | Widower | 2000 | 225 | 0 |
| ? | Common-law | 1700 | 190 | 0 |
| Licensed professional | Divorced | 4000 | 218 | 1 |
| Executive | Single | 2700 | 202 | 1 |
| Office employee | Married | 1400 | 205 | 0 |
| Worker | ? | 1200 | 192 | 0 |
Current practice: after step 2a (discretization), the continuous feature Wages is cut into intervals.

| Job | Family status | Wages | Score | Repayment |
|---|---|---|---|---|
| Craftsman | Widower | ]1500;2000] | 225 | 0 |
| ? | Common-law | ]1500;2000] | 190 | 0 |
| Licensed professional | Divorced | ]2000;∞[ | 218 | 1 |
| Executive | Single | ]2000;∞[ | 202 | 1 |
| Office employee | Married | ]-∞;1500] | 205 | 0 |
| Worker | ? | ]-∞;1500] | 192 | 0 |
Current practice: after step 2b (grouping), the levels of Job and Family status are merged, with missing values ("?") assigned to a group.

| Job | Family status | Wages | Score | Repayment |
|---|---|---|---|---|
| ?+Low-qualified | ?+Alone | ]1500;2000] | 225 | 0 |
| ?+Low-qualified | Union | ]1500;2000] | 190 | 0 |
| High-qualified | ?+Alone | ]2000;∞[ | 218 | 1 |
| High-qualified | ?+Alone | ]2000;∞[ | 202 | 1 |
| ?+Low-qualified | Union | ]-∞;1500] | 205 | 0 |
| ?+Low-qualified | ?+Alone | ]-∞;1500] | 192 | 0 |
Current practice: after step 3 (interaction screening), Family status and Wages are crossed into a single interaction feature, ready for step 4, the logistic regression fit. A runnable sketch of steps 2 and 3 follows the table.

| Job | Family status x Wages | Score | Repayment |
|---|---|---|---|
| ?+Low-qualified | ?+Alone x ]1500;2000] | 225 | 0 |
| ?+Low-qualified | Union x ]1500;2000] | 190 | 0 |
| High-qualified | ?+Alone x ]2000;∞[ | 218 | 1 |
| High-qualified | ?+Alone x ]2000;∞[ | 202 | 1 |
| ?+Low-qualified | Union x ]-∞;1500] | 205 | 0 |
| ?+Low-qualified | ?+Alone x ]-∞;1500] | 192 | 0 |
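To make steps 2 and 3 concrete, here is a minimal pandas sketch of the discretization, grouping, and interaction construction shown in the tables above; the cutpoints and groupings are taken from the toy example, while the column names, helper choices, and missing-value policy are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy dataset from the slides; None plays the role of "?".
df = pd.DataFrame({
    "job": ["Craftsman", None, "Licensed professional", "Executive",
            "Office employee", "Worker"],
    "family_status": ["Widower", "Common-law", "Divorced", "Single",
                      "Married", None],
    "wages": [2000, 1700, 4000, 2700, 1400, 1200],
    "repayment": [0, 0, 1, 1, 0, 0],
})

# Step 2a: discretize the continuous feature Wages into the slide's intervals.
df["wages_disc"] = pd.cut(df["wages"], bins=[-np.inf, 1500, 2000, np.inf],
                          labels=["]-inf;1500]", "]1500;2000]", "]2000;inf["])

# Step 2b: merge factor levels; missing values join a group, as on the slide.
df["job_grp"] = df["job"].map({
    "Craftsman": "?+Low-qualified", "Office employee": "?+Low-qualified",
    "Worker": "?+Low-qualified", "Licensed professional": "High-qualified",
    "Executive": "High-qualified"}).fillna("?+Low-qualified")
df["fam_grp"] = df["family_status"].map({
    "Widower": "?+Alone", "Divorced": "?+Alone", "Single": "?+Alone",
    "Common-law": "Union", "Married": "Union"}).fillna("?+Alone")

# Step 3: an interaction is the cross-product of two engineered columns.
df["fam_x_wages"] = df["fam_grp"] + " x " + df["wages_disc"].astype(str)
print(df[["job_grp", "fam_x_wages", "repayment"]])
```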
Mathematical reinterpretation

The whole process can be decomposed into two steps:

$$X \to E \to Y, \qquad x \mapsto e = f(x) \mapsto y$$

Selected features: $x = \big(\underbrace{(x_j)_1^{d_1}}_{\in \mathbb{R}},\ \underbrace{(x_j)_{d_1+1}^{d}}_{\in \{1,\dots,o_j\}}\big)$.

$f$ must be "simple" and "component-wise", i.e. $f = (f_j)_1^d$.

We restrict to discretization and grouping of factor levels.
Mathematical reinterpretation: Feature Engineering

Figure: the real line for $x_j$, split into successive intervals on which $f_j(x_j) = 1, 2, 3$.

Discretization ($1 \le j \le d_1$): into $m$ intervals with associated cutpoints $c = (c_1, \dots, c_{m-1})$.

Discretization function:

$$f_j(\,\cdot\,; c, m) : \mathbb{R} \to \{1, \dots, m\}, \qquad x \mapsto \mathbf{1}_{]-\infty; c_1]}(x) + \sum_{k=1}^{m-2} (k+1)\, \mathbf{1}_{]c_k; c_{k+1}]}(x) + m\, \mathbf{1}_{]c_{m-1}; \infty[}(x)$$
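As a side note, this cutpoint logic is exactly what numpy's digitize implements; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def discretize(x, cutpoints):
    """f_j(.; c, m): map each real value to the index (1..m) of its interval
    ]-inf; c_1], ]c_1; c_2], ..., ]c_{m-1}; +inf[."""
    # right=True makes bins right-closed, matching the ]c_k; c_{k+1}] intervals.
    return np.digitize(x, cutpoints, right=True) + 1  # shift to 1..m

x = np.array([-2.0, 0.3, 0.9, 5.0])
print(discretize(x, cutpoints=[0.0, 1.0]))  # -> [1 2 2 3], i.e. m = 3 intervals
```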
Mathematical reinterpretation: Feature Engineering

Figure: five factor levels $x_j \in \{1, 2, 3, 4, 5\}$ mapped onto two grouped levels $f_j(x_j) \in \{1, 2\}$.

Grouping ($d_1 < j \le d$): grouping $o$ values into $m$, $m \le o$.

Grouping function: $f_j : \{1, \dots, o\} \to \{1, \dots, m\}$, with $f_j$ surjective: it defines a partition of $\{1, \dots, o\}$ into $m$ elements.
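A grouping function, in turn, is just a surjective lookup table; a minimal sketch (the particular mapping is illustrative):

```python
import numpy as np

def group_levels(x, mapping):
    """f_j: {1,...,o} -> {1,...,m}. The mapping must be surjective, i.e.
    every group 1..m is hit by at least one of the o original levels."""
    assert set(mapping.values()) == set(range(1, max(mapping.values()) + 1))
    return np.array([mapping[level] for level in x])

# o = 5 original levels grouped into m = 2, as in the picture above.
print(group_levels([1, 3, 5, 2], {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}))  # -> [1 2 2 1]
```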
Mathematical reinterpretation: Objective

The target feature $y \in \{0, 1\}$ must be predicted given the engineered features $f(x) = (f_j(x_j))_1^d$.

We restrict to binary logistic regression.

On "raw" data, logistic regression yields:

$$\mathrm{logit}(p_{\theta^{\mathrm{raw}}}(1|x)) = \theta_0 + \sum_{j=1}^{d_1} \theta_j x_j + \sum_{j=d_1+1}^{d} \theta_j^{x_j}$$

On discretized / grouped data, logistic regression yields:

$$\mathrm{logit}(p_{\theta^{f}}(1|f(x))) = \theta_0 + \sum_{j=1}^{d} \theta_j^{f_j(x_j)}$$
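The practical difference between the two model forms is visible in scikit-learn: on raw data each continuous feature carries a single slope, while one-hot encoding the discretized feature yields one coefficient per interval. A minimal sketch on synthetic data (sample size, number of bins, and seed are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(1000, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-(2 * x[:, 0] - 1))))  # linear true logit

# Raw fit: a single slope theta_1 for the continuous feature.
raw = LogisticRegression().fit(x, y)
print("raw:", raw.intercept_, raw.coef_)            # coef_ has shape (1, 1)

# Discretized fit: one-hot encoding of f(x) gives one theta per interval.
enc = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
disc = LogisticRegression().fit(enc.fit_transform(x), y)
print("discretized:", disc.intercept_, disc.coef_)  # coef_ has shape (1, 3)
```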
Example

True data:

$$\mathrm{logit}(p^{\mathrm{true}}(1|x)) = \ln\left(\frac{p^{\mathrm{true}}(1|x)}{1 - p^{\mathrm{true}}(1|x)}\right) = \sin((x_1 - 0.7) \times 7)$$

Figure: True relationship between predictor and outcome ($\mathrm{logit}(p(1|x))$ plotted against $x \in [0, 1]$).
Example

Logistic regression on "raw" data:

$$\mathrm{logit}(p_{\theta^{\mathrm{raw}}}(1|x)) = \theta_0 + \theta_1 x_1$$

Figure: Linear logistic regression fit, overlaid on the true distribution.
Example

Logistic regression on discretized data, if $f$ is not carefully chosen...

$$\mathrm{logit}(p_{\theta^{f}}(1|f(x))) = \theta_0 + \underbrace{\theta_1^{f_1(x_1)}}_{\theta_1^1, \dots, \theta_1^{50}}$$

Figure: Bad (high variance) discretization into 50 intervals, overlaid on the true distribution.
Example

Logistic regression on discretized data, if $f$ is carefully chosen...

$$\mathrm{logit}(p_{\theta^{f}}(1|f(x))) = \theta_0 + \underbrace{\theta_1^{f_1(x_1)}}_{\theta_1^1, \theta_1^2, \theta_1^3}$$

Figure: Good (bias/variance tradeoff) discretization into 3 intervals, overlaid on the true distribution.
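This bias/variance effect is easy to reproduce; a sketch that scores a 50-interval and a 3-interval discretization of the slide's sinusoidal example on held-out data (sample size and seeds are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=(5000, 1))
p_true = 1 / (1 + np.exp(-np.sin((x[:, 0] - 0.7) * 7)))  # the slide's true model
y = rng.binomial(1, p_true)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

for m in (50, 3):  # "bad" versus "good" number of intervals
    disc = KBinsDiscretizer(n_bins=m, encode="onehot-dense", strategy="quantile")
    clf = LogisticRegression().fit(disc.fit_transform(x_tr), y_tr)
    p_hat = clf.predict_proba(disc.transform(x_te))[:, 1]
    print(m, "intervals -> test log-loss:", round(log_loss(y_te, p_hat), 4))
```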
Criterion

$\theta$ can be estimated for each discretization $f$, and $f^\star$ can be chosen through our favorite model choice criterion: BIC, AIC, ...

A model selection problem:

$$(f^\star, \theta^\star) = \operatorname*{argmax}_{f \in \mathcal{F},\ \theta \in \Theta_f} \ \sum_{i=1}^{n} \ln p_\theta(y_i \mid f(x_i)) - \mathrm{penalty}(n; \theta)$$

How to efficiently explore $\mathcal{F}$?
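In code, the criterion for one candidate $f$ is just its penalized log-likelihood; with the BIC penalty, $\mathrm{penalty}(n; \theta) = (|\theta|/2) \ln n$. A sketch for a fitted scikit-learn classifier (the helper name is illustrative):

```python
import numpy as np

def penalized_loglik(clf, e, y):
    """Sum_i ln p_theta(y_i | f(x_i)) minus the BIC penalty (k/2) ln(n)."""
    p = clf.predict_proba(e)[:, 1]
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = clf.coef_.size + 1  # number of free parameters, intercept included
    return loglik - 0.5 * k * np.log(len(y))

# Higher is better: compare candidate discretizations f by this value.
```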
Exploring F

Example of discretization. The "functional" space $\mathcal{F}$ where $f$ lives is continuous. However, for a fixed design $x = (x_i)_1^n$ there is a countable quotient space $\tilde{\mathcal{F}}$ induced by the equivalence relation

$$f \, \mathcal{R} \, g \iff \forall i, j, \quad f_j(x_i) = g_j(x_i)$$

For instance, three two-interval discretizations $f_j$, $g_j$ and $h_j$ whose cutpoints all fall between the same two observed values of $x_j$ discretize the sample identically, and hence represent the same element of $\tilde{\mathcal{F}}$.

$$(f^\star, \theta^\star) = \operatorname*{argmax}_{f \in \tilde{\mathcal{F}},\ \theta \in \Theta_f} \ \sum_{i=1}^{n} \ln p_\theta(y_i \mid f(x_i)) - \mathrm{penalty}(n; \theta)$$
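The quotienting is easy to see numerically: two cutpoint vectors that separate the same observed points yield the same discretized sample, hence the same element of $\tilde{\mathcal{F}}$ (the values below are illustrative):

```python
import numpy as np

x = np.array([0.1, 0.4, 0.6, 0.9])            # fixed design
f_codes = np.digitize(x, [0.50], right=True)  # cutpoint at 0.50
g_codes = np.digitize(x, [0.45], right=True)  # other cutpoint, same split
print(np.array_equal(f_codes, g_codes))       # True: f R g in F-tilde
```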
State of the art

Current academic methods: a lot of existing heuristics, see [Ramírez-Gallego et al., 2016].

Most of these methods are:
- univariate;
- based on test statistics that are more or less justified (χ²-based).
Supervised multivariate discretization and factor levels grouping
Mathematical formalization

The discretized / grouped $x_j$, denoted by $e_j$, has been seen up to now as the result of a function of $x_j$:

$$e_j = f_j(x_j).$$

The discretization / grouping $e_j$ can be seen as a latent random variable for which

$$p(e_j | x_j) = \underbrace{\mathbf{1}_{e_j}(f_j(x_j))}_{\substack{\text{Heaviside-like function} \\ \text{difficult to optimize}}}.$$

Suppose for now that $m = (m_j)_1^d$ is fixed:

$$e \in E_m = \{1, \dots, m_1\} \times \dots \times \{1, \dots, m_d\}.$$
Mathematical formalization

Model selection criterion: we want the "best" model $p_{\theta^\star}(y | e^\star)$, where $\theta^\star$ is the maximum likelihood estimator and $e^\star$ is determined by AIC, BIC, ...

$$(e^\star, \theta^\star) = \operatorname*{argmax}_{e \in E_m,\ \theta \in \Theta_m} \ \sum_{i=1}^{n} \ln p_\theta(y_i \mid e_i) - \mathrm{penalty}(n; \theta)$$

$E_m$ is still too big, so there is a need for a "path" in $E_m$.
First set of hypotheses

H1, the implicit hypothesis of every discretization: the predictive information about $y$ in $x$ is "squeezed" into $e$, i.e. $p^{\mathrm{true}}(y | x, e) = p^{\mathrm{true}}(y | e)$.

H2, conditional independence: conditional independence of $e_j | x_j$ from the other features $x_k$, $k \neq j$.

Figure: Dependence structure between $x_j$, $e_j$ and $y$: each $x_j$ feeds its own $e_j$ through $f_j$, and $y$ depends on $(e_1, \dots, e_d)$ only.
Proposal: continuous relaxation

H3, the link between $x_j$ and $e_j$: continuous relaxation of a discrete problem (cf. neural nets).

Continuous features, relaxation of the "hard" discretization: the link between $e_j$ and $x_j$ is supposed to be polytomous logistic: $p_{\alpha_j}(e_j | x_j)$.

Categorical features, relaxation of the grouping problem: a simple contingency table is used: $p_{\alpha_j}(e_j = k \,|\, x_j = \ell) = \alpha_j^{k,\ell}$.
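A sketch of the two soft links of H3 (all parameter values below are made up): for a continuous $x_j$, $p_{\alpha_j}(e_j | x_j)$ is a polytomous (softmax) regression; for a categorical $x_j$, it is a column-indexed probability table.

```python
import numpy as np

def polytomous_link(x, alpha):
    """p_alpha(e_j = k | x_j) for continuous x_j: softmax of affine scores.
    alpha has shape (m, 2): one (intercept, slope) pair per level k."""
    scores = alpha[:, 0] + alpha[:, 1] * x   # shape (m,)
    w = np.exp(scores - scores.max())        # numerically stable softmax
    return w / w.sum()

def table_link(level, alpha):
    """p_alpha(e_j = k | x_j = l) for categorical x_j: alpha[k, l] is a
    contingency table whose columns each sum to one."""
    return alpha[:, level]

alpha_cont = np.array([[0.0, -2.0], [0.5, 0.0], [0.0, 2.0]])  # m = 3 levels
print(polytomous_link(0.8, alpha_cont))
alpha_cat = np.array([[0.7, 0.2], [0.3, 0.8]])  # m = 2 groups, o = 2 levels
print(table_link(1, alpha_cat))
```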
Intuitions about how it works: model proposal

$$p(y | x, \theta, \alpha) = \sum_{e \in E_m} p(y | x, e)\, p(e | x) = \sum_{e \in E_m} p(y | e) \prod_{j=1}^{d} p(e_j | x_j)$$

$$= \sum_{e \in E_m} \underbrace{p_{\theta_e}(y | e)}_{\text{logistic}} \prod_{j=1}^{d} \underbrace{p_{\alpha_j}(e_j | x_j)}_{\text{logistic or table}} \ \approx\ p_{\theta^\star}(y | e^\star)$$

Subsequently, it is equivalent to "optimize" $p(y | x, \theta, \alpha)$:

$$\max_{\theta, e}\ p_\theta(y | e) \ \simeq\ \max_{\theta, \alpha}\ p(y | x, \theta, \alpha)$$
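For a tiny $E_m$ the chain of equalities can be checked by brute force; a sketch for one observation with d = 2 features (all probability tables are randomly made up):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
m = (2, 3)  # m_1 = 2 and m_2 = 3 levels

# p_alpha_j(e_j | x_j) at one fixed x: a probability vector per feature.
p_e_given_x = [rng.dirichlet(np.ones(mj)) for mj in m]
# p_theta(y = 1 | e): one Bernoulli parameter per cell of E_m.
cells = list(product(*(range(mj) for mj in m)))
p_y_given_e = {e: rng.uniform() for e in cells}

# p(y=1 | x, theta, alpha) = sum_e p_theta(y|e) * prod_j p_alpha_j(e_j|x_j)
p_y = sum(p_y_given_e[e] * np.prod([p_e_given_x[j][e[j]] for j in range(len(m))])
          for e in cells)
print(p_y)  # a valid probability in [0, 1]
```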
Intuitions about how it works: estimation

The "classical" estimation strategy with latent variables is the EM algorithm, but the E-step would still involve a sum over $E_m$:

$$p(y | x, \theta, \alpha) = \sum_{e \in E_m} p_\theta(y | e) \prod_{j=1}^{d} p_{\alpha_j}(e_j | x_j)$$

Use a Stochastic-EM instead! Draw $e$ knowing that:

$$p(e | x, y) = \frac{p_\theta(y | e) \prod_{j=1}^{d} p_{\alpha_j}(e_j | x_j)}{\underbrace{\sum_{e \in E_m} p_\theta(y | e) \prod_{j=1}^{d} p_{\alpha_j}(e_j | x_j)}_{\text{still difficult to calculate}}}$$

Gibbs-sampling step:

$$p(e_j | x, y, e^{\{-j\}}) \propto p_\theta(y | e)\, p_{\alpha_j}(e_j | x_j)$$
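A sketch of one such Gibbs update for a single observation i and feature j, directly implementing the proportionality above (the callables p_theta_y and p_alpha_j are assumed given, e.g. wrappers around fitted models):

```python
import numpy as np

def gibbs_update_ej(e_i, j, x_ij, y_i, p_theta_y, p_alpha_j, m_j, rng):
    """Draw e_{i,j} from p(e_j | x, y, e_{-j}) ∝ p_theta(y_i | e) p_alpha_j(e_j | x_ij).

    p_theta_y(y, e) -> likelihood of y under the logistic model for vector e;
    p_alpha_j(k, x) -> soft-link probability p_alpha_j(e_j = k | x_j = x).
    """
    weights = np.empty(m_j)
    for k in range(m_j):               # score each candidate level of e_j
        e_cand = e_i.copy()
        e_cand[j] = k
        weights[k] = p_theta_y(y_i, e_cand) * p_alpha_j(k, x_ij)
    weights /= weights.sum()           # normalize over the m_j levels
    return rng.choice(m_j, p=weights)  # sample the new level
```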
Algorithm

Initialization: from the raw data matrix $(x_{i,j})_{n \times d}$, draw the discretized matrix $(e_{i,j})_{n \times d}$ at random.

Loop:
1. Fit the logistic regression of $y = (y_1, \dots, y_n)$ on the current $(e_{i,j})$ (updates $\theta$).
2. Fit, feature by feature, the polytomous regression of $(e_{i,j})$ on $(x_{i,j})$ (updates each $\alpha_j$).
3. Update $e$: for each feature $j$, draw $e_{i,j}$ by random sampling from $p(y_i, e_{i,j} = k \,|\, x_i)$.

Calculating $e^{\mathrm{MAP}}$, the MAP estimate:

$$e^{\mathrm{MAP}}_{i,j} = \operatorname*{argmax}_{e_j}\ p_{\alpha_j}(e_j | x_{i,j})$$

A runnable sketch of this loop is given below.
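Putting the pieces together, here is a compact sketch of the whole loop for continuous features only, with scikit-learn fitting both regressions; the function name, level count, and iteration budget are assumptions, and corner cases (a level vanishing entirely from some e_j, non-convergence) are deliberately glossed over:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def sem_discretize(x, y, m=4, n_iter=30, seed=0):
    """SEM-Gibbs sketch. x: (n, d) continuous features; y: (n,) in {0, 1}.
    Returns the MAP discretization candidates, one per iteration."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    e = rng.integers(m, size=(n, d))  # initialization: random e
    # sparse_output requires scikit-learn >= 1.2
    enc = OneHotEncoder(categories=[range(m)] * d, sparse_output=False)
    candidates = []
    for _ in range(n_iter):
        # 1) Logistic regression of y on the current e (theta update).
        clf_y = LogisticRegression(max_iter=1000).fit(enc.fit_transform(e), y)
        # 2) Polytomous regression of each e_j on x_j (alpha_j update).
        links = [LogisticRegression(max_iter=1000).fit(x[:, [j]], e[:, j])
                 for j in range(d)]
        # 3) Gibbs step: redraw each column of e from
        #    p(e_j | x, y, e_{-j}) ∝ p_theta(y | e) p_alpha_j(e_j | x_j).
        for j in range(d):
            levels = links[j].classes_
            p_alpha = links[j].predict_proba(x[:, [j]])  # (n, len(levels))
            w = np.empty((n, len(levels)))
            for k_idx, k in enumerate(levels):
                e_try = e.copy()
                e_try[:, j] = k
                # column y_i of predict_proba is p_theta(y_i | e_try_i)
                p_y = clf_y.predict_proba(enc.transform(e_try))
                w[:, k_idx] = p_y[np.arange(n), y] * p_alpha[:, k_idx]
            w /= w.sum(axis=1, keepdims=True)
            u = rng.random((n, 1))  # vectorized inverse-CDF sampling
            e[:, j] = levels[(u > w.cumsum(axis=1)).sum(axis=1)]
        # e_MAP_{i,j} = argmax_k p_alpha_j(e_j = k | x_{i,j})
        candidates.append(np.column_stack(
            [links[j].classes_[links[j].predict_proba(x[:, [j]]).argmax(axis=1)]
             for j in range(d)]))
    return candidates
```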
Go back to "hard" thresholding: MAP estimation

Figure: for a single continuous feature $x_1$, three panels plot $p(1|x_1)$, $p(2|x_1)$ and $p(3|x_1)$; whichever level is most probable wins, which partitions the real line at roughly $-0.7$ and $1$ into the regions $e^{\mathrm{MAP}}_1 = 1$, $e^{\mathrm{MAP}}_1 = 2$ and $e^{\mathrm{MAP}}_1 = 3$.
In the end: the best discretization

New model selection criterion: we have drastically restricted the search space to provably clever candidates $e^{(1)}_{\mathrm{MAP}}, \dots, e^{(\mathrm{iter})}_{\mathrm{MAP}}$ resulting from the Gibbs sampling and MAP estimation.

$$(e^\star, \theta^\star) = \operatorname*{argmax}_{e \in \{e^{(1)}_{\mathrm{MAP}}, \dots, e^{(\mathrm{iter})}_{\mathrm{MAP}}\},\ \theta \in \Theta_m} \ \sum_{i=1}^{n} \ln p_{\theta_e}(y_i \mid e_i) - \mathrm{penalty}(n; \theta)$$

We would still need to loop over candidates $m$! In practice, if $\forall i,\ p(e_{i,j} = 1 | x_{i,j}, y_i) \ll 1$, then the level $e_j = 1$ disappears... Start with $m = (m_{\max})_1^d$ and "wait", eventually until $m = 1$.
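The selection among the visited MAP candidates then reuses the penalized criterion; a sketch with BIC as the penalty (the helper name is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def select_best(candidates, y):
    """Pick e* among the Gibbs/MAP candidates by penalized log-likelihood."""
    best, best_crit = None, -np.inf
    for e_map in candidates:
        enc = OneHotEncoder(sparse_output=False)  # only levels still present
        design = enc.fit_transform(e_map)
        clf = LogisticRegression(max_iter=1000).fit(design, y)
        p = clf.predict_proba(design)[:, 1]
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        crit = loglik - 0.5 * (clf.coef_.size + 1) * np.log(len(y))
        if crit > best_crit:
            best, best_crit = (e_map, clf), crit
    return best, best_crit
```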
Interactions in logistic regression
Notations

Let $\delta$ be an upper triangular binary matrix with $\delta_{k,\ell} = 1$ if $k < \ell$ and features $k$ and $\ell$ "interact" in the logistic regression:

$$\mathrm{logit}(p_{\theta^{f}}(1|f(x))) = \theta_0 + \sum_{j=1}^{d} \theta_j^{f_j(x_j)} + \sum_{1 \le k < \ell \le d} \delta_{k,\ell}\, \theta_{k,\ell}^{f_k(x_k), f_\ell(x_\ell)}$$

Imagine for now that the discretization $e = f(x)$ is fixed. The criterion becomes:

$$(\theta^\star, \delta^\star) = \operatorname*{argmax}_{\theta,\ \delta \in \{0,1\}^{d(d-1)/2}} \ \sum_{i=1}^{n} \ln p_\theta(y_i \mid e_i, \delta) - \mathrm{penalty}(n; \theta)$$

Analogous to the previous problem: $2^{d(d-1)/2}$ models.
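In code, δ simply selects which cross-product blocks enter the design matrix; a sketch with one-hot encoded discretized features (names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def interaction_design(e, delta):
    """Main-effect columns for every feature, plus one block of
    level-pair indicator columns for each (k, l) with delta[k, l] = 1."""
    n, d = e.shape
    onehots = [OneHotEncoder(sparse_output=False).fit_transform(e[:, [j]])
               for j in range(d)]
    blocks = list(onehots)
    for k in range(d):
        for l in range(k + 1, d):
            if delta[k, l]:
                # one column per pair of levels (f_k(x_k), f_l(x_l))
                outer = np.einsum("ia,ib->iab", onehots[k], onehots[l])
                blocks.append(outer.reshape(n, -1))
    return np.hstack(blocks)

rng = np.random.default_rng(0)
e = rng.integers(3, size=(100, 3))         # 3 discretized features, 3 levels
delta = np.zeros((3, 3), dtype=bool)
delta[0, 2] = True                         # features 1 and 3 interact
print(interaction_design(e, delta).shape)  # 9 main + 9 interaction columns
```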
Model proposal

$\delta$ is latent and hard to optimize over: use a stochastic algorithm! The strategy used here is a Metropolis-Hastings algorithm.

$$p(y | e) = \sum_{\delta \in \{0,1\}^{d(d-1)/2}} p(y | e, \delta)\, p(\delta)$$

$$p(\delta | e, y) \propto p(y | e, \delta)\, p(\delta) \approx \exp(-\mathrm{BIC}[\delta]/2)\, p(\delta)$$

With the flat prior $p(\delta_{k,\ell}) = \frac{1}{2}$, the factor $p(\delta)$ cancels out of the acceptance ratio.

Which transition proposal $q : \{0,1\}^{d(d-1)/2} \times \{0,1\}^{d(d-1)/2} \to [0; 1]$?
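A sketch of the resulting Metropolis-Hastings move on δ, using the BIC approximation above; here the proposal flips one uniformly chosen entry (the BIC-guided proposal of the appendix refines this), and bic_of is an assumed callable returning the BIC of the logistic regression with those interactions:

```python
import numpy as np

def mh_step(delta, bic_of, rng):
    """One Metropolis-Hastings move on the upper triangular matrix delta.
    Under the flat prior p(delta_kl) = 1/2 and a symmetric single-flip
    proposal, the acceptance ratio reduces to exp((BIC_old - BIC_new) / 2)."""
    d = delta.shape[0]
    k, l = sorted(rng.choice(d, size=2, replace=False))  # one entry, k < l
    proposal = delta.copy()
    proposal[k, l] = ~proposal[k, l]                     # flip it
    log_ratio = 0.5 * (bic_of(delta) - bic_of(proposal))
    if np.log(rng.random()) < log_ratio:
        return proposal  # accepted
    return delta         # rejected
```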
Model proposal

That would mean $2^{d(d-1)}$ transition probabilities to calculate... We restrict changes to only one entry $\delta_{k,\ell}$ at a time.

Proposal: the gain/loss in BIC between the bivariate models with / without the interaction.

Trick: alternate one discretization / grouping step and one "interaction" step.
Results: several datasets

Performance was assessed on simulated data. Good performance on real data (Gini index, higher is better):

| Dataset | Current performance | glmdisc | Basic glm |
|---|---|---|---|
| Auto (n=50,000; d=15) | 57.9 | 64.84 | 58 |
| Revolving (n=48,000; d=9) | 58.57 | 67.15 | 53.5 |
| Prospects (n=5,000; d=25) | 35.6 | 47.18 | 32.7 |
| Electronics (n=140,000; d=8) | 57.5 | 58 | -10 |
| Young (n=5,000; d=25) | ≈ 15 | 30 | 12.2 |
| Basel II (n=70,000; d=13) | 70 | 71.3 | 19 |

Relatively fast computing time: between 2 hours and a day on a laptop, depending on the number of observations, features, ...

Virtually no human time required.
Conclusion and future work
Take-aways

Conclusion:
- Reinterpretation as a latent variable problem;
- Resolution proposal relying on MCMC and "soft" discretization;
- Good empirical results and statistical guarantees (to some extent...);
- R implementation of glmdisc available on GitHub, to be submitted to CRAN;
- Python implementation of glmdisc available on GitHub and PyPI;
- Big gain for statisticians in the field of Credit Scoring.

Perspectives:
- Tested for logistic regression and polytomous logistic links: can be adapted to other models $p_\theta$ and $p_\alpha$!
- The same model can be estimated with shallow neural networks.
Shallow neural nets as a substitute estimation procedure

Figure: a feed-forward network with an input layer, one hidden layer of sigmoid units, and an output layer. For each feature $x_j$, a block of $m_j - 1$ hidden neurons with sigmoid activations plays the role of $p_{\alpha_j}(e_j | x_j)$.
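A sketch of this neural view in PyTorch (layer sizes, optimizer, and training budget are assumptions; the deck's m_j - 1 sigmoid units are replaced here by an m_j-way softmax, another parameterization of the same soft membership):

```python
import torch
import torch.nn as nn

class SoftDiscretizationNet(nn.Module):
    """One softmax block per feature ≈ p_alpha_j(e_j | x_j), then a single
    logistic output on the concatenated memberships ≈ p_theta(y | e)."""
    def __init__(self, d, m):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(1, m) for _ in range(d)])
        self.out = nn.Linear(d * m, 1)

    def forward(self, x):  # x: (n, d)
        soft = [torch.softmax(block(x[:, [j]]), dim=1)
                for j, block in enumerate(self.blocks)]
        return self.out(torch.cat(soft, dim=1)).squeeze(1)  # logits for y

torch.manual_seed(0)
x = torch.rand(1000, 2)
y = torch.bernoulli(torch.sigmoid(torch.sin((x[:, 0] - 0.7) * 7)))
net = SoftDiscretizationNet(d=2, m=4)
opt = torch.optim.Adam(net.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):  # minimal training loop
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
print(loss.item())
```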
Thanks!
References

Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J. M., and Herrera, F. (2016). Data discretization: taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(1):5-21.
Interaction discovery: proposal

$$p(\delta_{k,\ell} = 1 \mid e_k, e_\ell, y) = g(\mathrm{BIC}[\delta_{k,\ell} = 1] - \mathrm{BIC}[\delta_{k,\ell} = 0]) \approx \exp\Big(\tfrac{1}{2}\big(\mathrm{BIC}[p_\theta(y | e_k, e_\ell, \delta_{k,\ell} = 0)] - \mathrm{BIC}[p_\theta(y | e_k, e_\ell, \delta_{k,\ell} = 1)]\big)\Big)$$

$$q(\delta, \delta') = |\delta_{k,\ell} - p_{k,\ell}| \ \text{ for the unique couple } (k, \ell) \text{ s.t. } \delta^{(s)}_{k,\ell} \neq \delta'_{k,\ell}$$

$$\alpha = \min\left(1,\ \frac{p(\delta' | e, y)}{p(\delta | e, y)} \cdot \frac{1 - q(\delta, \delta')}{q(\delta, \delta')}\right) \approx \min\left(1,\ \exp\Big(\tfrac{1}{2}\big(\mathrm{BIC}[p_\theta(y | e, \delta)] - \mathrm{BIC}[p_\theta(y | e, \delta')]\big)\Big)\, \frac{1 - q(\delta, \delta')}{q(\delta, \delta')}\right)$$