Business Data Analytics
Lecture 5: Customer Lifecycle Management – Classification Problems
MTAT.03.319
The slides are available under a Creative Commons license. The original owner of these slides is the University of Tartu.
Outline
• Classification and its applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Unsupervised vs. Supervised Learning
[Figure: two scatter plots over Feature 1 and Feature 2. In unsupervised learning the points carry no labels; in supervised learning each point is labelled with its class.]
Supervised Learning: Regression vs. Classification
[Figure: two pipelines from a raw dataset (rows = samples, columns = features) through preprocessing, training, and prediction. A regressor outputs a quantity per sample (e.g. 10, 55, 23, 12); a classifier outputs a class per sample together with a class probability/confidence (e.g. 0.78).]
Note: A classifier can be trained to classify each sample into more than two classes.
Classification in Customer Lifecycle Management
• Lead propensity
• Cross-sell/up-sell propensity
• Response modeling / buying propensity
Applying classification
• We are a charity. We have a database of 100K donors who have not donated in the past 12 months. We know their basic demographics, their address, and how much (and when) they have donated in the past.
• Sending a Christmas postcard asking for a donation costs 60 cents per piece. When we mail out, the average donation (when there is one) is about 2 euros.
• Should we send a (postal) mail to all 100K donors?
• How can a classifier help me answer this question?
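A classifier can estimate each donor's probability of responding. Mailing a donor pays off in expectation when the expected donation exceeds the mailing cost: p × 2.00 ≥ 0.60, i.e. p ≥ 0.30. A minimal sketch in R; the data frame donors and the fitted model response_model are hypothetical names, not from the slides:

# Hedged sketch: score all donors and mail only those whose expected
# donation exceeds the 60-cent mailing cost.
mailing_cost <- 0.60
avg_donation <- 2.00
# 'response_model' is a previously trained classifier (e.g. fitted with glm);
# 'donors' is a hypothetical data frame describing the 100K donors.
donors$p_respond <- predict(response_model, newdata = donors, type = 'response')
donors$expected_value <- donors$p_respond * avg_donation - mailing_cost
mail_list <- donors[donors$expected_value > 0, ]  # i.e. p_respond > 0.30
nrow(mail_list)  # how many donors are worth mailing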
Classification in Customer Lifecycle Management
• Lead propensity
• Response modeling / buying propensity
• Cross-sell/up-sell propensity
• Fault/complaint prediction
• Churn propensity
• Win-back propensity
What is churn?
Churn is the loss of an existing customer. It can be defined in a subscription-based setting (the customer cancels or does not renew a contract) or an event-based setting (the customer simply stops transacting).
Why care?
• For most companies, the customer acquisition cost (the cost of acquiring a new customer) is several times higher than the cost of retaining an existing customer.
• Bringing a new customer to the profitability level of an existing customer takes time.
• Churn can often be avoided with proactive measures.
How to decrease churn?
• Long-term: tactical intervention
• Short-term: operational intervention
How to decrease short-term churn?
• Recurrent payments (automatic subscription renewal)
• Reminder emails (e.g. account expiration)
• Providing extra services and knowledge (e.g. figure out where clients have trouble and provide help with that)
• Use a scoring model to determine the intervention type and moment
Logistic regression
Despite its name, logistic regression is not regression*; it is classification.
* It is called "regression" due to its similarity to linear regression.
Binary logistic regression
"Binary" means that the target y takes one of two values: y ∈ {0, 1}.
Binary logistic regression
• "sigmoid" means S-shaped
• also known as a squashing function, since it maps the real line to [0, 1],
• which is necessary if the output is to be interpreted as a probability
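For reference, the standard form of the model behind these bullets (the formula itself did not survive the slide export):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$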
Binary logistic regression
[Figure: sigmoid curve with the low region labelled "0" and the high region labelled "1".]
If we threshold the output at 0.5, we create a decision rule of the form: predict $\hat{y} = 1$ if $P(y = 1 \mid x) \ge 0.5$, and $\hat{y} = 0$ otherwise.
Binary logistic regression: example in R

logit <- glm(data=train, as.factor(danger)~waiting, family='binomial')
test$predictions_probability <- predict(newdata=test, logit, type='response')
test$predictions_binary <- ifelse(test$predictions_probability <= 0.5, 0, 1)
table(real=test$danger, predictions=test$predictions_binary)

    predictions
real   0   1
   0  34   0
   1   2  64
Binary logistic regression: example in R

logit <- glm(data=train, as.factor(danger)~waiting, family='binomial')
summary(logit)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -42.5845    11.5448  -3.689 0.000225 ***
waiting       0.6380     0.1728   3.692 0.000223 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Coefficients are difficult to interpret directly: they are logarithms of odds. We can take the exponent of the coefficients: exp(coef(logit))

(Intercept)     waiting
      0.000       1.893

An increase of waiting by one unit multiplies the odds of "danger" by 1.893.
Outline
• Classification and applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Decision Trees
[Figure: example decision tree. The root splits on Age: the branch Age < 30 predicts YES; the branch Age >= 30 splits on Car Type, where Minivan predicts YES and Sports/Truck predicts NO. A companion plot shows the same rules as rectangular YES/NO regions over the Age axis (0 to 60) and the Car Type values.]
Decision Tree: Split Selection
• Numerical or ordered attributes: find a split point that separates the (two) classes.
[Figure: samples of the Yes and No classes plotted along the Age axis; a split point between 30 and 35 separates them.]
Decision Tree: Split Selection
• Categorical attributes: how should the values Sport, Truck, and Minivan be grouped?
  (Sport, Truck) vs. (Minivan)
  (Sport) vs. (Truck, Minivan)
  (Sport, Minivan) vs. (Truck)
Which split should we pick? Hint: the one that produces the "purest" partitions, as judged by an impurity measure (e.g. the Gini index), computed as in the sketch below.
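As an illustration, a minimal sketch of the Gini impurity, one common impurity measure; the helper function and the example label vectors are ours, not from the slides:

# Gini impurity of a set of class labels: 1 - sum(p_k^2).
# 0 means a perfectly pure node; higher means more mixed.
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}

gini(c('yes', 'yes', 'yes'))  # 0.00: pure node
gini(c('yes', 'yes', 'no'))   # 0.44: mixed node
# A candidate split is scored by the weighted average impurity of the
# child nodes it creates; we pick the split that lowers impurity the most.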
Decision Tree
Advantages
1. Easy to understand
2. Useful in data exploration
3. Very fast to compute (even for large datasets)
Disadvantages
1. Relatively low accuracy (like logistic regression)
2. Prone to overfitting (more on this later)
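A minimal sketch of training a decision tree in R with the rpart package, reusing the train/test data frames from the logistic regression example:

library(rpart)

# Fit a classification tree on the same danger ~ waiting task.
tree <- rpart(as.factor(danger) ~ waiting, data = train, method = 'class')

# Predict classes on the test set and inspect the confusion matrix.
test$tree_pred <- predict(tree, newdata = test, type = 'class')
table(real = test$danger, predictions = test$tree_pred)

# print(tree) lists the learned splits, which is what makes the model
# easy to understand and useful for data exploration.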
White-Box vs. Black-Box Models
[Figure: a white-box model exposes how the input is mapped to the output; a black-box model does not.]
• White-box: logistic regression, decision tree
• Black-box ensemble methods: random forests, gradient boosting (e.g. XGBoost)
Ensemble methods
• Idea: train multiple classifiers on subsets of the data or subsets of the features and have them compete.
• Examples:
  • Random forests
Random forest
Three key ideas (see the sketch below):
1. Tree bagging: each tree is built with a random subset of the data
2. Each tree is built with a random subset of the features
3. Voting: the trees' predictions are combined by majority vote
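A minimal sketch using the randomForest package on the same hypothetical training data (assumed here to have several feature columns besides danger); the package handles bagging, feature subsetting, and voting internally:

library(randomForest)

rf <- randomForest(as.factor(danger) ~ ., data = train,
                   ntree = 500,  # number of bagged trees
                   mtry = 2)     # features sampled at each split

test$rf_pred <- predict(rf, newdata = test)  # majority vote over the trees
table(real = test$danger, predictions = test$rf_pred)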
Ensemble methods
• Idea: train multiple classifiers on subsets of the data or subsets of the features and have them compete.
• Examples:
  • Random forests
  • XGBoost, LightGBM (state of the art)
• Why are they powerful?
  • Because we can tune their many parameters, e.g. the number of trees, their depth, etc. This is called "hyperparameter optimization" (a sketch follows below).
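A minimal gradient-boosting sketch with the xgboost R package on the same hypothetical data; the hyperparameter values (nrounds, max_depth, eta) are illustrative, exactly the kind of knobs hyperparameter optimization would search over:

library(xgboost)

# xgboost expects a numeric feature matrix and a 0/1 label vector.
X_train <- as.matrix(train[, 'waiting', drop = FALSE])
X_test  <- as.matrix(test[, 'waiting', drop = FALSE])

bst <- xgboost(data = X_train, label = train$danger,
               objective = 'binary:logistic',
               nrounds = 100,   # number of boosting rounds (trees)
               max_depth = 3,   # depth of each tree
               eta = 0.1,       # learning rate
               verbose = 0)

test$xgb_prob <- predict(bst, X_test)          # probability of class 1
test$xgb_pred <- ifelse(test$xgb_prob >= 0.5, 1, 0)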
Outline
• Classification and applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Quality of Classifiers
Calculate a confusion matrix.
Note: it depends on the class probability threshold!
Quality of Classifiers
[Table: test samples with their features, the predicted probability of the positive class, and the actual class.]
Threshold = 0.5, i.e., if prediction >= 0.5 then churn (1), else no churn (0).
Quality of Classifiers
[Table: the same test samples with features, predicted probability of the positive class, and actual class.]
Confusion matrix (rows: predicted class, columns: actual class):

Predicted/Actual   1 (pos)   0 (neg)
1 (pos)               2         1
0 (neg)               0         2
Quality of Classifiers

Predicted/Actual   1 (pos)   0 (neg)
1 (pos)               2         1
0 (neg)               0         2

Precision = TP / (TP + FP) = 2/3
Recall = TP / (TP + FN) = 2/2 = 1
Quality of Classifiers
[Table: the same test samples with features, predicted probability of the positive class, and actual class.]
Threshold = 0.8, i.e., if prediction >= 0.8 then churn (1), else no churn (0).

Predicted/Actual   1 (pos)   0 (neg)
1 (pos)               2         0
0 (neg)               0         3

Now: what is the precision and the recall?
Quality of Classifiers
• Accuracy: ratio of correctly predicted observations to the total number of observations
  • Accuracy = (TP + TN) / (TP + FP + FN + TN)
  • Intuitive, but terrible in case of class imbalance
• F-score: harmonic mean of precision and recall
  • F-score = 2 * (Recall * Precision) / (Recall + Precision)
  • More resilient to class imbalance (a computation sketch follows below)
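A minimal sketch computing these metrics from predicted and actual label vectors; the five-sample labelling below is one we infer from the threshold-0.5 confusion matrix above:

# Five test samples at threshold 0.5: one negative misclassified as positive.
actual    <- c(1, 1, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0)

tp <- sum(predicted == 1 & actual == 1)  # 2
fp <- sum(predicted == 1 & actual == 0)  # 1
fn <- sum(predicted == 0 & actual == 1)  # 0
tn <- sum(predicted == 0 & actual == 0)  # 2

accuracy  <- (tp + tn) / (tp + fp + fn + tn)                 # 0.8
precision <- tp / (tp + fp)                                  # 2/3
recall    <- tp / (tp + fn)                                  # 1
f_score   <- 2 * precision * recall / (precision + recall)   # 0.8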
Outline
• Classification and applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Class imbalance
A credit card company collects data about OK vs fraudulent transactions
OK transactions make up 99.999% of transactions
We train a classifier to classify transactions into OK and fraudulent
We get an accuracy of 99.999%
Why?
Class imbalance
A classifier that always predicts "not fraud" achieves:
• 99.997% accuracy
• 99.997% precision
• 0 recall!
Why does the classifier learn this? Because twisting the classifier to catch the bad guys makes us also "catch" some good guys… so there is no visible improvement in accuracy.
Slide by: David Kauchak
Dealing with class imbalance
• Use a quality measure that is as robust as possible:
  • F-score is OK in extreme cases (like the previous one)
  • Less robust in more subtle cases, e.g. a 90/10 class imbalance
  • ROC and AUC (next) are better…
  • The Area Under the Precision-Recall Curve (AUPRC) is even better
ROC Curve and AUC: another metric that is robust against imbalanced datasets
ROC – Robust Measurement of Classifier Quality
True Positive Rate (TPR) = Sensitivity = tp/(tp + fn)
False Positive Rate (FPR) = fp/(fp + tn)
Let's plot the ROC curve for the following example.
[Figure: ROC curve built point by point; x-axis FPR and y-axis TPR, both running from 0 to 1.]
Intuition behind ROC
[Figure: ROC curves of two models against the random-guess diagonal, with the area under each curve (AUC) shaded. Both our model and the second one predict correctly when they are confident in their scores; model2 makes mistakes earlier, so its curve bends away from the top-left corner sooner.]
model1: Area under the curve: 0.9991
model2: Area under the curve: 0.8365
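A minimal sketch of plotting a ROC curve and computing AUC with the pROC package, using the predicted probabilities from the earlier logistic regression example:

library(pROC)

# roc() pairs each true label with the model's predicted probability.
roc_obj <- roc(response = test$danger,
               predictor = test$predictions_probability)

plot(roc_obj)  # ROC curve: TPR vs. FPR over all possible thresholds
auc(roc_obj)   # area under the curve (closer to 1 is better)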
Dealing with class imbalance
• Use a quality measure that is as robust as possible:
  • F-score is OK in extreme cases (like the previous one)
  • Less robust in more subtle cases, e.g. a 90/10 class imbalance
  • ROC and AUC are better…
  • The Area Under the Precision-Recall Curve (AUPRC) is even better
• Undersampling (see the sketch below)
  • Deliberately throw away training data
  • E.g. we have 6000 positive samples and 500 negative ones for training
  • We randomly select 500 of the 6000 positive samples
• Oversampling
  • Synthetic generation of minority-class data, e.g. SMOTE
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
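A minimal sketch of random undersampling in base R, using the 6000-positive / 500-negative example from the list above; the data frame train_raw and its label column are hypothetical names:

# Split the hypothetical training data by class.
pos <- train_raw[train_raw$label == 1, ]  # 6000 positive samples
neg <- train_raw[train_raw$label == 0, ]  #  500 negative samples

set.seed(42)  # for reproducibility
# Randomly keep only as many positives as there are negatives.
pos_down <- pos[sample(nrow(pos), nrow(neg)), ]

# Balanced training set: 500 positives + 500 negatives, shuffled.
train_balanced <- rbind(pos_down, neg)
train_balanced <- train_balanced[sample(nrow(train_balanced)), ]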
Under- and overfitting
How to detect overfitting
• Compute AUC (or F-score) on the training set and on the testing set
• The larger the difference, the more likely it is that overfitting has taken place
• E.g. AUC on training = 0.97, AUC on testing = 0.85
Detecting overfitting (more rigorously)
Source: http://karlrosaen.com/ml/learning-log/2016-06-20/
K-fold cross-validation
Step 1: Split the data into K folds.
Step 2: Use K-1 folds for training and the remaining fold for testing.
Step 3: Repeat step 2 so that each of the K folds serves as the test fold once.
Step 4: Average the model quality across all K models.
The measure you get is more reflective of the likely performance in practice… A sketch with the caret package follows below.
[Figure legend: E = output (error or accuracy) of your model on each fold.]
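A minimal sketch of 10-fold cross-validation with the caret package on the hypothetical danger ~ waiting data; caret refits the model on each fold split and averages the quality measure across folds:

library(caret)

# Ask caret for 10-fold cross-validation.
ctrl <- trainControl(method = 'cv', number = 10)

# Train a logistic regression; caret repeats the fit for each fold split.
cv_model <- train(as.factor(danger) ~ waiting, data = train,
                  method = 'glm', family = 'binomial',
                  trControl = ctrl)

cv_model  # reports accuracy averaged over the 10 held-out folds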
Practice time!
https://courses.cs.ut.ee/2018/bda/fall/Main/Practice
In case you need more…
Theory:
• Logistic regression: https://www.youtube.com/watch?v=Z5WKQr4H4Xk
• Decision Trees: https://www.youtube.com/watch?v=opQ49Xr748k
• Random Forest: https://www.youtube.com/watch?v=gmmV4drPTS4
Practical:
• Logistic regression: https://www.youtube.com/watch?v=nubin7hq4-s
• Decision Tree: https://www.youtube.com/watch?v=JFJIQ0_2ijg&t=1026s
• Random Forest: https://www.youtube.com/watch?v=nVMw7fTlj4o