Business Data Analytics
Lecture 5: Customer Lifecycle Management – Classification Problems
MTAT.03.319
The slides are available under a Creative Commons license. The original owner of these slides is the University of Tartu.
Outline
• Classification and its applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Unsupervised vs. Supervised Learning
[Figure: two scatter plots over Feature 1 and Feature 2. In unsupervised learning the points carry no labels; in supervised learning each point is labelled with its class.]
Supervised Learning: Regression vs. Classification
[Figure: two pipelines from a raw dataset (rows = samples, columns = features) through preprocessing, training, and prediction. A regressor outputs a quantity per sample (e.g. 10, 55, 23, 12); a classifier outputs a class per sample together with a class probability/confidence (e.g. 0.78).]
Note: A classifier can be trained to classify each sample into more than two classes.
Classification in Customer Lifecycle Management
• Lead propensity
• Cross-sell/up-sell propensity
• Response modeling / buying propensity
Applying classification
• We are a charity. We have a database of 100K donors who have not donated in the past 12 months. We know their basic demographics, their address, and how much (and when) they have donated in the past.
• Sending a Christmas postcard asking for a donation costs 60 cents per piece. When we mail out, the average donation (when there is one) is about 2 euros.
• Should we send a (postal) mail to all 100K donors?
• How can a classifier help me answer this question?
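A classifier can estimate each donor's probability of responding. Mailing a donor pays off in expectation when the expected donation exceeds the mailing cost: p × 2.00 ≥ 0.60, i.e. p ≥ 0.30. A minimal sketch in R; the data frame donors and the fitted model response_model are hypothetical names, not from the slides:

# Hedged sketch: score all donors and mail only those whose expected
# donation exceeds the 60-cent mailing cost.
mailing_cost <- 0.60
avg_donation <- 2.00
# 'response_model' is a previously trained classifier (e.g. fitted with glm);
# 'donors' is a hypothetical data frame describing the 100K donors.
donors$p_respond <- predict(response_model, newdata = donors, type = 'response')
donors$expected_value <- donors$p_respond * avg_donation - mailing_cost
mail_list <- donors[donors$expected_value > 0, ]  # i.e. p_respond > 0.30
nrow(mail_list)  # how many donors are worth mailing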
Classification in Customer Lifecycle Management
• Lead propensity
• Response modeling / buying propensity
• Cross-sell/up-sell propensity
• Fault/complaint prediction
• Churn propensity
• Win-back propensity
What is churn?
Churn is the loss of an existing customer. It can be defined in a subscription-based setting (the customer cancels or does not renew a contract) or an event-based setting (the customer simply stops transacting).
Why care?
• For most companies, the customer acquisition cost (the cost of acquiring a new customer) is several times higher than the cost of retaining an existing customer.
• Bringing a new customer to the profitability level of an existing customer takes time.
• Churn can often be avoided with proactive measures.
How to decrease churn?
• Long-term: tactical intervention
• Short-term: operational intervention
How to decrease short-term churn?
• Recurrent payments (automatic subscription renewal)
• Reminder emails (e.g. account expiration)
• Providing extra services and knowledge (e.g. figure out where clients have trouble and provide help with that)
• Use a scoring model to determine the intervention type and moment
Logistic regression
Despite its name, logistic regression is not regression*; it is classification.
* It is called "regression" due to its similarity to linear regression.
Binary logistic regression
"Binary" means that the target y takes one of two values: y ∈ {0, 1}.
Binary logistic regression
• "sigmoid" means S-shaped
• also known as a squashing function, since it maps the real line to [0, 1],
• which is necessary if the output is to be interpreted as a probability
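For reference, the standard form of the model behind these bullets (the formula itself did not survive the slide export):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$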
Binary logistic regression
[Figure: sigmoid curve with the low region labelled "0" and the high region labelled "1".]
If we threshold the output at 0.5, we create a decision rule of the form: predict $\hat{y} = 1$ if $P(y = 1 \mid x) \ge 0.5$, and $\hat{y} = 0$ otherwise.
Binary logistic regression: example in R

logit <- glm(data=train, as.factor(danger)~waiting, family='binomial')
test$predictions_probability <- predict(newdata=test, logit, type='response')
test$predictions_binary <- ifelse(test$predictions_probability <= 0.5, 0, 1)
table(real=test$danger, predictions=test$predictions_binary)

    predictions
real   0   1
   0  34   0
   1   2  64
Binary logistic regression: example in R

logit <- glm(data=train, as.factor(danger)~waiting, family='binomial')
summary(logit)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -42.5845    11.5448  -3.689 0.000225 ***
waiting       0.6380     0.1728   3.692 0.000223 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Coefficients are difficult to interpret directly: they are logarithms of odds. We can take the exponent of the coefficients: exp(coef(logit))

(Intercept)     waiting
      0.000       1.893

An increase of waiting by one unit multiplies the odds of "danger" by 1.893.
Outline
• Classification and applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Decision Trees
[Figure: example decision tree. The root splits on Age: the branch Age < 30 predicts YES; the branch Age >= 30 splits on Car Type, where Minivan predicts YES and Sports/Truck predicts NO. A companion plot shows the same rules as rectangular YES/NO regions over the Age axis (0 to 60) and the Car Type values.]
Decision Tree: Split Selection
• Numerical or ordered attributes: find a split point that separates the (two) classes.
[Figure: samples of the Yes and No classes plotted along the Age axis; a split point between 30 and 35 separates them.]
Decision Tree: Split Selection
• Categorical attributes: how should the values Sport, Truck, and Minivan be grouped?
  (Sport, Truck) vs. (Minivan)
  (Sport) vs. (Truck, Minivan)
  (Sport, Minivan) vs. (Truck)
Which split should we pick? Hint: the one that produces the "purest" partitions, as judged by an impurity measure (e.g. the Gini index), computed as in the sketch below.
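As an illustration, a minimal sketch of the Gini impurity, one common impurity measure; the helper function and the example label vectors are ours, not from the slides:

# Gini impurity of a set of class labels: 1 - sum(p_k^2).
# 0 means a perfectly pure node; higher means more mixed.
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}

gini(c('yes', 'yes', 'yes'))  # 0.00: pure node
gini(c('yes', 'yes', 'no'))   # 0.44: mixed node
# A candidate split is scored by the weighted average impurity of the
# child nodes it creates; we pick the split that lowers impurity the most.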
Decision Tree
Advantages
1. Easy to understand
2. Useful in data exploration
3. Very fast to compute (even for large datasets)
Disadvantages
1. Relatively low accuracy (like logistic regression)
2. Prone to overfitting (more on this later)
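A minimal sketch of training a decision tree in R with the rpart package, reusing the train/test data frames from the logistic regression example:

library(rpart)

# Fit a classification tree on the same danger ~ waiting task.
tree <- rpart(as.factor(danger) ~ waiting, data = train, method = 'class')

# Predict classes on the test set and inspect the confusion matrix.
test$tree_pred <- predict(tree, newdata = test, type = 'class')
table(real = test$danger, predictions = test$tree_pred)

# print(tree) lists the learned splits, which is what makes the model
# easy to understand and useful for data exploration.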
White-Box vs. Black-Box Models
[Figure: a white-box model exposes how the input is mapped to the output; a black-box model does not.]
• White-box: logistic regression, decision tree
• Black-box ensemble methods: random forests, gradient boosting (e.g. XGBoost)
Ensemble methods
• Idea: train multiple classifiers on subsets of the data or subsets of the features and have them compete.
• Examples:
  • Random forests
Random forest
Three key ideas (see the sketch below):
1. Tree bagging: each tree is built with a random subset of the data
2. Each tree is built with a random subset of the features
3. Voting: the trees' predictions are combined by majority vote
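A minimal sketch using the randomForest package on the same hypothetical training data (assumed here to have several feature columns besides danger); the package handles bagging, feature subsetting, and voting internally:

library(randomForest)

rf <- randomForest(as.factor(danger) ~ ., data = train,
                   ntree = 500,  # number of bagged trees
                   mtry = 2)     # features sampled at each split

test$rf_pred <- predict(rf, newdata = test)  # majority vote over the trees
table(real = test$danger, predictions = test$rf_pred)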
Ensemble methods
• Idea: train multiple classifiers on subsets of the data or subsets of the features and have them compete.
• Examples:
  • Random forests
  • XGBoost, LightGBM (state of the art)
• Why are they powerful?
  • Because we can tune their many parameters, e.g. the number of trees, their depth, etc. This is called "hyperparameter optimization" (a sketch follows below).
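A minimal gradient-boosting sketch with the xgboost R package on the same hypothetical data; the hyperparameter values (nrounds, max_depth, eta) are illustrative, exactly the kind of knobs hyperparameter optimization would search over:

library(xgboost)

# xgboost expects a numeric feature matrix and a 0/1 label vector.
X_train <- as.matrix(train[, 'waiting', drop = FALSE])
X_test  <- as.matrix(test[, 'waiting', drop = FALSE])

bst <- xgboost(data = X_train, label = train$danger,
               objective = 'binary:logistic',
               nrounds = 100,   # number of boosting rounds (trees)
               max_depth = 3,   # depth of each tree
               eta = 0.1,       # learning rate
               verbose = 0)

test$xgb_prob <- predict(bst, X_test)          # probability of class 1
test$xgb_pred <- ifelse(test$xgb_prob >= 0.5, 1, 0)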
Outline
• Classification and applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Quality of Classifiers
Calculate a confusion matrix.
Note: it depends on the class probability threshold!
Quality of Classifiers
[Table: test samples with their features, the predicted probability of the positive class, and the actual class.]
Threshold = 0.5, i.e., if prediction >= 0.5 then churn (1), else no churn (0).
Quality of Classifiers
[Table: the same test samples with features, predicted probability of the positive class, and actual class.]
Confusion matrix (rows: predicted class, columns: actual class):

Predicted/Actual   1 (pos)   0 (neg)
1 (pos)               2         1
0 (neg)               0         2
Quality of Classifiers

Predicted/Actual   1 (pos)   0 (neg)
1 (pos)               2         1
0 (neg)               0         2

Precision = TP / (TP + FP) = 2/3
Recall = TP / (TP + FN) = 2/2 = 1
Quality of Classifiers
[Table: the same test samples with features, predicted probability of the positive class, and actual class.]
Threshold = 0.8, i.e., if prediction >= 0.8 then churn (1), else no churn (0).

Predicted/Actual   1 (pos)   0 (neg)
1 (pos)               2         0
0 (neg)               0         3

Now: what is the precision and the recall?
Quality of Classifiers
• Accuracy: ratio of correctly predicted observations to the total number of observations
  • Accuracy = (TP + TN) / (TP + FP + FN + TN)
  • Intuitive, but terrible in case of class imbalance
• F-score: harmonic mean of precision and recall
  • F-score = 2 * (Recall * Precision) / (Recall + Precision)
  • More resilient to class imbalance (a computation sketch follows below)
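A minimal sketch computing these metrics from predicted and actual label vectors; the five-sample labelling below is one we infer from the threshold-0.5 confusion matrix above:

# Five test samples at threshold 0.5: one negative misclassified as positive.
actual    <- c(1, 1, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0)

tp <- sum(predicted == 1 & actual == 1)  # 2
fp <- sum(predicted == 1 & actual == 0)  # 1
fn <- sum(predicted == 0 & actual == 1)  # 0
tn <- sum(predicted == 0 & actual == 0)  # 2

accuracy  <- (tp + tn) / (tp + fp + fn + tn)                 # 0.8
precision <- tp / (tp + fp)                                  # 2/3
recall    <- tp / (tp + fn)                                  # 1
f_score   <- 2 * precision * recall / (precision + recall)   # 0.8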
Outline
• Classification and applications to CLM
• Classification with logistic regression
• Classification with decision trees
• Black-box classification models
• Measuring the quality of classification models
• Pitfalls: class imbalance & overfitting
Class imbalance
A credit card company collects data about OK vs fraudulent transactions
OK transactions make up 99.999% of transactions
We train a classifier to classify transactions into OK and fraudulent
We get an accuracy of 99.999%
Why?
Class imbalance
A classifier that always predicts "not fraud" achieves:
• 99.997% accuracy
• 99.997% precision
• 0 recall!
Why does the classifier learn this? Because twisting the classifier to catch the bad guys makes us also "catch" some good guys… so there is no visible improvement in accuracy.
Slide by: David Kauchak
Dealing with class imbalance
• Use a quality measure that is as robust as possible:
  • F-score is OK in extreme cases (like the previous one)
  • Less robust in more subtle cases, e.g. a 90/10 class imbalance
  • ROC and AUC (next) are better…
  • The Area Under the Precision-Recall Curve (AUPRC) is even better
ROC Curve and AUC: another metric that is robust against imbalanced datasets
ROC – Robust Measurement of Classifier Quality
True Positive Rate (TPR) = Sensitivity = tp/(tp + fn)
False Positive Rate (FPR) = fp/(fp + tn)
Let's plot the ROC curve for the following example.
[Figure: ROC curve built point by point; x-axis FPR and y-axis TPR, both running from 0 to 1.]
Intuition behind ROC
[Figure: ROC curves of two models against the random-guess diagonal, with the area under each curve (AUC) shaded. Both our model and the second one predict correctly when they are confident in their scores; model2 makes mistakes earlier, so its curve bends away from the top-left corner sooner.]
model1: Area under the curve: 0.9991
model2: Area under the curve: 0.8365
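A minimal sketch of plotting a ROC curve and computing AUC with the pROC package, using the predicted probabilities from the earlier logistic regression example:

library(pROC)

# roc() pairs each true label with the model's predicted probability.
roc_obj <- roc(response = test$danger,
               predictor = test$predictions_probability)

plot(roc_obj)  # ROC curve: TPR vs. FPR over all possible thresholds
auc(roc_obj)   # area under the curve (closer to 1 is better)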
Dealing with class imbalance
• Use a quality measure that is as robust as possible:
  • F-score is OK in extreme cases (like the previous one)
  • Less robust in more subtle cases, e.g. a 90/10 class imbalance
  • ROC and AUC are better…
  • The Area Under the Precision-Recall Curve (AUPRC) is even better
• Undersampling (see the sketch below)
  • Deliberately throw away training data
  • E.g. we have 6000 positive samples and 500 negative ones for training
  • We randomly select 500 of the 6000 positive samples
• Oversampling
  • Synthetic generation of minority-class data, e.g. SMOTE
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
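A minimal sketch of random undersampling in base R, using the 6000-positive / 500-negative example from the list above; the data frame train_raw and its label column are hypothetical names:

# Split the hypothetical training data by class.
pos <- train_raw[train_raw$label == 1, ]  # 6000 positive samples
neg <- train_raw[train_raw$label == 0, ]  #  500 negative samples

set.seed(42)  # for reproducibility
# Randomly keep only as many positives as there are negatives.
pos_down <- pos[sample(nrow(pos), nrow(neg)), ]

# Balanced training set: 500 positives + 500 negatives, shuffled.
train_balanced <- rbind(pos_down, neg)
train_balanced <- train_balanced[sample(nrow(train_balanced)), ]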
Under- and overfitting
How to detect overfitting
• Compute AUC (or F-score) on the training set and on the testing set
• The larger the difference, the more likely it is that overfitting has taken place
• E.g. AUC on training = 0.97, AUC on testing = 0.85
Detecting overfitting (more rigorously)
Source: http://karlrosaen.com/ml/learning-log/2016-06-20/
K-fold cross-validation
Step 1: Split the data into K folds.
Step 2: Use K-1 folds for training and the remaining fold for testing.
Step 3: Repeat step 2 so that each of the K folds serves as the test fold once.
Step 4: Average the model quality across all K models.
The measure you get is more reflective of the likely performance in practice… A sketch with the caret package follows below.
[Figure legend: E = output (error or accuracy) of your model on each fold.]
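A minimal sketch of 10-fold cross-validation with the caret package on the hypothetical danger ~ waiting data; caret refits the model on each fold split and averages the quality measure across folds:

library(caret)

# Ask caret for 10-fold cross-validation.
ctrl <- trainControl(method = 'cv', number = 10)

# Train a logistic regression; caret repeats the fit for each fold split.
cv_model <- train(as.factor(danger) ~ waiting, data = train,
                  method = 'glm', family = 'binomial',
                  trControl = ctrl)

cv_model  # reports accuracy averaged over the 10 held-out folds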
Practice time!
https://courses.cs.ut.ee/2018/bda/fall/Main/Practice
In case you need more…
Theory:
• Logistic regression: https://www.youtube.com/watch?v=Z5WKQr4H4Xk
• Decision Trees: https://www.youtube.com/watch?v=opQ49Xr748k
• Random Forest: https://www.youtube.com/watch?v=gmmV4drPTS4
Practical:
• Logistic regression: https://www.youtube.com/watch?v=nubin7hq4-s
• Decision Tree: https://www.youtube.com/watch?v=JFJIQ0_2ijg&t=1026s
• Random Forest: https://www.youtube.com/watch?v=nVMw7fTlj4o