Datamining: Techniques and Applications in Economics
Rob Potharst
Econometric Institute
Outline of this lecture
Datamining for ICT & Economics, 2
Part 1: Intelligent Decisions in Direct Mailing
Part 2: Brand Choice using Ensemble Methods
Part 3: Ensemble techniques for Choice Problems, especially Churn
Part 1: Intelligent Decisions in Direct Mailing
Rob Potharst, Uzay Kaymak, Wim Pijls Erasmus University Rotterdam
Faculty of Economics, Dept. of Computer Science
Jedid-Jah Jonker, SCP and Nanda Piersma, HES
Datamining for ICT & Economics, part 1: Direct Mailing
Outline
• Decision problems in direct mailing
• The charity organization case
• Target selection
– models: logreg, CHAID, neural networks, association rules, fuzzy modelling
• The frequency problem
– models: MDP, reinforcement learning
(italic: CI methods)
Classical literature
• Optimal mailing policies: Bitran & Mondschein (1996), Mailing Decisions in the Catalog Sales Industry
• On target selection: Bult & Wansbeek (1995), Optimal Selection for Direct Mail
This part of the lecture is based on:
• R. Potharst, U. Kaymak & W. Pijls (2001), Neural Networks for Target Selection in Direct Marketing
• W. Pijls, R. Potharst & U. Kaymak (2001), Pattern-based Target Selection Applied to Fund Raising
• U. Kaymak (2001), Fuzzy Target Selection using RFM variables
• J.J. Jonker, N. Piersma & R. Potharst (2002), Direct Mailing Decisions for a Dutch Fundraiser
http://www.few.eur.nl/few/people/potharst/
Thanks to:
• Jedid-Jah Jonker (Soc.Cult.Planb., Den Haag)
• Uzay Kaymak (Erasmus University, R’dam)
• Nanda Piersma (HES, A’dam)
• Wim Pijls (Erasmus University, R’dam)
• an anonymous charity organization
Decisions in direct mailing
• Target Selection: To which addresses are we going to send the next mailing?
• Frequency: How often are we going to send a mailing to each separate address?
• Inventory Size: How many items of each product should we have in stock?
• etc.
Charity case
• A large Dutch charity organization
• Goal: to stimulate social and scientific research on a common disease
• More than 700 000 supporters
• Annual budget larger than 15M euro
• Multiple mailing campaigns a year, asking for donations
Database
• Information about over 700 000 supporters
• About 675 000 considered for mailings
• Supporter’s donation history is traced after first-ever donation (cumulative database)
• Recorded data (about 0.5 GB)
– mailing dates
– donation amount
– donation time
– administrative data
Target selection
• Problem from (direct) marketing
• Generation of profiles (models) of customers who could be interested in a product
• Models built by analyzing data from similar (previous) campaigns
• Classification problem
– separate positive cases from negative cases and determine their characteristics
Target selection cycle
[Figure: target selection cycle linking customers, conceptualization, product, test campaign, data gathering, model, target selection, and purchase]
Charity donations
• Charity organizations have supporters who donate money for the good cause
• Invite supporters to donate through several mailings per year
• Charity organizations may have different strategies for mailing supporters
• Select those supporters who are likely to donate in a particular mailing
Target selection for supporters
[Figure: target selection cycle for supporters, linking supporters, data gathering (past donation behavior), model, target selection, and more donations]
Target selection models
• Segmentation based, e.g. CHAID
– divide customer base into disjoint segments
– select most promising segments
– segments assumed to be homogeneous
• Scoring based, e.g. logistic regression
– score each customer in the customer base
– select customers with highest scores
– individual approach
Gain chart
[Figure: gain chart. x-axis: fraction selected; y-axis: fraction of responders; curves: ideal, typical, random]
$G_t(20) = \frac{40}{20} = 2.0$
$G_e(20) = \frac{30}{20} = 1.5$
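A gain chart can be computed directly from model scores and observed responses. The following is a minimal sketch, assuming NumPy is available; the function name `gain_chart` and the toy scores are illustrative and not taken from the study.

```python
import numpy as np

def gain_chart(scores, responded):
    """Return (fraction selected, fraction of responders captured).

    Customers are ranked by decreasing score; point i of the curve says
    which share of all responders sits among the top i + 1 customers.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(responded)[order]
    frac_selected = np.arange(1, len(hits) + 1) / len(hits)
    frac_responders = np.cumsum(hits) / hits.sum()
    return frac_selected, frac_responders

# Toy example (hypothetical data): 5 customers, 2 responders.
selected, captured = gain_chart([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 0, 0])
```

A random selection would follow the diagonal of the chart, so the gain at a given fraction selected is the ratio between the captured fraction and the selected fraction.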
Hit probability chart
[Figure: hit probability chart. x-axis: fraction selected; y-axis: response fraction; curves: ideal, typical, random]
Data sources
• External databases: rental list
– maintained by specialized companies
– household-specific information
– demographic information at ZIP code level
• Internal databases: house list
– maintained by the company itself
– traces purchase history of customer
– most reliable and relevant information about the customer
RFM variables
• Recency: How recent was the last purchase? E.g. number of days since last purchase
• Frequency: How frequent are the purchases? E.g. fraction of responded mailings
• Monetary value: How much has the customer spent? E.g. average spending per mailing
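The three RFM variables can be derived from a per-customer mailing history. This is a small sketch using only the standard library; the record layout (one `(mailing_date, amount)` pair per mailing, amount 0 meaning no response) and the function name `rfm` are assumptions made for illustration.

```python
from datetime import date

def rfm(history, today):
    """history: list of (mailing_date, donated_amount), one entry per mailing."""
    responses = [(d, a) for d, a in history if a > 0]
    # Recency: days since the last response (None if the customer never responded).
    recency_days = (today - max(d for d, _ in responses)).days if responses else None
    # Frequency: fraction of mailings that were responded to.
    frequency = len(responses) / len(history)
    # Monetary value: average amount per mailing received.
    monetary = sum(a for _, a in history) / len(history)
    return recency_days, frequency, monetary

# Hypothetical supporter with four mailings in 1999.
history = [
    (date(1999, 1, 10), 10.0),
    (date(1999, 4, 10), 0.0),
    (date(1999, 7, 10), 20.0),
    (date(1999, 10, 10), 0.0),
]
result = rfm(history, today=date(1999, 12, 31))
```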
Feature selection
• RFM variables
– often appropriate to capture specifics of customers
– relatively small number of variables
– not suitable for identifying new or future prospects
• feature selection (and sometimes reduction) still needed to select most relevant variables
Why neural networks?
• Neural networks can hopefully be used for
building good target selection models that
can predict likely charity supporters
successfully
• Performance might be better than
segmentation models like CHAID, and
scoring methods like logistic regression
Feature selection
• R1=Number of weeks since last response
• R2=Number of months since first-ever donation
• F1=Fraction of responded mailings
• F2=Response time for last response
• M1=Average donated amount per mailing
• M2=Last donated amount
• M3=Average donation per year
Data preparation
• Data set selection
– which previous mailing to use for modeling?
– influence of mailing strategy
– select most recent full mailings (1998, 1999)
• Data set size
– about 5000 randomly selected supporters
– independent training and test sets
– training set 1998: 4057 samples
– test set 1998: 4080 samples
– training set 1999: 4111 samples
– test set 1999: 4131 samples
Feedforward neural network
[Figure: feedforward network with input layer, hidden layer, and output layer; activation labels: logistic, linear]
• 7 inputs
• 1 hidden layer
• 4 hidden neurons
• 1 output
• normalized inputs and outputs
• initial weights random in (-0.1, 0.1)
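The architecture above can be sketched as a forward pass in NumPy. This is an illustration only: it assumes that the logistic activation belongs to the hidden layer and the linear activation to the output (the slide only lists the two labels), and the names `init_network` and `forward` are hypothetical. Training is not shown.

```python
import numpy as np

def init_network(n_in=7, n_hidden=4, seed=0):
    """Weights and biases drawn uniformly from (-0.1, 0.1)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.uniform(-0.1, 0.1, size=(n_hidden, n_in)),
        "b1": rng.uniform(-0.1, 0.1, size=n_hidden),
        "W2": rng.uniform(-0.1, 0.1, size=n_hidden),
        "b2": rng.uniform(-0.1, 0.1),
    }

def forward(net, X):
    """X: (n_samples, 7) normalized RFM features; returns one score per sample."""
    # Assumed: logistic hidden units, linear output unit.
    hidden = 1.0 / (1.0 + np.exp(-(X @ net["W1"].T + net["b1"])))
    return hidden @ net["W2"] + net["b2"]

net = init_network()
scores = forward(net, np.zeros((3, 7)))
```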
Results on 1999 data set
[Figure: gain chart on the 1999 data set. x-axis: fraction selected; y-axis: fraction of responders; curves: ideal, nn trained on 1998 data, nn trained on 1999 data, random]
Results on 1999 data set
[Figure: hit probability chart on the 1999 data set. x-axis: fraction selected (0 to 0.5); y-axis: response fraction; curves: nn trained on 1998 data, nn trained on 1999 data]
NN vs. logistic regression
[Figure: gain chart. x-axis: fraction selected; y-axis: fraction responded; curves: ideal, neural network, logistic regression, random]
Training set 1998, test set 1999
NN vs. logistic regression
Training set 1998, test set 1999
[Figure: hit probability chart. x-axis: fraction selected (0 to 0.5); y-axis: response fraction; curves: neural network, logistic regression]
Neural network vs. CHAID
Training set 1998, test set 1998
[Figure: gain chart. x-axis: fraction selected; y-axis: fraction of responders; curves: ideal, neural network, CHAID, random]
Conclusions
• Neural networks can be used to build target selection models successfully
• They outperform segmentation methods like CHAID, but performance is comparable to statistical regression methods
• There is evidence that a neural network model can be used for target selection in multiple mailing campaigns
Why patterns/association rules?
                   Scoring methods   Segmentation methods
Response rate      +                 +/-
Interpretability   -                 +
Question: Is it possible to have + , + ?
Answer: this study! = pattern-based
Patterns and their support
a b c resp
1 1 3 1
1 1 3 1
1 2 1 0
1 2 1 0
1 2 3 1
1 2 3 0
2 1 3 1
2 1 3 1
3 1 2 0
4 2 1 0
4 2 1 1
4 2 3 0
4 2 3 0

pattern               support  #responses
b = 2, c = 3          4        1
c = 2                 1        0
a = 1, b = 1, c = 3   2        2
Definitions
• a pattern is a set of attribute/value combinations
• a record R is a supporter of a pattern P if all attr/val combinations of P match those of R
– Example: (3,1,2) is a supporter of ( b = 1, c = 2 )
• the support of a pattern P is the number of supporters of P
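These definitions can be checked against the example table from the "Patterns and their support" slide. The sketch below uses plain Python; the names `ROWS`, `is_supporter` and `support_and_responses` are illustrative, and patterns are represented as dictionaries from attribute name to value.

```python
# Example table from the slide: columns a, b, c and the response flag.
ROWS = [
    (1, 1, 3, 1), (1, 1, 3, 1), (1, 2, 1, 0), (1, 2, 1, 0),
    (1, 2, 3, 1), (1, 2, 3, 0), (2, 1, 3, 1), (2, 1, 3, 1),
    (3, 1, 2, 0), (4, 2, 1, 0), (4, 2, 1, 1), (4, 2, 3, 0),
    (4, 2, 3, 0),
]
ATTRS = ("a", "b", "c")

def is_supporter(record, pattern):
    """True if every attribute/value pair of the pattern matches the record."""
    values = dict(zip(ATTRS, record))
    return all(values[attr] == val for attr, val in pattern.items())

def support_and_responses(pattern, rows=ROWS):
    """Number of supporters of the pattern and how many of them responded."""
    matching = [row for row in rows if is_supporter(row[:3], pattern)]
    return len(matching), sum(row[3] for row in matching)

print(support_and_responses({"b": 2, "c": 3}))
```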
Frequent patterns
• Given a minimum support minsup, a pattern P is said to be frequent if support( P ) ≥ minsup
• The set of frequent patterns can be represented by a trie
• An algorithm for finding frequent itemsets (like Apriori by Agrawal et al.) can also be used to find frequent patterns
The trie of frequent patterns
Support and response counts
With response rates
Selecting the target group
a b c   mrr (%)
1 1 2   100
3 1 2   80
2 2 2   100
3 1 3   100
1 2 1   50
3 2 3   37.5
4 2 2   25
3 2 1   25
3 1 3   100
4 1 2   80
The first record (1,1,2) matches the following frequent patterns:
( a = 1 ) => resp. rate = 50 %
( b = 1 ) => resp. rate = 80 %
( a = 1, b = 1 ) => resp. rate = 100 % => max (mrr)
Target group:
1 1 2
2 2 2
3 1 3
3 1 3
3 1 2
PatSelect
Input: a set of records
Output: a subset of size n: the target group
1. For all records R in the given set do:
• let F( R ) be the set of all frequent patterns that match R
• let mrr( R ) = max { resp.rate ( P ) | P in F( R ) }
2. Sort all records according to decreasing mrr
3. Select the topmost n records
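The PatSelect steps can be sketched in a few lines of Python. This is not the authors' implementation: frequent patterns are enumerated by brute force rather than stored in a trie or found with Apriori, `minsup = 2` is an assumed value, and the training rows are the example table from the "Patterns and their support" slide. All function names are illustrative.

```python
from itertools import combinations

TRAIN = [  # columns a, b, c, response (example table from the slides)
    (1, 1, 3, 1), (1, 1, 3, 1), (1, 2, 1, 0), (1, 2, 1, 0),
    (1, 2, 3, 1), (1, 2, 3, 0), (2, 1, 3, 1), (2, 1, 3, 1),
    (3, 1, 2, 0), (4, 2, 1, 0), (4, 2, 1, 1), (4, 2, 3, 0),
    (4, 2, 3, 0),
]
ATTRS = ("a", "b", "c")

def frequent_response_rates(train, minsup):
    """Map each frequent pattern to its response rate (brute force, no trie)."""
    counts = {}
    for *values, resp in train:
        items = list(zip(ATTRS, values))
        for size in range(1, len(items) + 1):
            for pattern in combinations(items, size):
                support, responses = counts.get(pattern, (0, 0))
                counts[pattern] = (support + 1, responses + resp)
    return {p: r / s for p, (s, r) in counts.items() if s >= minsup}

def mrr(record, rates):
    """Maximum response rate over all frequent patterns matching the record."""
    items = list(zip(ATTRS, record))
    matching = [rates[p] for size in range(1, len(items) + 1)
                for p in combinations(items, size) if p in rates]
    return max(matching, default=0.0)

def pat_select(records, rates, n):
    """Sort by decreasing mrr (stable for ties) and keep the top n records."""
    return sorted(records, key=lambda rec: mrr(rec, rates), reverse=True)[:n]

rates = frequent_response_rates(TRAIN, minsup=2)
candidates = [(1, 1, 2), (3, 1, 2), (2, 2, 2), (3, 1, 3), (1, 2, 1),
              (3, 2, 3), (4, 2, 2), (3, 2, 1), (3, 1, 3), (4, 1, 2)]
target_group = pat_select(candidates, rates, n=5)
```

Because every selected record carries the pattern that produced its mrr, the selection can be explained per record, which is the interpretability argument made for this method.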
Fund raising application
• Dutch charity organization
• more than 700 000 supporters
• 26 mailing campaigns (dates, targets, responses)
• spread over six years (‘94 - ‘99)
• database of over 400 MB
Research questions
1) How to select a target group with as high a response rate as possible, on the basis of history data
2) How to select a target group with as high a total amount donated as possible, again on the basis of history data
This study: question 1.
RFM features
R1: # weeks since last response
R2: # months since first donation
F1: fraction of mailings supporter has responded to
F2: median response time of supporter
M1: etc.
Model construction
• Choose only full mailing campaigns 98/99
• random split:
– training set 50 %
– test set 50 %
• resulting datasets:
– tr98, tr99
– test98, test99
– each somewhat less than 200 000 cases!!
Results ’99, trained on ’98 data
Results ’99, trained on ’99 data
Comparison
• Neither a pure scoring, nor a pure segmentation method
• not segments, since patterns can be overlapping!
• many patterns => many different scores => performance comparable with scoring methods
• but also:
Interpretability
high, since each supporter’s presence in the
target group can be explained by its inclusion
in a pattern with high response rate!!!
Conclusions
• New method based on patterns and association rule algorithms with following characteristics:
– response rate high
– interpretability high
• interesting method, especially for large databases
Why fuzzy?
Advantages of fuzzy target selection
models in marketing
• prediction power larger than conventional
statistical models
• large degree of transparency due to the
linguistic rules that can be derived from
data
Fuzzy target selection
• FCM clustering in feature product space
• Average response rate per cluster: $\bar{r}_i = \frac{\sum_{k=1}^{N} u_{ik} r_k}{\sum_{k=1}^{N} u_{ik}}, \quad r_k \in \{0, 1\}$
• Score per customer: $s_k = \frac{\sum_{i=1}^{C} u_{ik} \bar{r}_i}{\sum_{i=1}^{C} u_{ik}}$
• Customer segmentation: $u^*_{ik} = 1$ if $u_{ik} = \max_{1 \le j \le C} u_{jk}$, and $u^*_{ik} = 0$ otherwise
• Rule derivation
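Given a fuzzy partition matrix, the cluster response rates, customer scores and hard segmentation can be sketched as below. The code assumes the membership-weighted averages described on this slide, with U stored as a C × N NumPy array; the function names and the toy numbers are illustrative.

```python
import numpy as np

def cluster_response_rates(U, r):
    """U: (C, N) memberships, r: (N,) responses in {0, 1}."""
    return (U @ r) / U.sum(axis=1)

def customer_scores(U, rates):
    """Membership-weighted average of the cluster response rates."""
    return (rates @ U) / U.sum(axis=0)

def harden(U):
    """Assign each customer to the cluster with the highest membership."""
    hard = np.zeros_like(U)
    hard[U.argmax(axis=0), np.arange(U.shape[1])] = 1.0
    return hard

# Toy example: 2 clusters, 2 customers; only the first customer responded.
U = np.array([[0.8, 0.3],
              [0.2, 0.7]])
r = np.array([1.0, 0.0])
rates = cluster_response_rates(U, r)
scores = customer_scores(U, rates)
```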
Fuzzy clustering
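Partition data into overlapping sets based on similarity amongst patterns
Given the data: $\mathbf{x}_k = [x_{1k}, x_{2k}, \ldots, x_{nk}]^T \in \mathbb{R}^n, \quad k = 1, \ldots, N$
Find the fuzzy partition matrix: $U = \begin{bmatrix} u_{11} & \cdots & u_{1N} \\ \vdots & \ddots & \vdots \\ u_{C1} & \cdots & u_{CN} \end{bmatrix}$
and the cluster centres: $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_C\}, \quad \mathbf{v}_i \in \mathbb{R}^n$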
Fuzzy clustering
Minimize objective function
$J(X, U, V) = \sum_{i=1}^{C} \sum_{k=1}^{N} u_{ik}^m \, d^2(\mathbf{x}_k, \mathbf{v}_i)$
subject to
$0 \le u_{ik} \le 1, \quad i = 1, \ldots, C, \; k = 1, \ldots, N$ (membership degree)
$\sum_{i=1}^{C} u_{ik} = 1, \quad k = 1, \ldots, N$ (total membership)
$0 < \sum_{k=1}^{N} u_{ik} < N, \quad i = 1, \ldots, C$ (no cluster empty)
$m \in (1, \infty)$ is the fuzziness parameter
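The objective above is usually minimized by alternating two closed-form updates, one for the centres and one for the memberships. The sketch below shows this standard fuzzy c-means iteration with Euclidean distance in NumPy; whether the study used exactly this variant is an assumption, and the function name, defaults and toy data are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """X: (N, n) data. Returns memberships U (C, N) and centres V (C, n)."""
    rng = np.random.default_rng(seed)
    U = rng.random((C, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)          # total membership = 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)              # centre update
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # squared distances (C, N)
        inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)              # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

# Two well-separated toy groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
U, V = fuzzy_c_means(X, C=2)
```

Larger values of m make the partition fuzzier, while values close to 1 approach a hard k-means style assignment.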
Feature selection
• R1=Number of weeks since last response
• R2=Number of months since first-ever donation
• F1=Fraction of responded mailings
• F2=Response time for last response (median)
• M1=Average donated amount per mailing
• M2=Last donated amount
• M3=Average donation per year
Feature reduction
• Use logistic regression to build a target selection model
• Use only features whose corresponding weights deviate significantly from zero
• Selected features
– Number of weeks since last response (TIMELR)
– Number of months since first-ever donation (TIMECL)
– Fraction of responded mailings (FRQRES)
Feature reduction
[Figure: hit probability chart on training data. x-axis: fraction selected; y-axis: response fraction; curves: 7 variables, 3 variables]
Fuzzy scoring model
[Figure: hit probability chart on evaluation data, 40 clusters. x-axis: fraction selected; y-axis: response fraction; curves: fuzzy clustering, logistic regression]
Fuzzy segmentation model
[Figure: hit probability chart on evaluation data, 40 segments. x-axis: fraction selected; y-axis: response fraction; curves: classification, logistic regression]
Linguistic rules
[Figure: membership functions for timelr, timecl, and frqres; y-axis: membership; domains are normalized. Linguistic terms for timelr and timecl: Very short, Short, Long, Very long; for frqres: Rare, Infrequent, Frequent, Often]
• If TIMELR is very short and TIMECL is not very short and FRQRES is often then response rate is 0.81
• If TIMELR is short and TIMECL is very long and FRQRES is often then response rate is 0.75
• If TIMELR is very short and TIMECL is very short and FRQRES is often then response rate is 0.65
• If TIMELR is short and TIMECL is short or long and FRQRES is often then response rate is 0.60
Conclusions
• Fuzzy target selection with RFM features
– Transparent models for target selection with good prediction power
– Product space fuzzy clustering
– Accuracy surpasses statistical models
– Transparency by linguistic rules
• Future research
– estimation of uncertainty bounds of the model
– modeling of donation amounts
Frequency problem
• How many mails should I send to this client this year?
• Model as a Markov Decision Process (MDP)
• Theory is based on Markov Chains
• Start with small introduction to MC’s
Markov Chain
[Figure: a system with states 1, 2, 3 evolving over stages t0, t1, t2, t3, …]
• System with m states (here m = 3)
• from one stage to the next, the state jumps from state j to state k with probability p(j,k)
• p(j,k) is called transition probability
Transition matrix
$P = \begin{pmatrix} p(1,1) & \cdots & p(1,m) \\ \vdots & \ddots & \vdots \\ p(m,1) & \cdots & p(m,m) \end{pmatrix}$
p(j,k) = Pr{ end up in k after 1 step | start in j }
$p^{(2)}(j,k)$ = Pr{ end up in k after 2 steps | start in j } $= \sum_{i=1}^{m} p(j,i)\, p(i,k)$
Stationary distribution
• $PP = P^2$ = transition matrix after two steps
• $P^n$ = transition matrix after n steps
• if $n \to \infty$, $P^n \to Q$ with
$Q = \begin{pmatrix} q_1 & \cdots & q_m \\ q_1 & \cdots & q_m \\ \vdots & & \vdots \\ q_1 & \cdots & q_m \end{pmatrix}$
• $\{q_1, \ldots, q_m\}$ is the stationary distribution
• property: $QQ = Q$
Example
$P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{pmatrix}$
$Q = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$
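The convergence of the matrix powers in this example can be verified numerically. A short NumPy sketch, where the exponent 50 is an arbitrary choice that is large enough for this chain:

```python
import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

P2 = P @ P                            # two-step transition probabilities
Pn = np.linalg.matrix_power(P, 50)    # approximates the limit matrix Q
Q = np.full((3, 3), 1.0 / 3.0)

print(P2)
print(np.round(Pn, 6))
```

Every row of the limit equals the stationary distribution, so the state the chain started in no longer matters after many steps.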
Markov Decision Process
[Figure: a system with 3 states and an agent with 2 actions, interacting over stages t0, t1, t2, t3, … with rewards r0, r1, r2, r3, …]
S = { states }
A = { actions }
S and A finite
Transition, reward matrix, policy
• transition matrix: p(j, k, a) = probability of ending in state k after action a has been performed on state j
• matrix has m × m × |A| elements
• p: S × S × A → [0,1]
• reward matrix: r(s,a) = the payoff the agent gets by taking action a in state s
• r: S × A → ℝ
• policy: π(s) = the action that should be taken when the system is in state s
• π: S → A
Optimal policy
$\pi^* = \arg\max_{\pi} E(r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots)$
γ = discount factor, denotes how heavily future rewards should weigh
0 < γ < 1
Exactly:
$r_0 = r(s, \pi(s))$
$r_1 = \sum_{i=1}^{m} r(i, \pi(i)) \cdot p(s, i, \pi(s))$
etc.
The frequency problem
• Stages: the planning periods (years)
• Actions: A = {0,1,2,3,4} the number of mailings to be sent
• System: client
• Agent: firm
• States: the RFM-profile of the client
States
• g = gift size, in categories: 0,1,…,5
• m = number of mailings: 0,1,…,4
• r = number of responses: 0,1,…,4
• Example: s = (3,4,1) means the client received 4 mailings, responded to 1, with a fair-sized gift
• Number of states is 55 (not 6 x 5 x 5 since not all combinations are possible)
Transition matrix
• Contains 55 x 55 x 5 = 15125 elements
• A transition probability must be zero whenever the number of mailings (m) in the destination state differs from the action (a)
• All other probabilities must be estimated from historical data!
Reward matrix
• r(s,a) = the expected total amount donated in this period, by a client in state s, given the number of mails received (a), minus the cost of a mailings
• Also to be estimated from historical data
Scenarios + demo
• Optimal policy calculation: linear programming, value iteration, policy iteration
• Scenarios: give extra weight to some states
• Demonstration of the prototype decision support system we developed with the charity organization
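Of the three calculation methods listed, value iteration is the shortest to write down. The sketch below runs it on a hypothetical 2-state, 2-action MDP, not on the 55-state mailing problem; `p[j, k, a]` and `r[s, a]` follow the index conventions of the transition and reward matrices, and the function name is illustrative.

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-10, max_iter=10_000):
    """p[j, k, a]: transition probabilities, r[s, a]: rewards, 0 < gamma < 1."""
    V = np.zeros(p.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = r[s, a] + gamma * sum_k p[s, k, a] * V[k]
        Q = r + gamma * np.einsum("ska,k->sa", p, V)
        V_new = Q.max(axis=1)
        done = np.abs(V_new - V).max() < tol
        V = V_new
        if done:
            break
    Q = r + gamma * np.einsum("ska,k->sa", p, V)
    return V, Q.argmax(axis=1)

# Hypothetical toy MDP: action 0 keeps the state, action 1 switches it.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[0, 1, 1] = p[1, 1, 0] = p[1, 0, 1] = 1.0
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
V, policy = value_iteration(p, r, gamma=0.9)
```

In the toy problem the best policy is to move to state 1 and stay there, since staying in state 1 pays 2 per stage.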
No historic data available?
• Transition and reward matrix cannot be estimated!
• Learn the optimal policy by reinforcement learning (without knowing P and R)
• Will be our next project together with Uzay Kaymak and Michiel van Wezel
Part 2: Brand Choice using Ensemble Methods
Rob Potharst, Michiel van Rijthoven and Michiel van Wezel
Erasmus University Rotterdam
Econometric Institute
76 Datamining for ICT & Economics, part 2: Brand Choice
Outline
• Ensemble methods: bagging, boosting, stacking, etc…
– What? Why? How?
• Brand choice
– classical statistical models
– neural network models
– ensemble methods
References
On ensemble methods:
• Hastie, Tibshirani & Friedman: The Elements of Statistical Learning, Springer-Verlag, 2001
On the brand choice application:
• R. Potharst, M. van Rijthoven and M. van Wezel, "Modeling Brand Choice using Boosted and Stacked Neural Networks", In: Kevin E. Voges et al. (Ed.), Business Applications and Computational Intelligence, to be published in 2005 by Idea Group Inc.
• M. van Wezel and R. Potharst, "Brand Choice, Bagging, Boosting, Bias and Variance". Technical Report, sept '04, submitted for publication.
• Vroomen, B., Franses, P.H. & van Nierop, E., (2004). Modeling consideration sets and brand choice using artificial neural networks. European Journal of Operational Research, 154, 206-217.
Ensemble? What does it mean?
• an ensemble is a group, instead of an individual
• two know more than one, and more than two may know even more
• examples:
– from real life: voting, committee, weathermen
– from computational intelligence: a set of neural network / decision tree models
How can you combine models?
Depends on your error loss function, but in general:
• for classification problems: voting!
• for regression problems: averaging!
Example
Does it really work?
• Hm.. it seems a bit simple…
• How could we check that it possibly works?
• Try it on a simple problem
• Credit scoring dataset from internet (UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html)
• Want a loan? Are you creditworthy: will you pay back, yes/no?
Credit scoring example
[Figure: % correct plotted against size of ensemble]
What is a (base) classifier?
something that takes an input vector and assigns an (estimated) class label to it:
What kind of classifiers?
We will use classifiers that learn to do their job from a training set of examples (or instances or patterns):
$T = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$
How to train such a classifier is the subject of Machine Learning, a field very close to Computational Intelligence.
Classifiers: neural networks
Universal Approximation
• Neural networks can implement arbitrarily shaped boundaries between classes.
• This is called the universal approximation theorem for feedforward neural networks with 1 hidden layer (among others, Barron, 1993)
• By adding hidden nodes (and training examples!) one can get as close to the real boundaries as one wishes.
Decision boundaries of neural network
Decision boundaries of traditional model (multinomial logit)
Classifiers: decision trees
from: Mitchell, Machine learning, 1997
Decision tree: example
Problem 1: overfitting
Problem 2: Instability
• That is: the model is very dependent on the specific training set you have; take one out and…
Instability of a model
• Let us study the instability of a model in the setting of a regression problem:
• 5 training sets of 200 examples each
• 5 linear models (blue)
• 5 neural networks (green)
• dotted lines: the average models!
[Equation: y given by a sine-based function of x plus noise]
Prediction error
• If we use a squared error loss function the prediction error is:
$PE(f_T) = E_{\mathbf{x},y}\,(f_T(\mathbf{x}) - y)^2$
• Define the average model of a classifier as
$f_A(\mathbf{x}) = E_T\, f_T(\mathbf{x})$
• And let f* be the “real” underlying model ( = E(y|x))
Bias-variance decomposition
• Then we can derive the following formula:
$PE(f_T) = \sigma^2_{noise} + E_{\mathbf{x}}\,(f^*(\mathbf{x}) - f_A(\mathbf{x}))^2 + E_{T,\mathbf{x}}\,(f_T(\mathbf{x}) - f_A(\mathbf{x}))^2$
• Or: Prediction error = irreducible error + bias² + variance
• So: if we try to approximate the average model, we get variance 0. This is the idea behind ensemble methods!
Bagging
• abbreviation for bootstrap aggregating
• Breiman (1994)
• create "bootstrapped" datasets by randomly drawing from the original dataset, with replacement!
• the bootstrapped datasets have the same size as the original dataset
• build a model for each bootstrapped dataset
• combine these models by averaging or voting
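The bagging recipe can be sketched for a regression problem as follows. The base model here is a polynomial fit with `np.polyfit`, chosen only to keep the example short (the lecture uses neural networks and decision trees as base models); the function names and the toy data are illustrative.

```python
import numpy as np

def bagged_fit(x, y, n_models=25, degree=3, seed=0):
    """Fit one polynomial per bootstrapped dataset."""
    rng = np.random.default_rng(seed)
    n = len(x)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # draw n examples with replacement
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

def bagged_predict(models, x_new):
    """Combine the models by averaging their predictions."""
    preds = np.array([np.polyval(coeffs, x_new) for coeffs in models])
    return preds.mean(axis=0)

# Noise-free toy data, so every bootstrapped model recovers the same line.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0
models = bagged_fit(x, y, degree=1)
pred = bagged_predict(models, np.array([0.0, 0.5, 1.0]))
```

For classification the averaging step would be replaced by a majority vote over the predicted labels.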
Boosting
• Adaboost (Freund & Schapire, 1996)
• Each example in the training set has a weight attached to it; initial weights w1 = 1/N
• Generate a sequence of models:
• build model M1 on the training set with w1
• examples misclassified by M1 get a higher weight
• etc.
Adaboost (= adaptive boosting) for 2 classes: -1, +1
1. Initialize the boosting weights: $w_i = \frac{1}{N}$ for i = 1,…, N
2. For m = 1 to M perform each of the following:
a) Train model $F_m(x)$ on T with weights $w_i$
b) Compute $err_m = \sum_{i=1}^{N} w_i \, I(y_i \ne F_m(x_i))$
c) Compute $\alpha_m = \log\left(\frac{1 - err_m}{err_m}\right)$
d) Redefine the weights: $w_i \leftarrow w_i \exp(\alpha_m \, I(y_i \ne F_m(x_i)))$
e) Normalize the weights: $w_i \leftarrow \frac{w_i}{\sum_{k=1}^{N} w_k}$
3. Output the final combined model: $O(x) = \mathrm{sgn}\left(\sum_{m=1}^{M} \alpha_m F_m(x)\right)$
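The Adaboost steps can be sketched with decision stumps as base models. The stump learner and the toy data are assumptions made for illustration (the brand choice study boosts neural networks); the error is clipped away from 0 and 1 only to keep the logarithm finite.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: returns (feature, threshold, sign)."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, M):
    w = np.full(len(y), 1.0 / len(y))                  # step 1
    models, alphas = [], []
    for _ in range(M):                                 # step 2
        stump = fit_stump(X, y, w)                     # a) train on weighted data
        miss = stump_predict(stump, X) != y
        err = np.clip(w[miss].sum(), 1e-12, 1 - 1e-12) # b) weighted error
        alpha = np.log((1 - err) / err)                # c)
        w = w * np.exp(alpha * miss)                   # d) upweight mistakes
        w /= w.sum()                                   # e) normalize
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    score = sum(a * stump_predict(s, X) for s, a in zip(models, alphas))
    return np.sign(score)                              # step 3

# Toy problem that no single stump can solve: +1 only in the middle interval.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, 1, -1, -1])
models, alphas = adaboost(X, y, M=3)
```

On this toy problem each stump alone misclassifies at least two examples, while the weighted vote of three stumps separates the classes.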
Stacking
• two levels of learning:
– first level: train several models on training set
– second level: again train the combination of these models
• not a fixed voting scheme for combining the models, but: learn an optimal combination method from the data
Brand Choice
• classical topic in marketing
• a product has k brands
• consumer/household wants to buy product
• which brand does he pick?
• given:
– household characteristics (income, etc)
– product factors (price, etc)
– situational factors (product on display, etc)
Modeling brand choice
• classical statistical models (multinomial logit, conditional logit, etc): linear
• neural network models: nonlinear
• a model by Vroomen et al., 2004: neural networks with built-in so-called consideration sets
the Vroomen model
• as many hidden nodes as there are brands
• three types of variables:
– X: household characteristics, e.g. size, income
– Z: brand characteristics, e.g. price-level, promotion, advertising
– W: choice-specific characteristics, e.g. observed price at purchase occasion
the Vroomen model
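the Vroomen model
[Equations: the consideration-set output $CS_j$ of hidden node j is the logistic function F applied to a constant plus a linear combination of the household characteristics $X_i$ (i = 1, …, I) and the brand characteristics $Z_{pj}$ (p = 1, …, P); the final choice $FC_j$ is a ratio of exponentials, normalized over the brands m = 1, …, J, of a constant plus a linear combination of the $CS_k$ (k = 1, …, J) and the choice-specific characteristics $W_{qj}$ (q = 1, …, Q)]
$F(x) = \frac{1}{1 + e^{-x}}$ = logistic function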
Dataset
• scanner data: 3055 purchases of liquid detergent of 6 brands (part of ERIM database)
• 400 households
• 4 X variables (volume, non-det, size, time)
• 4 Z variables (price, feature, display, recency)
• 1 W variable (again price)
Experiments
1. split 400 households randomly into three groups (tr 200, va 100, te 100)
2. use backprop for Vroomen, test on va + te
3. use 25 iterations of the boosting algorithm, test combined model on va + te
4. use stacking on the 25 models, use va to find coefficients, test on te
Repeat this whole cycle 10 times!
Results
3 to 4 % gain in predictive performance
Conclusions
• By using ensemble methods we can increase the predictive performance
• on the other hand: because we get combined models they are harder to interpret
• future work: interpreting the combined model!
Part 3: On the Use of Ensemble Techniques for Modeling Choice Problems in Marketing, especially Churn
by Aurélie Lemmens, Rob Potharst, Michiel van Wezel (Erasmus University Rotterdam)
and Christophe Croux (Catholic University Leuven)
Datamining for ICT & Economics, part 3: Churn
Characteristics of Ensemble Techniques
• Developed in statistics / datamining / machine learning communities
• Not yet applied to marketing problems (as far as we know)
• High potential for choice problems such as brand choice and churn
• Successfully applied to other fields like fraud detection, text categorization, chemometrics
• Especially successful with respect to predictive power, which can be directly translated into money
• Easy to apply
How do Ensemble methods work?
1. Develop a number of so-called base models for a problem
Could be any model: dt, nn, logit, …
2. Combine these base models into a final choice model
Combination can be done with: voting, weighted voting, …
Existing Ensemble methods
• Bagging (= Bootstrap Aggregating): Breiman, 1996
• Boosting:
– Adaboost: Freund & Schapire, 1996
– Stochastic gradient boosting: Friedman, 2002
• Stacking: Wolpert, 1992
Based on 4 recent papers
[1] “Bagging and Boosting Classification Trees to Predict Churn”, to appear in Journal of Marketing Research, 2006 (by Lemmens and Croux)
[2] “Bagging a Stacked Classifier”, appeared in 2005 (by Croux, Joossens and Lemmens)
[3] “Modeling Brand Choice using Boosted and Stacked Neural Networks”, appeared in 2006 (by Potharst, van Rijthoven and van Wezel)
[4] “Improved Customer Choice Predictions using Ensemble Methods”, submitted to European J Oper Res (by van Wezel and Potharst)
Ensemble techniques used
paper  bagging  boosting  stacking
[1]    X        X
[2]    X                  X
[3]             X         X
[4]    X        X
Base learners used
paper DT NN LDA LR
[1] X
[2] X X X X
[3] X
[4] X
Marketing problems considered
paper  churn  brand choice
[1]    X
[2]
[3]           X
[4]           X
Data sets
paper Company / sector
[1] US wireless telecom company
[2] 12 benchmark datasets from machine learning
[3] Scanner data for six brands of liquid detergent
[4] Scanner data for ketchup / peanut butter brands
Based on 4 recent papers
In depth: [1] “Bagging and Boosting Classification Trees to Predict Churn” (by Lemmens and Croux)
The Context
– The 2002 Churn Tournament, organised by the Teradata Center for CRM at Duke University
– Churn means defecting from a company, i.e. a customer taking their business elsewhere
– Customer database from an anonymous U.S. wireless telecom company
– Challenge: predicting churn in order to design targeted retention strategies (Bolton et al. 2000, Ganesh et al. 2000, Shaffer and Zhang 2002)
– Details can be found in Neslin et al. (2004)
The Context (cont’d)
– The US Wireless Telecom market (2004)
• 182.1 million subscribers
• Leader in market share: Cingular Wireless
– 26.9% of total market volume
– turnover US$19.4 billion / net income US$201 million
• Other major players: AT&T, Verizon, Sprint and Nextel
• Mergers & Acquisitions: Cingular with AT&T Wireless & Sprint with Nextel
The Context (cont’d)
– Churn
• High churn rates: 2.6% a month
• Causes: increased competition, lack of differentiation, market saturation
• Cost: $300 to $700 to replace a lost customer, in terms of sales support, marketing, advertising, etc.
• Targeted retention strategies
Formulation of the Churn Problem
• Churn as a classification issue:
Classify a customer i characterized by K variables
$x_i = (x_{i1}, x_{i2}, \ldots, x_{iK})$ as
– Churner: $y_i = +1$
– Non-churner: $y_i = -1$
• Churn is the binary response variable to predict: $y_i = f(x_i)$
• Which binary choice model $f(\cdot)$ should be chosen?
Classification Models in Marketing
• Simple binary logit choice model (e.g. Andrews et al. 2002)
• Models allowing for heterogeneity in consumers’ response:
– Finite mixture model (e.g. Wedel and Kamakura 2000)
– Hierarchical Bayes model (e.g. Yang and Allenby 2003)
• Non-parametric choice models:
– Decision trees, neural nets (e.g. Thieme et al. 2000; West et al. 1997)
– Bagging (Breiman 1996), Boosting (Freund and Schapire 1996), Stochastic gradient boosting (Friedman 2002)
Classification Models in Marketing (cont’d)
• Bagging, boosting and stochastic gradient boosting: mostly ignored in the marketing literature
• S.G.B. won the Tournament (Cardell, from Salford Systems)
Decision Trees for Churn
Example:
[Figure: decision tree for churn. Split variables: change in consumption (< 0.5 / ≥ 0.5), customer care calls (< 3 / ≥ 3), age (< 26 / ≥ 26 and < 55 / ≥ 55), handset price (< $150 / ≥ $150). Leaves: churn Yes / No.]
Bagging and Boosting
• Machine Learning Algorithms
• Principle: classifier aggregation (Breiman, 1996)
• Tree-based method (e.g. Currim et al. 1988)
• Bagging: Bootstrap AGGregatING
Calibration sample $Z = \{(x_i, y_i)\}$, $i = 1, \ldots, N$
Random sample $Z_1^*$ → $\hat{f}_1^*(x)$ (e.g. tree)
Random sample $Z_2^*$ → $\hat{f}_2^*(x)$
Aggregating bootstrap samples
$\hat{f}_1^*(x), \hat{f}_2^*(x), \hat{f}_3^*(x), \ldots, \hat{f}_B^*(x)$
Churn propensity score:
$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$
Churn classification:
$\hat{c}_{bag}(x) = \mathrm{sign}(\hat{f}_{bag}(x))$
Bagging
• Let the calibration sample be $Z = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$
• B bootstrap samples $Z_b^*$, $b = 1, 2, \ldots, B$
• From each $Z_b^*$, a base classifier (e.g. tree) is estimated, giving B score functions: $\hat{f}_1^*(x), \ldots, \hat{f}_b^*(x), \ldots, \hat{f}_B^*(x)$
• The final classifier is obtained by averaging the scores: $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$
• The classification rule is carried out via $\hat{c}_{bag}(x) = \mathrm{sign}(\hat{f}_{bag}(x))$
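The bagging steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the implementation used in the paper: the base classifier is a one-variable threshold rule (a depth-1 "tree"), the data are invented, and the function names `fit_stump` and `bagging` are hypothetical.

```python
import random

def fit_stump(sample):
    """Fit a one-variable threshold rule to (x, y) pairs with y in {+1, -1}."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for sign in (1, -1):
            errors = sum(1 for x, y in sample
                         if (sign if x >= threshold else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    _, threshold, sign = best
    return lambda x: sign if x >= threshold else -sign

def bagging(Z, B, rng):
    """Estimate B base classifiers on bootstrap samples and average their scores."""
    models = []
    for _ in range(B):
        Z_star = [rng.choice(Z) for _ in Z]     # N draws with replacement from Z
        models.append(fit_stump(Z_star))
    f_bag = lambda x: sum(f(x) for f in models) / B    # churn propensity score
    c_bag = lambda x: 1 if f_bag(x) >= 0 else -1       # sign rule (0 counted as churner)
    return f_bag, c_bag

# Toy calibration sample: x = number of customer care calls, y = +1 (churner) or -1.
Z = [(0, -1), (0, -1), (1, -1), (1, -1), (2, -1),
     (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
f_bag, c_bag = bagging(Z, B=25, rng=random.Random(0))
print(f_bag(6), c_bag(6))
print(f_bag(0), c_bag(0))
```

Because each base classifier returns +1 or -1, the averaged score lies in [-1, 1] and can be read as a churn propensity.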
Stochastic Gradient Boosting
• Winner of the Teradata Churn Modeling Tournament (Cardell, Golovnya and Steinberg, Salford Systems).
• Data adaptively resampled:
– Previously misclassified observations: weights increased
– Previously well-classified observations: weights decreased
Data
• Calibration sample: balanced sample, N = 51,306, equal proportion of churners = 50%
• Validation hold-out sample: proportional sample, N = 100,462, real-life proportion of churners = 1.8%
• Each customer i is described by $X_i = (x_1, \ldots, x_{46})$ and a label $y_i = +1$ or $y_i = -1$
• Three groups of predictors:
– Behavioral predictors, e.g. the average monthly minutes of use
– Company interaction variables, e.g. mean unrounded minutes of customer care calls
– Customer demographics, e.g. the number of adults in the household
Research Questions
• Do bagging (and boosting) provide better results
than other benchmarks?
– What are the financial gains to be expected from this improvement?
– What are the more relevant churn drivers or triggers that marketers
could watch for?
• How to correct estimated scores obtained from a
balanced calibration sample, when predicting rare
events like churn?
Comparing Error Rates…
Model*                          Validated Error Rate**
Binary Logit Model              0.400
Bagging (tree-based)            0.374
Stochastic Gradient Boosting    0.460
* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
Bias due to Balanced Sampling
• Overestimation of the number of churners
• Several bias correction methods exist (see e.g. Cosslett
1993; Donkers et al. 2003; Franses and Paap 2001, p.73-75; Imbens and
Lancaster 1996; King and Zeng 2001a,b; Scott and Wild 1997).
• However, most are dedicated to traditional models (e.g.
logit). We discuss two corrections for bagging and boosting.
The Bias Correction Methods
• The weighting correction:
Based on marketers’ prior beliefs about the churn rate, i.e. the proportion of churners among their customers, we attach weights to the observations of the balanced calibration sample.
• The intercept correction:
Take a non-zero cut-off value $\tau_B$ such that the proportion of predicted churners in the calibration sample equals the actual a priori proportion of churners.
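A minimal sketch of the intercept correction, assuming the scores of the calibration sample are available as a plain list. The scores and the 20% prior churn rate below are invented for illustration, and the helper names `intercept_cutoff` and `classify` are hypothetical. Tied scores are not treated specially, so with ties the predicted proportion may fall slightly short of the target.

```python
def intercept_cutoff(scores, prior_churn_rate):
    """Cut-off tau such that about prior_churn_rate of the scores lie strictly above it."""
    ranked = sorted(scores, reverse=True)
    n_churners = round(prior_churn_rate * len(ranked))   # number of predicted churners wanted
    if n_churners >= len(ranked):
        return float("-inf")
    return ranked[n_churners]      # highest score that is NOT classified as a churner

def classify(score, tau):
    """Corrected rule: churner (+1) when the score exceeds tau, else non-churner (-1)."""
    return 1 if score > tau else -1

# Hypothetical scores on a balanced calibration sample of 10 customers.
scores = [0.9, 0.7, 0.6, 0.4, 0.2, -0.1, -0.3, -0.5, -0.8, -0.9]
tau = intercept_cutoff(scores, prior_churn_rate=0.20)    # assumed prior churn rate
labels = [classify(s, tau) for s in scores]
print(tau, labels.count(1))    # 0.6 2
```

With the uncorrected cut-off of zero, five of these ten customers would be labelled churners; the shifted cut-off brings that down to the assumed prior proportion.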
Bagging
• Let the calibration sample be $Z = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$
• B bootstrap samples $Z_b^*$, $b = 1, 2, \ldots, B$
• From each $Z_b^*$, a base classifier (e.g. tree) is estimated, giving B score functions: $\hat{f}_1^*(x), \ldots, \hat{f}_b^*(x), \ldots, \hat{f}_B^*(x)$
• The final classifier is obtained by averaging the scores: $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$
• The classification rule is carried out via $\hat{c}_{bag}(x) = \mathrm{sign}(\hat{f}_{bag}(x) - \tau_B)$
Assessing the Best Bias Correction…
Validated Error Rates** by bias correction
Model*                  No correction   Intercept   Weighting
Binary logit model      0.400           0.035       0.018
Bagging (tree-based)    0.374           0.034       0.025
S.G. boosting           0.460           0.034       0.018
* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
The Top-Decile Lift
• Focuses on the most critical group of customers regarding their churn risk: the ideal segment for targeting a retention marketing campaign
• The top 10% riskiest customers (ranked by risk to churn)
– With $\hat{\pi}_{10\%}$ = the proportion of churners in this risky segment
– And $\hat{\pi}$ = the proportion of churners in the whole validation set
$\text{Top-decile lift} = \hat{\pi}_{10\%} / \hat{\pi}$
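As a small worked example, the top-decile lift can be computed from a list of predicted scores and observed churn indicators. The data are invented, the function name `top_decile_lift` is hypothetical, and the sketch assumes at least one churner in the validation set.

```python
def top_decile_lift(scores, churned):
    """Churn rate among the 10% highest-scored customers divided by the overall churn rate."""
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)                       # size of the top decile
    top_rate = sum(c for _, c in ranked[:n_top]) / n_top    # churn rate in the risky segment
    overall_rate = sum(churned) / len(churned)              # churn rate in the whole set
    return top_rate / overall_rate

# Hypothetical validation set of 20 customers, 4 of whom churned (1 = churner).
scores  = [0.95, 0.90, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45,
           0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02, 0.01]
churned = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
           0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(top_decile_lift(scores, churned))    # 2.5
```

A lift of 1 corresponds to a random selection; here the two highest-scored customers contain one churner (50%) against an overall rate of 20%.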
Financial Gains: Neslin et al. (2004)
– N : customer base of the company
– α : percentage of targeted customers (here, 10%)
– ΔTop decile : increase in top-decile lift
– γ : success rate of the incentive among the churners
– LVC : lifetime value of a customer (Gupta, Lehmann and Stuart 2004)
– δ : incentive cost per customer
– ψ : success rate of the incentive among the non-churners.
$\text{Gain} = N \, \alpha \, \{\gamma \, LVC + \delta(\psi - \gamma)\} \, \hat{\pi} \, \Delta\text{Top decile}$
Top-Decile Lift with Intercept Correction
[Figure: top-decile lift* (1.6 to 2.6) against the number of iterations (0 to 100) for Bagging, Stochastic Gradient Boosting and the Binary Logit Model; annotation: +26%]
* Model estimated on the balanced sample, and lift computed on the validation sample.
Validated** Top-Decile Lift
Model*                          No / Intercept correction   Weighting correction
Binary logit model              1.775                       1.764
Bagging (tree-based)            2.246                       1.549
Stochastic gradient boosting    2.290                       1.632
* Model estimated on the balanced calibration sample
** Lift computed on the hold-out proportional validation sample
Financial Gains
If we consider
– N : customer base of 5,000,000 customers
– α : 10% of targeted customers
– γ : 30% success rate of the incentive among the churners
– LVC : $2,500 lifetime value of a customer
– δ : $50 incentive cost per customer
– ψ : 50% success rate of the incentive among the non-churners
$\text{Gain} = N \, \alpha \, \{\gamma \, LVC + \delta(\psi - \gamma)\} \, \hat{\pi} \, \Delta\text{Top decile}$
Financial Gains
Additional financial gains that we may expect from a retention marketing campaign targeted using the scores predicted by bagging instead of the logit model:
ΔTop decile : 0.471 (= 2.246 – 1.775)
Gain = + $ 3,214,800
Additional financial gains that we may expect from a retention marketing campaign targeted using the scores predicted by bagging instead of a random selection:
ΔTop decile : 1.246 (= 2.246 – 1.000)
Gain = + $ 8,550,000
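The arithmetic behind these figures can be checked with a short sketch. It assumes the gain formula Gain = N α {γ LVC + δ(ψ − γ)} π̂ ΔTop decile and takes π̂ = 1.8%, the real-life proportion of churners in the validation sample; the function name `financial_gain` is hypothetical. Under these assumptions the dollar amounts quoted above are reproduced when the lift differences are rounded to two decimals (0.47 and 1.25).

```python
def financial_gain(n, alpha, delta_lift, pi_hat, gamma, lvc, delta, psi):
    """Gain = N * alpha * (gamma * LVC + delta * (psi - gamma)) * pi_hat * delta_lift."""
    return n * alpha * (gamma * lvc + delta * (psi - gamma)) * pi_hat * delta_lift

params = dict(n=5_000_000, alpha=0.10, pi_hat=0.018,    # pi_hat = 1.8% is an assumption
              gamma=0.30, lvc=2_500, delta=50, psi=0.50)

print(round(financial_gain(delta_lift=0.47, **params)))   # bagging vs. logit: 3214800
print(round(financial_gain(delta_lift=1.25, **params)))   # bagging vs. random: 8550000
```

With the unrounded differences 0.471 and 1.246 the same formula gives slightly different amounts, so the rounding step is part of the assumption.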
Most Important Churn Triggers
Bagging
Partial Dependence Plots (Bagging)
[Figure: two panels showing the probability to churn against the change in monthly min. of use (-1000 to 2000) and against equipment days (0 to 1500)]
Partial Dependence Plot
[Figure: partial dependence plot of the probability to churn]
Conclusions: Main Findings
1. Bagging and S.G. boosting are substantially better classifiers than the binary logit choice model
– Improvement of 26% in the top-decile lift,
– Good diagnostic measures offering face validity,
– Interesting insights about potential churn drivers,
– Bagging is conceptually simple and easy to implement.
2. Intercept correction constitutes an appropriate bias correction for bagging when using a balanced sampling scheme.
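Appendix: From Profit to Financial Gain
$\text{Profit}_1 = N \, \alpha \, \{ \gamma \, \hat{\pi}_1 \, LVC - \delta \, [\gamma \, \hat{\pi}_1 + \psi \, (1 - \hat{\pi}_1)] - c \}$
– $\gamma \, \hat{\pi}_1 \, LVC$ : LVC of a churner who does not churn
– $\delta \, [\gamma \, \hat{\pi}_1 + \psi \, (1 - \hat{\pi}_1)]$ : incentive cost for the churners retained + non-churners targeted
– c : contact cost
$\text{Top decile}_1 = \hat{\pi}_1 / \hat{\pi}$
$\text{Gain} = \text{Profit}_1 - \text{Profit}_2 = N \, \alpha \, \{\gamma \, LVC + \delta(\psi - \gamma)\} \, (\hat{\pi}_1 - \hat{\pi}_2) = N \, \alpha \, \{\gamma \, LVC + \delta(\psi - \gamma)\} \, \hat{\pi} \, \Delta\text{Top decile}_{1-2}$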