doing customer intelligence with r - data science for ... · 1/3/2004 · definition: “doing...
TRANSCRIPT
1111
\
October 19, 2004
Presented at the SDForum Business Intelligence SIGPalo Alto, CA
Loyalty Matrix, Inc., 580 Market St., Suite 600, San Francisco, CA 94104 (415) 296-1141www.LoyaltyMatrix.com R.LoyaltyMatrix.com
Doing Customer Intelligence with R
By Jim PorzakDirector of AnalyticsLoyalty Matrix, Inc.
PD
F processed w
ith CuteP
DF
evaluation editionw
ww
.CuteP
DF
.com
2222
� Introduction to Customer Intelligence at Loyalty Matrix
� Introduction to R
� R for Exploratory Data Analysis (EDA)
� R for Statistics & Data Mining
� Summary and Q&A
3333
What is Customer Intelligence (CI)?
� Twenty years ago a mentor in the Valley told me: “Jim, take care of your customers. They will take care of you.”
� The fundamentals have not changed
� But the data, tools and techniques have become much richer
� Definition: “Doing Customer Intelligence is analyzing customer behavioral and motivational data to build actionable business insights that can change marketing strategy and tactics to better align the organization with it’s customers needs and expectations.”
� Customer Intelligence is multifaceted
� It is quantitative – insights must be based on facts
� It is perceptive – insights are business interpretation of the numbers
� It is actionable – insights must usable for the business
� It is iterative – application generates more data & more insights
4444
Loyalty Matrix, Inc.
� A privately held marketing services company based on technology
� A combination of seasoned marketers, techies and proprietary technology providing our clients with unparalleled customer intelligence solutions focused on:
� Customer Acquisition
� Customer Retention
� Product Cross-sell and Up-sell
� Founded in 2001
� A San Francisco based firm with offices in Dallas, Chicago, and (soon) New York
� Key Goal: Transform Customer Data into Actionable Insights!
5555
Customer Intelligence (CI) in Action
�Case study: Apple .MAC Subscribers
� Business Challenge: Maximize renewal rates
� CI Insight: Early usage drives loyalty
� Business Impact: Shift marketing effort to drive new subscriber usage (depth then breadth)
� Case study: St. Regis Hotel Guests
� Business Challenge: Reduce high customer attrition
� CI Insight: Decline in loyal guests masked by boom economy single visit guests.
� Business Impact: Focus on regaining former loyal guests
6666
The MatrixOptimizer®: Data Insights Action
Customer Company
Transactional
Research data
Survey data
3rd party data
Web site usage data
The data the describes the relationship
Pricing
Product
Distribution
Data Strategies & Collection
Communication
Loyalty ProgramDesign & Evaluation
The results
MatrixOptimizer®
7777
The CI Framework at Loyalty Matrix
8888
Inside the CI solution, the MatrixOptimizer®
� Data Mart in Microsoft SQL 2000
� Holds “ready to report data” in dimensional schema.
� Major effort to load client data, quality checks, cleanse
� Client data staged in SQL
� OLAP cubes built with Microsoft Analysis Services
� Built off of Data Mart
� Each cube focuses on single set of business problems
� OLAP presentation layer built with eBlocks
� Web access to standard reports
� Allows slice and dice
� Version 3.0 released October, 2004
9999
How can R help build Customer Intelligence?
Challenges with Classical CI
� Aggregations limited to counts, sums & means.
� Nature of customer data
� Events with related # and/or $ values
� #’s, $’s & intervals all highly right skewed distributions
� Data quality often suspect – especially in dimensions (factors)
� No rigorous tests
� No advanced methods
R to the Rescue!
� Visualization
� Take first look at raw data
� EDA on “clean” data
� Classical stats
� Differences really significant?
� Efficacy of marketing efforts?
� Prediction & Modeling
� Classification
� Meaningful customer attributes
10101010
� Introduction to Customer Intelligence at Loyalty Matrix
� Introduction to R
� R for Exploratory Data Analysis (EDA)
� R for Statistics & Data Mining
� Summary and Q&A
11111111
Evolution of R from S
� R is the free (GNU), open source, version of S
� S developed by John Chambers and colleagues while at Bell Labs in 80’s
� For “data analysis and graphics”
� Version 4 defined by the “Green Book” Programming with Data, 1998
� S-Plus now owned and developed by Insightful Corp., Seattle, WA
� R was initially written by Robert Gentleman and Ross Ihaka
� In early 1990’s
� Statistics Department of the University of Auckland
� GNU GPL release in 1995
� Since 1997 a core group of 17 developers has had write access to the source
� V1.0 released in February, 2000
� New 0.1 level release ~ 6 months
12121212
Current state of R
� V2.0 Released October, 2004
� Windows, Mac OS, Linux & Unix ports
� Over 400 submitted packages from “abind” to “zoo”
� 12th newsletter (Volume 4/2) published September 2004
� The first useR! – R User Conference held in Vienna May 2004
� ~400 R-help newsgroup messages per week
� ~ Dozen texts specifically on R or with R examples and code
� R language generally accepted to be more powerful than S-Plus
� Some interesting GUI work in progress
13131313
R Resources
� R Homepage: http://www.r-project.org/
� The official site of R
� R Foundation: http://www.r-project.org/foundation/
� Central reference point for R development community
� Holds copyright of R software and documentation
� Support it!
� Local CRAN:
� Mirror site
� Current Binaries
� Current Documentation
� Link to related projects and sites
� R.LoyaltyMatrix.com blog
14141414
� Introduction to Customer Intelligence at Loyalty Matrix
� Introduction to R
� R for Exploratory Data Analysis (EDA)
� R for Statistics & Data Mining
� Summary and Q&A
15151515
Visualization is Key to EDA
Getting information from a table is like extracting sunlight from a cucumber.
– Farquhar & Farquhar, 1891.
If I can’t picture it, I can’t understand it. – Albert Einstein
You can see a lot, just by looking. – Yogi Berra
Thanks to Michael Friendly of VCD fame.
16161616
Our Favorite Visualization Methods
� For first look and exploration
� Frank Harrell’s datadensity for quick look, quality checks, …
� Scatter Plots for patterns, outliers, …
� Box Plots for median, range, outliers
� To understand customer behavior
� Interval Histograms for time between visit, purchase, …
� Distance Histograms for customer to store travel
� Geographical Maps for customer to store travel
� Exploration for correlations and associations
� Scatterplot Matrices
� Mosaic Plots for categorical associations
17171717
Basic Exploratory Data Analysis (EDA)
18181818
Everything We Know About the Fact Table – Example 1
19191919
Everything We Know About the Fact Table – Example 2
20202020
Scatter Plot Shows Correlation & Outlier Suspects.
0 e+00 1 e+07 2 e+07 3 e+07 4 e+07 5 e+07 6 e+07 7 e+07
050
00
0100
00
015
00
00
200
00
02002 Only
aum
fee
s
146917150105
174765
183667
185138
193045
199963
A.G. Edwards & Sons Inc. RBC Dain Rauscher Inc.
Merrill Lynch Pierce Fenner & Sm
Merrill Lynch Pierce Fenner & Sm
Salomon Smith Barney
Merrill Lynch Pierce Fenner & Sm
Salomon Smith Barney
UBS Financial Services, Inc.
21212121
Box Plots for Skewed Data like $’s & #’s
22222222
Boxplots for Strongly Skewed $ Fees by #’s of Accounts
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
05
00
01
00
00
15
00
02
00
00
Fees by Number of New Account Opens in 2002
# New Accounts Opened in 2002
Fe
es
23232323
Boxplots for Restaurant Group $ Sales over Time
15Ju
n02
22Ju
n02
29Ju
n02
06Ju
l02
13Ju
l02
20Ju
l02
27Ju
l02
03A
ug02
10A
ug02
17A
ug02
24A
ug02
31A
ug02
07S
ep02
14S
ep02
21S
ep02
28S
ep02
05O
ct0
21
2O
ct0
21
9O
ct0
22
6O
ct0
20
2N
ov02
09N
ov02
16N
ov02
23N
ov02
30N
ov02
07D
ec02
14D
ec02
21D
ec02
28D
ec02
04Ja
n03
11Ja
n03
18Ja
n03
25Ja
n03
01F
eb03
08F
eb03
15F
eb03
22F
eb03
01M
ar0
30
8M
ar0
31
5M
ar0
32
2M
ar0
32
9M
ar0
30
5A
pr0
31
2A
pr0
31
9A
pr0
32
6A
pr0
303
Ma
y03
10
Ma
y03
17
Ma
y03
24
Ma
y03
31
Ma
y03
07Ju
n03
14Ju
n03
21Ju
n03
28Ju
n03
05Ju
l03
12Ju
l03
19Ju
l03
26Ju
l03
02A
ug03
09A
ug03
16A
ug03
23A
ug03
30A
ug03
06S
ep03
13S
ep03
20S
ep03
27S
ep03
04O
ct0
31
1O
ct0
31
8O
ct0
32
5O
ct0
30
1N
ov03
08N
ov03
15N
ov03
22N
ov03
29N
ov03
06D
ec03
13D
ec03
20D
ec03
27D
ec03
03Ja
n04
010
000
2000
030
000
400
00
50
000
Total $'s of Presented Tickets by Week(Full Term Restaurants Only)
Week
$ A
mou
nt
24242424
Customer Purchase Intervals
25252525
Visit Interval Histogram for Restaurant/Bar/Arcade
Days between 1st and Last use of a Credit Card for PowerCard Transactions
DeltaDays
Fre
qu
en
cy
0 20 40 60 80 100 120 140
05
00
10
00
15
00
20
00
25
00
30
00
26262626
Purchase Interval for Vehicles Shows Key Customer Types
0 50 100 150
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
0.0
25
0.0
30
0.0
35
Time from Prior Purchase Segments by Months from Prior Purchase for GM Repeat Households
Months from Prior Purchase
Pro
bib
ility
of P
urc
hase
N = 95793
Buy-A-Pair
Keep as Long as Reliable
3-Year Cycle
50% of Purchases
27272727
Distance Traveled and Geography
28282828
Distance Traveled by Customer to Restaurant
Distance Travelled by All iDine Member Dinersto SCHMIDT'S RESTAURANT UND SAUSA (52279)
July '02 - September '03
Miles Traveled
# V
isits
0 500 1000 1500 2000 2500 3000
02
00
40
06
00
1777 visits up to 50 miles
Distance Travelled by Local iDine Member Dinersto SCHMIDT'S RESTAURANT UND SAUSA (52279)
July '02 - September '03
Miles Traveled (First 50)
# V
isits
0 10 20 30 40 50
05
01
00
20
0 1777 visits up to 50 miles.
Distance Travelled by All iDine Member Dinersto STRADA WORLD CUISINE (27327)
July '02 - September '03
Miles Traveled
# V
isits
0 500 1000 1500 2000 2500 3000
02
04
06
08
0
1566 visits up to 50 miles
Distance Travelled by Local iDine Member Dinersto STRADA WORLD CUISINE (27327)
July '02 - September '03
Miles Traveled (First 50)
# V
isits
0 10 20 30 40 50
05
01
00
15
0
1566 visits up to 50 miles.
29292929
Map Home ZIP’s for Customers Dining at Restaurants
SCHMIDT'S RESTAURANT UND SAUSA (52279) Visits in Last 18 Months
0 20 4060 mi
STRADA WORLD CUISINE (27327) Visits in Last 18 Months
0 20 4060 mi
30303030
Geographic Density of Outlets Relative to Population
31313131
Geographic Density with Ethnic Targeted Coverage
32323232
Understanding Categorical Variables
33333333
Mosaic Plots Show Automotive Brand Loyalty
Sta
nd
ard
ize
dR
esid
ua
ls:
<-4
-4:-
2-2:0
0:2
2:4
>4
Division Last Pch
Div
isio
n P
rio
r P
ch
Chevy P B C OI GMC ? SG
Chevy
PB
CO
GM
C?SG
Last Division by Prior Division Same Consumer ( N = 5653 )
A B C D E F G H I J K
* F E D C B A
34343434
Mosaic Pairs Shows Categorical Responses in Survey
Mode
F M 1
Ma
ilW
eb
1 2 3 4+ *
Web
W M 3 6 6+*
Web
B A MP*
Web
Mail Web
FM
1
Gender
1 2 3 4+ *
FM
1
W M 3 6 6+*
FM
1
B A MP*
FM
1
Mail Web
12
34+
*
F M 1
12
34+
*
Age
W M 3 6 6+*
12
34+
*
B A MP*
12
34+
*
Mail Web
WM
36
6+
*
F M 1
WM
36
6+
*
1 2 3 4+ *
WM
36
6+
*
Last.Visit
B A MP*
WM
36
6+
*
Mail Web
BA
MP *
F M 1
BA
MP *
1 2 3 4+ *
BA
MP *
W M 3 6 6+*
BA
MP *
Quality
BR Survey - Catigorical Answers
35353535
Mosaic Shows Mailed vs. Control Purchases are Alike
Sta
nd
ard
ize
dR
esid
ua
ls:
<-4
-4:-
2-2
:00:2
2:4
>4
Travel Trailer
MorC
Cla
ss
C M
CLA
SS
AC
LA
SS
CF
IFT
H W
HE
EL
FO
LD
ING
TR
AIL
ER
TR
AV
EL T
RA
ILE
RT
RU
CK
CA
MP
ER
36363636
Customer Segment Migration
37373737
Customer Segment Migration
38383838
� Introduction to Customer Intelligence at Loyalty Matrix
� Introduction to R
� R for Exploratory Data Analysis (EDA)
� R for Statistics & Data Mining
� Summary and Q&A
39393939
Classical Statistics add Rigor
�Background
� Business analysts mostly use Excel
� Great for exploration and building presentations
� Tests of significance at best an after-thought
�Example: Test Direct Mail Campaign Effectiveness
� Offer: Test drive “RV” at your local dealer. Get a goodie.
– ~$100,000 is not an impulse purchase!
– Low conversion rates (proportions)
� Control group is “hold out” from each list.
� Are there incremental sales? How many? $ value?
40404040
Fisher.test Measures if Mail Campaign was Effective
>PA.Group <- c(64443, 12060) # Size of Mailed, Control groups
>PA.Sales <- c(139, 15) # Sales to Mailed, Control groups
>fisher.test(matrix(c(PA.Sales, PA.Group-PA.Sales), 2),
alternative="greater")
Fisher's Exact Test for Count Data
data: matrix(c(PA.Sales, PA.Group - PA.Sales), 2)
p-value = 0.02122
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
1.095079 Inf
sample estimates:
odds ratio
1.735754 We will be wrapping the Fisher test with an easy-to-use interface and result visualization as a Campaign Effectiveness module.
41414141
All Mail Lists for a Campaign
42424242
Marketing Campaign Visualization
43434343
Subscription Survival Analysis
� Probability of surviving a certain number of days?
� What is half live of typical subscriber?
� Hazard Ratio = (# Expired) / (# at risk) at each time interval
� H(t) = Ne(t) / Nr(t)
� Probability of Survival = product (1 – H(t)) up to time of interest
� Ps(t) = Π (1 – H(i) ; i = 1, t
� Need to know
� Number of days in subscription
� Expired or censored?
� See
� Gordon Linoff’s article in Intelligent Enterprise, August 2004
� Tableman & Kim, Survival Analysis Using S, Chapman & Hall, 2004
� R library: survival and more…
44444444
Subscription Survival and Hazard
0 200 400 600 800
0.0
0.2
0.4
0.6
0.8
1.0
Monthly Subscription
Days into Subscription
Pro
bib
ility
of S
urv
iva
l
00.0
10.0
20
.03
0.0
40
.05
0.0
6
Ha
zard
Ratio
45454545
Some R Code – Read data set & plot survival curve
library(RODBC)
channel <- odbcConnect(“MyDataSet", uid = “me", pwd = “xx")
sSQL <- readLines("qCustomerSubscriptions4R.sql", n = -1)
PBSub <- sqlQuery(channel, sSQL, na.strings = c("", " "))
nDays <- 900
NumSubs <- length(SubStat)
NumSuccumbed <- table(NumDaysInSub[SubStat=="Expire"])
NumAtRisk <- NumSubs - cumsum(table(NumDaysInSub))
+ table(NumDaysInSub)[1]
HazardProb <- NumSuccumbed[1:nDays]/NumAtRisk[1:nDays]
SurvivalProb <- cumprod(1-HazardProb)
plot(SurvivalProb, type = "s", ylim = c(0, 1),
xlab = "Days into Subscription",
ylab = "", main = “Subscription )
46464646
0 200 400 600 800
0.0
0.2
0.4
0.6
0.8
1.0
Monthly Subscription
Days into Subscription
Pro
bib
ility
of S
urv
iva
l
Playboy Cyber Club (N = 80905 , $k= 6187 )PlayboyNet (N = 16339 , $k= 976 )PlayboyTV Jukebox (N = 829 , $k= 156 )SpiceNet (N = 7495 , $k= 322 )SpiceTV Club (N = 1822 , $k= 186 )
Survival by Product Type for Monthly Subscriptions
47474747
Survival by Product Type for Yearly Subscriptions
0 200 400 600 800
0.0
0.2
0.4
0.6
0.8
1.0
Yearly Subscription
Days into Subscription
Pro
bib
ility
of S
urv
iva
l
Playboy Cyber Club (N = 109943 , $k= 8888 )PlayboyNet (N = 8910 , $k= 619 )PlayboyTV Jukebox (N = 481 , $k= 64 )SpiceNet (N = 1785 , $k= 102 )SpiceTV Club (N = 937 , $k= 158 )
48484848
Advanced Methods Add CI Insight
�Background
� High-end “data mining” usually done with SAS, SPSS, …
� Not generally available to front line business analysts
�Example: Customers purchasing different types of vehicles
� Two distinct product groups
– High end RV’s
– Entry level towables
� How do customer profiles differ?
– Continuous & categorical data
– Sales data + 3rd party demographic matching
� What can we learn for product design? Marketing?
49494949
randomForest Differentiates Between Customer Groups
UnitSizeOwnMCycleIPhotoAIntelIBoatSaleAmLkMLOwnRVISailIHuntShootFamPosIBikeGenderDwelTypeRegionMSA.TypeIFishHHCompAEnvironLORATraditPNE.LSGSaleMonthPNE.SGEstIncomeHHCldrnCMSAHouseValMBGrpMBSegAgeRange
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
MeanDecreaseAccuracy
Importance
> FWE.rf
Call:randomForest.default(x = FWEi, y = Type, mtry = 6, importance = TRUE,proximity = TRUE, outscale = TRUE)
Type of random forest: classification
Number of trees: 500No. of variables tried at each split: 6
OOB estimate of error rate: 28.95%
Confusion matrix:
LSV Analog TT Entry class.errLSV Analog 622 222 0.26303TT Entry 295 647 0.31316
randomForest package by Andy Liaw and Matthew Wiener based on original Fortran code by Leo Breiman and Adele Cutler
50505050
randomForest Shows Variable Importance by Group
OwnCompIPhysFitAIntelUnitSizeOwnCellAmLkMLIPhotoOwnMCycleMSA.TypeLORIMCycleFamPosIBikePNE.LSGIFishGenderOwnRVIHuntShootHHCompEstIncomeRegionPNE.SGSaleMonthDwelTypeCMSAHouseValHHCldrnMBSegMBGrpAgeRange
0.0 0.2 0.4 0.6 0.8
LSV Analog
Importance
DwelTypeIPhotoUnitSizeOwnCompRegionAmLkMLIPwrBoatHHIMAIntelGenderFamPosIBikeIBoatSaleHHCompISailIFishMSA.TypeHHCldrnPNE.SGSaleMonthPNE.LSGLOREstIncomeATraditAEnvironCMSAMBGrpMBSegHouseValAgeRange
0.0 0.2 0.4 0.6 0.8
TT Entry
Importance
51515151
Multi-dimensional Scaling Plot of rF Proximities
Dim 1
-0.2 0.0 0.2 -0.3 -0.1 0.1
-0.1
0.1
0.3
-0.2
0.0
0.2
Dim 2
Dim 3
-0.2
0.0
0.2
-0.3
-0.1
0.1
Dim 4
-0.1 0.1 0.3 -0.2 0.0 0.2 -0.2 0.0 0.2
-0.2
0.0
0.2
Dim 5
52525252
Association Analysis
� AKA “Shopping Basket” analysis
�Apriori algorithm by Christian Borgelt
53535353
Demographic Associations for High Spenders
54545454
R Lessons Learned at Loyalty Matrix
� R a flexible addition to “classical CI” methods
� Yes, there is a learning curve
� Very responsive R user community and support
� No problem with client acceptance
55555555
Questions, Comments?
� Now would be the time
� My email is [email protected]
� Our R Blog is R.LoyaltyMatrix.com