data mining to improve e-mail marketing

Tayko: Smart Marketing using analytics

Business Problem

Tayko is a software catalog firm that sells games and educational software. It wants to market a new collection using e-mail marketing. As a member of an industry consortium, it can pull 200,000 e-mail addresses from the consortium's central repository. To maximize the benefit, Tayko wants to pull records with a high probability of response and a higher value of sale.

Analytics Problem

1. Create a classification model to group customers as responders/purchasers (1) or non-responders/non-purchasers (0).

2. Create a prediction model to predict the value of sale for the responders (1).

Data Collection

Supervised learning techniques are applied because the desired output is already defined.

A sample of 2000 customers is drawn from the central repository and a test e-mail campaign is run.

The 2 target variables, Purchased and Spending, are recorded for the sample.

The result showed 1000 purchasers and 1000 non-purchasers.

Data partitioning

The data set is partitioned into:

• Training – 60% – 1200 records

• Testing – 20% – 400 records

• Validation – 20% – 400 records
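The 60/20/20 split above can be sketched as follows. This is a minimal illustration, not the tool used for the study: the function name `partition`, the fixed seed and the use of a simple random shuffle are assumptions, and integer IDs stand in for the 2000 sample records.

```python
import random


def partition(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle records and split them into training, testing and validation sets.

    The seed and the plain random shuffle are illustrative assumptions.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    # Whatever remains after training and testing goes to validation.
    validation = shuffled[n_train + n_test:]
    return training, testing, validation


# Integer IDs stand in for the 2000 sampled customers.
training, testing, validation = partition(range(2000))
print(len(training), len(testing), len(validation))  # 1200 400 400
```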

Initial Study

What kinds of variables are present?

Finding the variables with strong differentiation power – Nominal Variables

Use of Catalogs A, T, U and P shows a high percentage of people making a purchase.

Use of Catalogs O and H shows a high percentage of people not making a purchase.

However, only Catalogs A and U were used by more than 100 customers each; Catalog H was used by more than 50 customers and the rest by fewer than 50. The distribution of catalogs was not even.

Other Nominal Variables

Of the other categorical variables, “Order Online” is the only one that shows some power to differentiate between purchasers and non-purchasers.

Ordinal Variables

Number of purchases last year shows a good trend.

People who made no purchase last year have not made a purchase with the new catalogs either.

People who made more than 3 purchases last year have made a purchase this time as well.

Scale Variables

Of the 2 scale variables, “Last update to customer record” shows a significant difference in mean between purchasers and non-purchasers.

Target Variables

Purchasers and non-purchasers are equally distributed. However, the sales value (the amount spent by a customer) follows a non-normal distribution.

Classification

Who will make a purchase?

Logistic Regression – Training

Final set of variables

1. Frequency: Number of transactions in last year at source catalog

2. Web Order: Customer placed at least 1 order via web

3. Address is Residence: Address is a residence

4. Source_a, h or u: Source catalog is A, U or H
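To show how a logistic regression over these variables turns a customer record into a purchaser/non-purchaser label, here is a minimal sketch. The fitted coefficients are not given in the slides, so the values below are placeholders only; the dictionary keys, the single combined source-catalog indicator and the 0.5 cutoff are also assumptions made for illustration.

```python
import math


def purchase_probability(record, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * record[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def classify(record, coefficients, intercept, cutoff=0.5):
    """Label a record 1 (purchaser) or 0 (non-purchaser) at the given cutoff."""
    return 1 if purchase_probability(record, coefficients, intercept) >= cutoff else 0


# PLACEHOLDER coefficients, NOT the fitted values from the study.
example_coefficients = {
    "frequency": 1.0,
    "web_order": 0.5,
    "address_is_res": -0.5,
    "source_a_h_u": 0.5,
}
example_intercept = -1.0

# Hypothetical customer record using the four final variables.
record = {"frequency": 2, "web_order": 1, "address_is_res": 0, "source_a_h_u": 1}
probability = purchase_probability(record, example_coefficients, example_intercept)
label = classify(record, example_coefficients, example_intercept)
print(round(probability, 3), label)
```

With real fitted coefficients in place of the placeholders, the same two functions would score each record pulled from the consortium repository.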

Logistic Regression – Testing & Validation

Test: Overall accuracy 80%

Validation: Overall accuracy 77%

Decision Tree – Training

The CHAID growing method gave the best results.

Decision Tree – Test & Validate

Test: Overall accuracy 76%

Validation: Overall accuracy 74%

Result

Logistic regression gives a better result than the decision tree on both the test set (80% vs 76%) and the validation set (77% vs 74%).
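The overall accuracy figures compared here are the share of records whose predicted class matches the actual class. A minimal sketch of that calculation follows; the function name and the five-record example are hypothetical and are not data from the study.

```python
def overall_accuracy(actual, predicted):
    """Fraction of records whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels: 1 = purchaser, 0 = non-purchaser.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
print(overall_accuracy(actual, predicted))  # 0.6
```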

Prediction

How much will a purchaser spend?

New Calculated Variables

• High correlation between “last_update_days_ago” and “1st_update_days_ago”

• New calculated variable DayDiff, which is the difference of the 2 variables
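The derived variable can be sketched as below. The slides do not state the direction of the subtraction; this sketch assumes DayDiff = 1st_update_days_ago - last_update_days_ago, which is non-negative when the first update is older than the last one. The function name and the example values are hypothetical.

```python
def add_day_diff(record):
    """Return a copy of the record with the derived DayDiff variable added.

    Assumption: DayDiff = 1st_update_days_ago - last_update_days_ago.
    """
    updated = dict(record)
    updated["DayDiff"] = record["1st_update_days_ago"] - record["last_update_days_ago"]
    return updated


# Hypothetical customer record.
customer = {"1st_update_days_ago": 900, "last_update_days_ago": 250}
print(add_day_diff(customer)["DayDiff"])  # 650
```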

Multiple Linear Regression

Pre-processing

Univariate analysis and transformation of the target variable “Spend”

Outlier removal, filtering and transformation

Model & Performance

4 models are generated

Case 1: Non-Residence Address & Not a Web-Order (R-sqr: 0.569 & Adj R-sqr: 0.566)

Spending = -15.733 + 79.11 * No of transactions last year - 47.825 * Catalog D + 30.632 * Catalog U

Case 2: Non-Residence Address & Web-Order (R-sqr: 0.62 & Adj R-sqr: 0.616)

Spending = -42.285 + 115.976 * No of transactions last year + 45.506 * Catalog U - 247.655 * Catalog H + 55.605 * Catalog R

Case 3: Residence Address & Not a Web-Order (R-sqr: 0.516 & Adj R-sqr: 0.507)

Spending = -26.965 + 69.218 * No of transactions last year + 66.219 * Catalog U - 113.587 * Catalog H

Case 4: Residence Address & Web-Order (R-sqr: 0.612 & Adj R-sqr: 0.592)

Spending = -4.616 + 65.114 * No of transactions last year - 111.934 * Catalog H - 81.28 * Catalog R - 129.754 * Catalog C + 66.242 * Catalog A
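The four equations can be applied by first selecting the case from the address and web-order flags and then evaluating the matching equation. The sketch below uses the coefficients as stated on the slide. It assumes each catalog term is a 0/1 indicator and that a customer has at most one source catalog; the function and variable names are hypothetical. The output is on whatever scale the slide's equations use, which may be the transformed “Spend”.

```python
# Keyed by (address_is_residence, web_order):
# (intercept, coefficient per transaction last year, catalog indicator coefficients)
MODELS = {
    (False, False): (-15.733, 79.11, {"D": -47.825, "U": 30.632}),
    (False, True): (-42.285, 115.976, {"U": 45.506, "H": -247.655, "R": 55.605}),
    (True, False): (-26.965, 69.218, {"U": 66.219, "H": -113.587}),
    (True, True): (-4.616, 65.114,
                   {"H": -111.934, "R": -81.28, "C": -129.754, "A": 66.242}),
}


def predict_spending(address_is_residence, web_order,
                     transactions_last_year, source_catalog=None):
    """Evaluate the regression equation for the case matching the two flags.

    Assumption: catalogs are 0/1 indicators and at most one applies;
    catalogs absent from a case's equation contribute nothing.
    """
    key = (bool(address_is_residence), bool(web_order))
    intercept, per_transaction, catalog_effects = MODELS[key]
    return (intercept
            + per_transaction * transactions_last_year
            + catalog_effects.get(source_catalog, 0.0))


# Case 3 example: residence address, no web order, 2 transactions, Catalog U.
print(round(predict_spending(True, False, 2, "U"), 3))
```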

MAD & MAPE

Training: MAD 68.89, MAPE 103%

Test: MAD 104.53, MAPE 109%

Validation: MAD 104.03, MAPE 101%
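For reference, the two error measures can be computed as sketched below, assuming MAD means the mean absolute deviation of predictions from actual values and MAPE means the mean absolute percentage error. Skipping records with an actual value of zero in MAPE is an assumption of this sketch, and the three-record example is hypothetical.

```python
def mad(actual, predicted):
    """Mean absolute deviation between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Records with an actual value of 0 are skipped (assumption), because
    the percentage error is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Hypothetical spending values.
actual = [100.0, 200.0, 50.0]
predicted = [110.0, 150.0, 100.0]
print(round(mad(actual, predicted), 2), round(mape(actual, predicted), 2))
```

A MAPE above 100% means that, on average, the absolute error is larger than the actual amount spent.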

Regression Tree – Exhaustive CHAID

MAD & MAPE

Training: MAD 105.37, MAPE 95%

Test: MAD 121.54, MAPE 103%

Validation: MAD 121.31, MAPE 113%

Decision

Both models are very weak at predicting the amount spent; the evaluation indicators show high error. One major reason may be the lack of scale variables and the high correlation between the scale variables that are available. Since most variables are nominal, converting the prediction problem to a classification problem might produce better results, but that was out of scope for the given problem.

Conclusion

The classification of customers into purchasers and non-purchasers shows good results, and the selected logistic regression model is expected to perform well in a live situation too.

However, the prediction models show weak performance, and a high degree of error is expected if they are used in their current state.
