Day 3: Predictive Modeling & Causal Inferences beibei...
TRANSCRIPT
Day 1: BI & DA Overview, Business Cases - Individual Assignment
Day 2: Machine Learning & Data Mining Basics - Group Assignment
Day 3: Predictive Modeling vs. Causal Inferences
- How to Interpret Regression Results
- Causal Identification Strategies
- Economic Value of Online Word-of-Mouth
- Social Network Influence
- Multichannel Advertising Attribution
- Randomized Field Experiment of Mobile Recommendation
Day 4: Bridging Machine Learning with Social Science
- Case 1: Interplay Between Social Media & Search Engine
- Case 2: Understand and Predict Consumer Search and Purchase Behavior
- Case 3: Text Mining & Sponsored Search Advertising
Model-Free Data Exploration & Visualization Unsupervised Learning (Pattern Discovery) (Market Basket Analysis, Association Rule, Clustering)
Supervised Learning (Predictive Modeling) (Decision Tree, Linear Regression, Logistic Regression)
• Market Basket Analysis
  • Most important part of a business: what merchandise are customers buying, and when?
• Association Rules
  • Building association rules
  • How good are association rules?
• Clustering
  • Group similar items
  • Consumer segmentation
Rule format: If {set of conditions} Then {set of results}
Body (Condition) implies Head (Result), where body and head are conjunctions of items. Direction of the rule matters!
Example: If {Diapers} Then {Beer}
{Diapers} is the Body/LHS; {Beer} is the Head/RHS.
Recap: Association Rule Evaluation Criteria
Consider the rule A => B:
o Support (“Co-occurrence”): P(A,B)
o Confidence (“Conditional occurrence”): P(B|A)
o Expected Confidence: P(B)
o Lift: P(B|A) / P(B) = P(A,B) / (P(A)P(B))
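These criteria can be computed directly from a list of market baskets. A minimal sketch, using a toy set of hypothetical transactions for the rule {Diapers} => {Beer}:

```python
# Hypothetical transactions; each basket is a set of items.
transactions = [
    {"Diapers", "Beer"},
    {"Diapers", "Beer", "Milk"},
    {"Diapers", "Milk"},
    {"Beer", "Chips"},
    {"Milk", "Chips"},
]

def rule_metrics(body, head, transactions):
    """Support, confidence, and lift for the rule body => head."""
    n = len(transactions)
    p_body = sum(body <= t for t in transactions) / n            # P(A)
    p_head = sum(head <= t for t in transactions) / n            # P(B), "expected confidence"
    p_both = sum((body | head) <= t for t in transactions) / n   # P(A,B), "support"
    confidence = p_both / p_body                                 # P(B|A)
    lift = confidence / p_head                                   # P(B|A) / P(B)
    return p_both, confidence, lift

support, confidence, lift = rule_metrics({"Diapers"}, {"Beer"}, transactions)
# support = 0.4, confidence = 2/3, lift > 1 (Beer is more likely given Diapers)
```

A lift above 1 means the body and head co-occur more often than they would if independent.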
• Market Basket Analysis
  • Most important part of a business: what merchandise are customers buying, and when?
• Association Rules
  • Building association rules
  • How good are association rules?
• Clustering
  • Group similar items
  • Consumer segmentation
Key: What is Similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features. -- Webster's Dictionary
case gender glasses moustache smile hat
1 0 1 0 1 0
2 1 0 0 1 0
3 0 1 0 0 0
4 0 0 0 0 0
5 0 0 0 1 0
6 0 0 1 0 1
7 0 1 0 1 0
8 0 0 0 1 0
9 0 1 1 1 0
10 1 0 0 0 0
11 0 0 1 0 0
12 1 0 0 0 0
Each user is represented as a “feature vector”
Need a distance measure for different cases (vectors).
Example of a distance measure: the Euclidean distance
D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
X = [x1 x2 x3 x4 x5 …] Y = [y1 y2 y3 y4 y5 …]
case gender glasses moustache smile hat
1 0 1 0 1 0
2 1 0 0 1 0
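The Euclidean distance above is straightforward to sketch in code; here it is applied to the feature vectors of cases 1 and 2 from the table:

```python
import math

def euclidean(x, y):
    """Euclidean distance D(X, Y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Cases 1 and 2 from the table: (gender, glasses, moustache, smile, hat)
case1 = [0, 1, 0, 1, 0]
case2 = [1, 0, 0, 1, 0]

d = euclidean(case1, case2)  # the cases differ on two features -> sqrt(2)
```

For binary features like these, the squared Euclidean distance simply counts the features on which two cases disagree.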
Each cluster is represented by the cluster center (the mean). The cluster center is the “average” object in the cluster.
e.g., clustering customers on (1) age and (2) income: the cluster center is a “virtual average customer” with the average age and the average income of the customers in the cluster.
[Figure: scatter of customers by Age (x-axis) and Income (y-axis), with the cluster center marked.]
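Computing the “virtual average customer” is just a coordinate-wise mean; a minimal sketch with hypothetical (age, income) members of one cluster:

```python
# Hypothetical cluster members: (age, income) pairs.
cluster = [(25, 40000), (35, 60000), (30, 50000)]

def centroid(points):
    """Cluster center: the coordinate-wise mean of the cluster's members."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

center = centroid(cluster)  # (30.0, 50000.0): the "virtual average customer"
```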
• Market Basket Analysis
  • Most important part of a business: what merchandise are customers buying, and when?
• Association Rules
  • Building association rules
  • How good are association rules?
• Clustering
  • Group similar items
  • Consumer segmentation
Model-Free Data Exploration & Visualization Unsupervised Learning (Pattern Discovery) (Market Basket Analysis, Association Rule, Clustering)
Supervised Learning (Predictive Modeling) (Decision Tree, Linear Regression, Logistic Regression)
Prediction Overview: Classification (vs. Regression)
• Decision Trees
• Regression
  • Linear Regression
  • Logistic Regression
• Naive Bayes, SVM, K-Nearest Neighbor (KNN), …
[Tree diagram: a node with 16 bad / 14 good cases and two candidate splits. Split on Balance (Balance < 50K vs. Balance >= 50K) gives child nodes of 4 bad / 13 good and 12 bad / 1 good. Split on Age (Age < 45 vs. Age >= 45) gives child nodes of 3 bad / 11 good and 13 bad / 3 good.]
Which Attribute to Choose?
Entropy and Information Gain
Entropy: how mixed/noisy is a set? (“uncertainty”) - Originally defined to account for the flow of energy through a thermodynamic process.
Assume there are two classes, Pink and Green. When the set of examples S contains p elements of class Pink and g elements of class Green:
E(S) = -\frac{p}{p+g}\log_2\frac{p}{p+g} - \frac{g}{p+g}\log_2\frac{g}{p+g}
Information Gain: the expected reduction in entropy, e.g., how much closer to purity?
Grow a Tree:
o At each step, choose the variable (decision criterion) that will lead to the highest “gain” in information (“variable worth”).
Prune a Tree:
o Avoid overfitting: choose the proper number of leaves that will lead to the lowest validation error (not training error).
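The split-selection step can be sketched directly from the entropy formula. A minimal sketch using the 16 bad / 14 good node and the two candidate splits (Balance, Age) from the earlier slide:

```python
import math

def entropy(p, g):
    """Entropy of a set with p elements of one class and g of the other."""
    total = p + g
    e = 0.0
    for c in (p, g):
        if c:  # 0 * log(0) is taken as 0
            e -= (c / total) * math.log2(c / total)
    return e

def info_gain(parent, children):
    """Expected reduction in entropy from splitting parent into children.
    parent and children are (count_class1, count_class2) tuples."""
    n = sum(parent)
    weighted = sum((sum(ch) / n) * entropy(*ch) for ch in children)
    return entropy(*parent) - weighted

# Candidate splits of the 16 bad / 14 good node from the slide:
gain_balance = info_gain((16, 14), [(4, 13), (12, 1)])   # Balance split
gain_age     = info_gain((16, 14), [(3, 11), (13, 3)])   # Age split
best = "Balance" if gain_balance > gain_age else "Age"
```

The Balance split produces purer children (12 bad / 1 good is almost pure), so it yields the higher information gain and would be chosen at this step.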
Form a group of 4-6;
Write a summary report about the major methodologies of machine learning, including both supervised and unsupervised learning;
Compare the pros and cons of each method;
Find a real-world business application of each method you discuss;
Page limit 5-10 pages, in English;
Due: last day of class.
Prediction Overview: Classification (vs. Regression)
• Decision Trees
• Regression
  • Linear Regression
  • Logistic Regression
Regressions offer a different approach to prediction than decision trees: as parametric models, they assume a specific association structure between the inputs and the target.
[Figure: cases plotted by Age and Balance; the tree's split over balance (at 50K) and split over age (at 45) partition the plane into axis-aligned regions.]
Linear relationship: Outcome (dependent/criterion) variable is a linear combination of predictor (independent) variables.
In two dimensions (one predictor, one outcome variable) data can be plotted on a scatter diagram.
y = β0 + β1x + ε
where y is the expected outcome, β0 the intercept term, β1 the coefficient, and x the predictor variable.
Regression Model: y = β0 + β1x + ε, with unknown parameters β0, β1.
Sample Data: (x1, y1), …, (xn, yn).
The sample statistics b0 and b1 provide estimates of β0 and β1.
Estimated Regression Equation: ŷ = b0 + b1x
Simple Linear Regression Equation:
[Figure: regression line with negative slope b1 and intercept b0, fitted through a scatter of data points; axes E(y) vs. x.]
Simple Linear Regression Equation: No Relationship
[Figure: slope β1 is 0, so the regression line is flat at the intercept β0; axes E(y) vs. x.]
How good is the model? How big is the error of the prediction?
Residual: e = Y − E(Y), the difference between the observed value and the predicted value.
Final Goal: Good Fit?
Minimize the Error (Residual e)
Sum (mean) of absolute errors: |e1| + |e2| + |e3| + …
Sum (mean) of squared errors: e1² + e2² + e3² + …
Linear Regression (Ordinary Least Squares): e = Y − E(Y)
[Figure: actual data points scattered around the predicted line.]
Least Squares Criterion: minimize the sum of squared errors (the vertical distances between the actual data and the estimated line):
\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
b1 - slope of the estimated regression equation:
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
b0 - intercept of the estimated regression equation:
b_0 = \bar{y} - b_1 \bar{x}
where:
x_i = value of the independent variable for the i-th observation;
y_i = value of the dependent variable for the i-th observation;
\bar{x} = mean value of the independent variable;
\bar{y} = mean value of the dependent variable;
n = total number of observations.
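The slope and intercept formulas above translate directly into code. A minimal sketch on a hypothetical sample constructed so that y = 2 + 3x exactly, so the fit recovers the coefficients:

```python
def ols(xs, ys):
    """Least-squares estimates (b0, b1) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    # b0 = y_bar - b1 * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data lying exactly on y = 2 + 3x
b0, b1 = ols([1, 2, 3, 4], [5, 8, 11, 14])  # -> b0 = 2.0, b1 = 3.0
```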
More than one predictor…
E(y) = α + β1·X + β2·W + β3·Z + …
Each regression coefficient is the amount of change in the outcome variable expected per one-unit change in that predictor, holding all other variables in the model constant.
Market Share = α + β1*Price + β2 * Rating + β3 *#Reviews…
e.g., Buy vs. No buy? What is the probability for a consumer to buy a product? Can we use simple linear regression?
0 ≤ Pr(x) ≤ 1: the probability is bounded by 0 and 1, and the relationship is nonlinear. We need a proper transformation!
Similar to linear regression, with two main differences: Y (the outcome or response) is binary (Yes/No, Approve/Reject, Responded/Did not respond), and the result is expressed as a probability of being in either group.
Y ∈ {0, 1}. Goal: find a transformation u = f(Y) with u ∈ (−∞, +∞), so that a regular linear regression applies:
u = f(Y) = b_0 + b_1 X_1 + b_2 X_2 + \ldots
Probability of Success: \hat{\pi}, the probability of success for case i:
\hat{\pi} = P(Y = 1 \mid X) = \frac{e^u}{1 + e^u}, \quad \pi \in [0, 1]
Odds-Ratio for Success: ratio of the probability of success over the probability of failure:
\frac{\hat{\pi}}{1 - \hat{\pi}} = \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = e^u \in (0, +\infty)
“Logit”: Log Odds-Ratio. Taking the natural log of both sides, we can write the logistic regression equation for each case i (u = a regular linear regression). Goal achieved: f() maps Y's probability to (−∞, +∞):
\ln\frac{\hat{\pi}}{1 - \hat{\pi}} = \ln\frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = u = b_0 + b_1 X_1 + b_2 X_2 + \ldots \in (-\infty, +\infty)
For a single predictor:
\ln\frac{\hat{\pi}}{1 - \hat{\pi}} = \ln\frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = u = b_0 + b_1 X_1
For multiple predictors:
\ln\frac{\hat{\pi}}{1 - \hat{\pi}} = \ln\frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = u = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k
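The transformation can be sketched numerically: u ranges over the whole real line while the predicted probability stays in (0, 1). The coefficients below (b0 = -1.0, b1 = 0.5) are hypothetical, chosen only for illustration:

```python
import math

def predict_prob(x, b0=-1.0, b1=0.5):
    """pi = e^u / (1 + e^u) for the linear predictor u = b0 + b1*x."""
    u = b0 + b1 * x                          # log odds-ratio, in (-inf, +inf)
    return math.exp(u) / (1 + math.exp(u))   # probability, in (0, 1)

def logit(p):
    """Inverse transform: the log odds-ratio ln(p / (1 - p))."""
    return math.log(p / (1 - p))

p = predict_prob(2.0)  # u = -1.0 + 0.5 * 2.0 = 0 -> probability 0.5
```

Applying `logit` to a predicted probability recovers the linear predictor u, which is exactly why the logit makes linear-regression machinery usable for a binary outcome.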
Hotel Booking Prediction:
• Hotel A, $150 per night, 4-class, near beach, downtown, …
• Hotel B, $200 per night, 5-class, near highway, swimming pool, …
• …
• Hotel N, …
Target: Booking decision (for a hotel j on a date t): Yes/No?
Date       HotelID  Beach  Downtown  Class  Price  …  Booking
9/1/2013   00001    1      1         4      150    …  Yes
9/1/2013   00002    0      1         5      200    …  No
…          …        …      …         …      …      …  …
9/24/2013  0000N    0      0         2      95     …  Yes
…          …        …      …         …      …      …  …
10/1/2013  10001    1      1         5      220    …  ???
\hat{\pi}_{jt} = P(Y_{jt} = 1 \mid X_{jt}) = \frac{\exp(b_0 + \ldots + b_k \text{Price}_{jt})}{1 + \exp(b_0 + \ldots + b_k \text{Price}_{jt})}
\ln\frac{\hat{\pi}_{jt}}{1 - \hat{\pi}_{jt}} = \ln\frac{P(Y_{jt} = 1 \mid X_{jt})}{P(Y_{jt} = 0 \mid X_{jt})} = b_0 + b_1 \text{Beach}_j + b_2 \text{Downtown}_j + b_3 \text{Class}_j + \ldots + b_k \text{Price}_{jt}
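With fitted coefficients, the booking probability for any hotel-date row follows directly from the formula above. The coefficients below are purely hypothetical placeholders (the real values would come from fitting the model to the booking data):

```python
import math

# Hypothetical coefficients for illustration only, NOT fitted values.
COEF = {"intercept": 0.5, "Beach": 0.8, "Downtown": 0.4,
        "Class": 0.3, "Price": -0.02}

def booking_prob(beach, downtown, hotel_class, price):
    """P(booking) via the logit model: pi = e^u / (1 + e^u)."""
    u = (COEF["intercept"]
         + COEF["Beach"] * beach
         + COEF["Downtown"] * downtown
         + COEF["Class"] * hotel_class
         + COEF["Price"] * price)
    return math.exp(u) / (1 + math.exp(u))

# Hotel A from the slide: near beach, downtown, 4-class, $150 per night.
p_a = booking_prob(1, 1, 4, 150)
```

Under these placeholder coefficients, a lower price raises the booking probability because the Price coefficient is negative.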
Hotel Booking Prediction: Target: Booking decision (for a hotel j on a date t)- Yes/No?
Market Share for hotel j on date t
“Logit Model”: Combining Logistic Regression with Economics of Discrete Choices - Daniel McFadden won Nobel Prize in Economics (2000).
Date       HotelID  Beach  Downtown  Class  Price  …  Booking
9/1/2013   00001    1      1         4      150    …  Yes
9/1/2013   00002    0      1         5      200    …  No
…          …        …      …         …      …      …  …
9/24/2013  0000N    0      0         2      95     …  Yes
…          …        …      …         …      …      …  …
10/1/2013  10001    1      1         5      220    …  ???
Interpretation of Regression - Be Careful!
How to Interpret Regression Results?
- Causal Effect Identification Strategies;
- Economic Value of Online Word-of-Mouth;
- Social Network Influences;
- Multichannel Advertising Attribution;
- Randomized Field Experiments of Mobile Recommendations.
Example 1: Economic Effects of Word-of-Mouth
References:
- “Reviews, Reputation, and Revenue: The Case of Yelp.com”, Michael Luca, Harvard Business School.
- “Learning from the Crowd: Regression Discontinuity Estimates of the Effects of an Online Review Database”, Michael Anderson & Jeremy Magruder, UC Berkeley.
NYTimes 9/2012 HBS 2011
Question: How do online reviews affect demand for restaurants? (Yelp, SFO) http://www.guardian.co.uk/lifeandstyle/2012/sep/02/ratings-boost-restaurants
Finding: Half-star increase in rating makes it more likely to be full during peak dining hour.
Challenge – Causality: “After all, restaurants that get good reviews are those with good quality anyway, and they would probably do well even in the absence of any reviews…“
Confounding Factors – unobserved quality, advertisements, …
\text{Sales}_j = \beta_0 + \beta_1 \text{Rating}_j + \beta_2 \text{Price}_j + ? + \varepsilon_j
“A special ‘get out the vote’ message, showing each user pictures of friends who said they had already voted, generated 340,000 additional votes nationwide —(Democrats or Republicans)”
Vote Obesity Smoking Purchase iPhone 6 Go to graduate school Get married Get divorced …
Is there a “peer influence”?
73
Regression tells you only “correlation”
The causality is not clear…
How to identify the causal influence?
Causal Identification “Propensity Score” Matching: Create a “matched sample” of treated and untreated groups, where “treatment” is having 1 (or 2, 3, 4 or more) friends who smoke.
Match each treated subject with untreated subjects who, conditional on a vector of observed features, were similarly likely to be treated but were not. (Similar subjects across groups, except one group receives the treatment.)
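The matching step can be sketched as nearest-neighbor matching on the estimated propensity score. The scores and outcomes below are hypothetical, for illustration only:

```python
# Hypothetical (propensity score, outcome) pairs.
treated   = [(0.81, 5.0), (0.62, 4.1), (0.55, 3.9)]
untreated = [(0.80, 4.2), (0.60, 3.8), (0.30, 2.5), (0.52, 3.6)]

def matched_effect(treated, untreated):
    """Average treated-minus-matched-control outcome difference:
    each treated unit is paired with the untreated unit whose
    propensity score is closest."""
    diffs = []
    for score, outcome in treated:
        match = min(untreated, key=lambda u: abs(u[0] - score))
        diffs.append(outcome - match[1])
    return sum(diffs) / len(diffs)

att = matched_effect(treated, untreated)  # average effect on the treated
```

In practice the propensity score itself is estimated first (e.g., with a logistic regression of treatment on the observed features), and matching is done with a caliper or with replacement; this sketch shows only the pairing-and-comparison idea.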
Reaching customers more often across more channels will inevitably improve your brand recognition. [Digital Marketing Insight]
New social media marketing channels, e.g., Facebook (100% of the Ad Age Top 100 Advertisers have established Facebook pages for their brands).
Example 3: Multichannel Advertising (WCAI)
Brands can attract hundreds, thousands, or even millions of “LIKEs” on FB.
HERSHEY'S: 6,037,545 likes Coca-Cola: 72,299,176 likes Disney: 45,049,811 likes Nike: 15,327,236 likes ……
Example 3: Multichannel Advertising
Are the fans really fans?
Do FB LIKEs/brand exposure lead to more sales?
Starbucks Fans: 4.2 times more likely to visit Starbucks.com; Southwest Fans: 3.6 times more likely to visit Southwest.com. - comScore (2011)
Example 3: Multichannel Advertising
o Audience self-selection (e.g., being targeted or clicking “LIKE” due to inherent brand affinity)
o Other unobserved confounding factors
Lack of effective metrics in measuring brand advertising effects
Causal effect identification
These factors bias the estimated effect of the marketing effort.
Overlapping advertising campaigns from other media channels (e.g., TV advertising, online display ads, …); multiple media channels tend to interact strongly with each other.
This, too, biases the estimated effect of the marketing effort.
It becomes increasingly difficult to uniquely identify the effects of different types of marketing efforts on brand purchases.
[Diagram: Social Media, Online Display, and TV Ads all drive Brand Purchases, alongside confounding factors and cross-channel synergy (synergy is not the main focus).]
Interdisciplinary Causal Analysis
We propose to use an interdisciplinary approach combining economic modeling with machine learning methods for counterfactual analyses, to examine the synergetic effect of social media exposure on brand sales in the context of multi-channel advertising. In particular, we examine the advertising effect across three channels:
• Facebook
• Online display (banner, pre-roll)
• Offline TV
• 12-month panel (2011/1/1 - 2011/12/31)
• FB, online, TV, purchase, and sociodemographic data
• Both categories, both the sponsor's umbrands and competitors' umbrands, leading to 4 umbrands
• Goal: daily FB exposure → brand purchases in the next 30 days
Example 3: Multichannel Advertising
We have built different econometric and machine learning models for counterfactual analysis to study the synergetic effect of FB exposure on brand purchase from a causal perspective:
Propensity Score Matching Diff-in-Diff Hazard Model Targeted Learning
Example 3: Multichannel Advertising
First, we use a propensity score matching (PSM) approach, where we match customers based on:
• Socio-demographic attributes;
• TV ads exposure;
• Online display ads exposure.
The idea is to simulate a randomized experiment: a control group and a treatment group with the same predicted propensity of being treated (i.e., FB exposure), so that the only difference between the two groups is that one is treated and the other is not.
Based on the matched samples, we use a diff-in-diff model to control for both time-invariant and time-varying unobserved factors.
First-level difference: monitor changes in brand sales over time for each group → group fixed effect.
Second-level difference: monitor the discrepancy in those changes between the two groups → group-time fixed effect.
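The two levels of differencing reduce, for group means, to a single subtraction of subtractions. A minimal sketch with hypothetical pre/post means for each group:

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Diff-in-diff estimate: the over-time change in the treated group
    minus the over-time change in the control group."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical means: both groups trend upward, but the treated group
# rises by 4 while the control group rises by only 2.
effect = diff_in_diff(10.0, 14.0, 9.0, 11.0)  # -> 2.0
```

Subtracting the control group's change nets out the common time trend (and any time-invariant group difference), which is what lets the design control for unobserved factors shared across groups.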
We find that if we consider all four brands and estimate the overall FB effect, the estimates are insignificant.
Our guess: FB exposure may have different effects on sales for different brands.
Therefore, we further break down the analysis and look at each brand separately.
Model                              FB Treatment Effect      Std. Err
Diff-in-Diff                       -0.2635*                 0.0274
Hazard                             0.0032* (hazard ratio)   0.0023
Hazard (time-varying covariates)   1.0004 (hazard ratio)    0.0042
GCOMP (MLE)                        --                       --
IPTW                               -0.0037*                 0.0000
DR                                 -0.0173*                 0.0000
TMLE                               -0.0105*                 0.0022
We find FB effects vary across brands (here statistically significant and negative).
Brand 32 (Well-established Brand):
Model                              FB Treatment Effect      Std. Err
Diff-in-Diff                       0.1933*                  0.0041
Hazard                             1.0168* (hazard ratio)   0.0005
Hazard (time-varying covariates)   1.0112 (hazard ratio)    0.0061
GCOMP (MLE)                        --                       --
IPTW                               0.2575*                  0.0000
DR                                 0.2019*                  0.0000
TMLE                               0.1821*                  0.0012
We find FB effects vary across brands (here statistically significant and positive).
Brand 73 (New, Less-established Brand):
FB exposure has statistically significant effects on brand sales for certain brands (not all). However, the effects appear small in magnitude.
Effects of FB exposure vary across brands. (Not always positive! – May depend on the brand page activities.)
Negative FB effect exists on next-30-day purchase for certain (more well-known/established) brands.
Traditional advertising channels (TV/Online) have larger effects than FB.
In general, findings are highly consistent across multiple causal analytical models.
We have tried the number of exposures (and its log) rather than a binary indicator for each type of exposure (FB/Online/TV);
We have broken down different types of FB exposure (owned/earned) and different types of online exposure (banner/pre-roll/FB banner);
We have tried different sliding time windows for the purchase measurement (next 1, 3, 7, 14, 30 days).
Example 3: Multichannel Advertising
Quantify the synergetic effects of FB exposure, taking into account multi-channel advertising efforts;
Causal analyses include propensity score matching, diff-in-diff, hazard models, and machine learning methods;
FB brand advertising (through FB fan page own/earned messages) may not always improve brand sales (effects also vary across brands);
Traditional advertising channels, especially TV, are still the major effective marketing channels.
Example 3: Multichannel Advertising
Causal Identification “Propensity Score” Matching: Create a “matched sample” of treated and untreated groups, where “treatment” is having 1 (or 2, 3, 4 or more) friends who smoke.
Match each treated subject with untreated subjects who, conditional on a vector of observed features, were similarly likely to be treated but were not. (Similar subjects across groups, except one group receives the treatment.)
Need Data!
Causal Identification
Randomized Experiment: Control Group vs. Treatment Group (Manipulation: e.g., expose to Facebook ads)
Need randomized group assignment!
Many companies are doing it today. e.g., A/B Testing: Facebook, Travelocity, Yelp, etc.
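The key step in such A/B tests is the randomized group assignment itself; a minimal sketch (the user IDs and group labels are illustrative):

```python
import random

def assign_groups(user_ids, seed=0):
    """Independently assign each user to control or treatment at random.
    A fixed seed makes the assignment reproducible."""
    rng = random.Random(seed)
    return {uid: rng.choice(["control", "treatment"]) for uid in user_ids}

groups = assign_groups(range(1000))
```

Because assignment is random and independent of user characteristics, any systematic outcome difference between the groups can be attributed to the manipulation rather than to self-selection.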
Example 4: Randomized Field Experiments
Business Question:
How Effective is Mobile Recommendation?
A major shopping mall in Beijing:
◦ 120,000 square meters
◦ 100,000 visitors per day; 200,000 per day during holidays
◦ WiFi localization system
◦ 300+ stores
Example 4: Randomized Field Experiments
Group 0: Randomly select 1,000 consumers; do nothing, just observe.
Group 1: Randomly select 2,000 consumers; send random store sale information.
Group 2: Pure location-based recommendation ◦ Send store promotion messages to 2,000 randomly selected consumers based purely on real-time location.
Group 3: Trajectory-based recommendation ◦ Send promotion messages to 2,000 randomly selected consumers based on our recommendation system.
Trajectory-based Mobile Recommendation (TMR) leads to:
Higher promotion response rate and faster redemption; more overall time spent in the mall; more revenue to the focal store; higher overall satisfaction.
How to Interpret Regression Results?
- Causal Effect Identification Strategies;
- Economic Value of Online Word-of-Mouth;
- Social Network Influences;
- Multichannel Advertising Attribution;
- Randomized Field Experiments of Mobile Recommendations.
Day 1: BI & DA Overview, Business Cases
Day 2: Machine Learning & Data Mining Basics - Group Assignment (1)
Day 3: Predictive Modeling vs. Causal Inferences
- How to Interpret Regression Results
- Causal Identification Strategies
- Economic Value of Online Word-of-Mouth
- Social Network Influence
- Multichannel Advertising Attribution
- Randomized Field Experiment of Mobile Recommendation
Day 4: Bridging Machine Learning with Social Science
- Case 1: Interplay Between Social Media & Search Engine
- Case 2: Understand and Predict Consumer Search and Purchase Behavior
- Case 3: Text Mining & Sponsored Search Advertising