Based in part on joint work with: Denis Nekipelov (UVa, MSR), Guido Imbens (Stanford GSB), Stefan Wager (Stanford), Dean Eckles (MIT)

Data-Driven Market Design

Susan Athey, The Economics of Technology Professor, Stanford GSB

Consulting researcher, Microsoft Research

Marketplaces
• Uber, Lyft, Airbnb, TaskRabbit, Rover, Zeel, Urbansitter
• Two groups of customers
  – Cross-side network effects

Auction-based platforms
• Online advertising
• Used cars
• eBay

Market design matters

Introduction

eBay examples

• Eliminate listing fees

• Make pictures free

• Change search algorithm
  – Emphasize price
  – Force sellers into more uniform categories

• Shipping costs

Search advertising examples

• Change broad match criteria for looser matching

• Change pricing

Market Design Examples: Short Run v. Long Run

See, e.g.:
• “Asymmetric Information, Adverse Selection and Online Disclosure: The Case of eBay Motors,” Lewis, 2011
• “Consumer Price Search and Platform Design in Internet Commerce,” Dinerstein, Einav, Levin, Sundaresan, 2014
• “A Structural Model of Sponsored Search Auctions,” Athey & Nekipelov, 2012

Theoretical framework is key

• Makes arguments coherent and precise

• Identifies equilibrium/long-term effects

• Advertiser/sellers and consumer choices incorporated

Complement with data

Influencing Market Design

Data can inform what kinds of designs will work better or worse in a range of environments similar to the existing one

Advocacy for design issues is much more effective with theory and data combined

Experimentation crucial but also has limitations

• Short-term experiments can’t show long-term outcomes, feedback effects

• Can help gain insight by analyzing heterogeneity of effects

One branch of empirical economics imposes structure on empirical analysis in order to learn model “primitives” and perform “counterfactuals”

• Learn bidder values, predict equilibrium responses

• Map between short-run user experience and long-term willingness to click

Data, Experimentation, and Counterfactuals

A/B Testing


[Figure: A/B test flow. 100% of users are split 50%/50% between Control (existing system) and Treatment (modified system). User interactions are instrumented, analyzed, and compared; the control average outcome and the treatment average outcome feed results analysis and future design decisions.]
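The comparison of control and treatment average outcomes can be sketched as a difference in means with a standard error. This is a minimal illustration, not the analysis pipeline used in the talk; the function name `ab_test` and the choice of an unpooled (Welch-style) standard error are assumptions.

```python
import math
from statistics import mean, variance


def ab_test(control, treatment):
    """Compare average outcomes of two randomly assigned groups.

    Returns (difference in means, standard error, z-statistic).
    Assumes independent units; the slides argue this assumption
    often fails in marketplaces.
    """
    diff = mean(treatment) - mean(control)
    # Unpooled standard error of the difference in means (an assumption).
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    return diff, se, diff / se
```

For example, `ab_test([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])` returns a difference of 2 with standard error 1.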

Short Term A/B Tests Have Limitations: Search Advertising Case Study

Standard industry practice

– Run A/B tests on users or page views to test new algorithms

– Make ship decision based on revenue & user impact


Multiple errors

– Unit of experimentation is not unit of analysis
  • Each advertiser only sees change on 1% of traffic

– Interaction effects ignored, fixing bids & budgets
  • When released to market, other advertisers might hit budget constraints, causing a given advertiser to hit their own

– Reactions of bids and budgets ignored

– Long term participation ignored



Cheaper fixes (motivated by theory)

- Modify evaluation criteria (short term metrics)
  - Instead of measuring actual short-term performance, focus on part that is correlated with long-term
  - E.g. only count good clicks
  - Theory: advertisers won’t pay for bad clicks in equilibrium

- Do a small number of long term studies to relate short term metrics to long term
  - Up-front study only captures responsiveness to types of changes observed in study, so it can’t answer all questions

- Add in constraints that “protect” advertisers
  - E.g. constrain price increases at the advertiser level
  - But advertisers very heterogeneous in preferences and responsiveness



Expensive solution I: Long-term advertiser-based A/B test

- Apply treatments to randomly selected advertisers (stratify)

- Watch for a long time

- Problems:

• Ignores advertiser interactions!!

• Unclear whether conclusions will generalize once other advertisers respond

• Expensive and disruptive to advertisers under experimentation

• Takes a long time (advertisers respond slowly), interferes with ongoing innovation


eCommerce Markets as Bipartite Graphs

[Figure: bipartite graph linking queries (Query A, Query B, Query C) to sellers.]

Testing Interactions: Athey, Eckles, Imbens (2015)

• How to do inference properly for various hypotheses?
  – Method for exact p-values for a class of non-sharp null hypotheses
    • Exact: no large sample approximations
    • Sharp: outcome for each unit known under the null for different treatment assignments
  – Test hypotheses of the form
    • “Treating units j related to unit i in way Z has no effect on i”
    • “Only fraction of neighbors treated matters, not identity or their network position”
  – Novel insight
    • Can turn non-sharp null into a sharp one by defining an artificial experiment and analyzing only a subset of the units as “focal” units

• Best ways to analyze existing experiments
  – Most powerful test statistics
    • Develop model-based test statistics that perform better than commonly used heuristics
    • Insight: write down a structural model and use score-based test
  – How to use exogenous variation that is in the data most effectively

Testing Hypotheses About Friends of Friends

[Figure: network with nodes A to I. A and B are focal units. C is auxiliary to focal unit A, G is auxiliary to focal unit B, and D and F are auxiliary to both focal units. The remaining nodes (E, H, I) are buffers: one for focal unit A, one for focal unit B, and one for both.]

Aux FOF treated v. control: A has C, F v. D; B has F v. D, G

Focal unit   Yi(0 FOF*)   Yi(>1 FOF*)
A            3            3
B            2            2
*Holding fixed own treatment and friends’ treatment

                          Alt. assignments of FOF Wi
Aux unit       Aux Wi     1     2     3     4     5     6
C              1          1     0     1     0     0     1
D              0          1     0     0     1     1     0
F              1          0     1     0     1     0     1
G              0          0     1     1     0     1     0
Probabilities             1/6   1/6   1/6   1/6   1/6   1/6

Test statistic: Edge Level Contrast for FOF links between Focal and Auxiliary units

Observed: 1/3
Assignment 1: 8/3-7/3=1/3
Assignment 2: 7/3-8/3=-1/3
Assignment 3: 5/2-5/2=0
Assignment 4: 5/2-5/2=0
Assignment 5: 7/3-8/3=-1/3
Assignment 6: 8/3-7/3=1/3
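The toy example on this slide can be reproduced with a short randomization test. The sketch below takes the outcomes, links, and observed assignment from the slide (focal outcomes A = 3 and B = 2; A linked to C, D, F; B linked to D, F, G; C and F treated). The function names and the two-sided comparison of absolute values are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Data copied from the slide's toy example.
outcomes = {"A": 3, "B": 2}  # focal-unit outcomes, fixed under the null
edges = [("A", "C"), ("A", "D"), ("A", "F"),
         ("B", "D"), ("B", "F"), ("B", "G")]  # focal to auxiliary FOF links
aux_units = ["C", "D", "F", "G"]
observed_treated = {"C", "F"}


def edge_contrast(treated):
    """Mean focal outcome over links to treated auxiliary units
    minus the mean over links to control auxiliary units."""
    t = [outcomes[f] for f, a in edges if a in treated]
    c = [outcomes[f] for f, a in edges if a not in treated]
    return Fraction(sum(t), len(t)) - Fraction(sum(c), len(c))


def exact_p_value(observed, n_treated=2):
    """Share of equally likely re-assignments of the auxiliary units whose
    absolute statistic is at least the observed one (two-sided: assumption)."""
    t_obs = abs(edge_contrast(observed))
    draws = [set(s) for s in combinations(aux_units, n_treated)]
    extreme = sum(abs(edge_contrast(s)) >= t_obs for s in draws)
    return Fraction(extreme, len(draws))
```

Here `edge_contrast(observed_treated)` gives 1/3, matching the slide, and enumerating the six assignments reproduces the statistics 1/3, -1/3, 0, 0, -1/3, 1/3. No large-sample approximation is involved because every assignment is enumerated.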



Expensive solution II: Long-term market-based A/B test

Community Detection

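Modularity-based algorithm identifies clusters: Modularity $Q = \sum_c \left( e_{cc} - a_c^2 \right)$

$e_{cc}$: fraction of links with both nodes inside community $c$

$a_c$: fraction of link ends attached to nodes in community $c$ (links with at least one node in the community)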
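As a rough illustration of the quantity a modularity-based algorithm maximizes, the sketch below scores a given partition using the standard Newman-Girvan form: for each community, the fraction of links inside it minus the squared fraction of link ends attached to it. The function name and the edge-list input format are assumptions; the clustering algorithm that searches over partitions is not shown.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    m = len(edges)
    inside = defaultdict(int)  # links with both ends in the community
    ends = defaultdict(int)    # link ends attached to the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)
```

Two triangles joined by a single link, each triangle treated as one community, score 5/14; putting every node in one community scores 0.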

Clustered Randomization and Community-Level Analysis

[Figure: network partitioned into clusters, with each cluster assigned as a whole to Treated or Control.]

Ugander et al (2013) proposal:

• Define “treatment” as having high share of friends treated

• Use propensity score weighting to adjust for non-random assignment to this condition

This is an open research area
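The first two ingredients, cluster-level assignment and the share of friends treated, can be sketched as follows. The function names, the 0.75 threshold, and the 50% treatment probability are hypothetical choices. The propensity score weighting step is not implemented here, since it requires the probability of each unit reaching the high-exposure condition under the cluster design.

```python
import random


def randomize_clusters(cluster_of, p_treat=0.5, seed=0):
    """Assign treatment to whole clusters; every unit inherits its cluster's arm."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_of.values()))
    cluster_w = {c: int(rng.random() < p_treat) for c in clusters}
    return {unit: cluster_w[c] for unit, c in cluster_of.items()}


def treated_friend_share(friends, w):
    """Fraction of each unit's friends that are treated (units with no friends skipped)."""
    return {u: sum(w[v] for v in nbrs) / len(nbrs)
            for u, nbrs in friends.items() if nbrs}


def high_exposure(friends, w, threshold=0.75):
    """1 if the share of treated friends reaches the (hypothetical) threshold."""
    return {u: int(s >= threshold)
            for u, s in treated_friend_share(friends, w).items()}
```

Because exposure depends on where a unit sits in the network, units are not equally likely to be highly exposed, which is why a weighting adjustment is needed before comparing outcomes.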



Expensive solution II: Long-term market-based A/B test

- Cluster advertisers
  - Bipartite graph weighting links between advertisers and queries
  - Weights are clicks or revenue
  - Issue: spillovers large

- Experiment at cluster level

- Watch for a long time

- Problems:
  • Expensive and disruptive to advertisers under experimentation
  • Takes a long time (advertisers respond slowly), interferes with ongoing innovation

Further Approaches

1. Gain insight from A/B tests by studying heterogeneity of effects

– Understand how innovation affects different advertisers and queries

– Relate to other work showing which types of advertisers are most responsive

– Can also explore variety of scenarios interacted with advertiser characteristics


2. Build a structural econometric model and do counterfactual predictions

– Requires investment in model development and validation

– Relies on assumptions, may not be accepted even if validated

Experiments and Data-Mining

• Concerns about ex-post “data-mining”

– In medicine, scholars required to pre-specify analysis plan

– In economics, calls for similar protocols

• But how is researcher to predict all forms of heterogeneity in an environment with many covariates?

• Goal of Athey & Imbens 2015, Wager & Athey 2015:

– Allow researcher to specify set of potential covariates

– Data-driven search for heterogeneity in causal effects with valid standard errors

– See also Langford et al, Multi-World Testing

Segments with similar treatment effects

• Data-driven search for subgroups

• Pro: Interpretability, communicability, policy, inference with moderate sample sizes

• Con: Not the best predictor for any individual; segments unstable

Fully personalized predictions

• Non-parametric estimator of treatment effect as function of covariates

• Pro: Best possible prediction for each individual, and can do inference as per Wager-Athey 2015

• Con: Confidence intervals have poor coverage with too many covariates; hard to interpret and communicate

Our contributions:

• Optimize existing ML methods for each of these goals

• Deal with issue: no observed ground truth

• Methods provide valid confidence intervals without sparsity

Segments versus Personalized Predictions


Regression Trees for Prediction

Using Trees to Estimate Causal Effects

Model: $Y_i = Y_i(W_i)$, that is, $Y_i(1)$ if $W_i = 1$ and $Y_i(0)$ otherwise

• Random assignment of $W_i$

• Want to predict individual i’s treatment effect $\tau_i = Y_i(1) - Y_i(0)$
  – This is not observed for any individual

• Let $\mu(w, x) = E[Y_i \mid W_i = w, X_i = x]$ and $\tau(x) = \mu(1, x) - \mu(0, x)$

Using Trees to Estimate Causal Effects

• Approach 1: Analyze two groups separately
  – Estimate $\mu(1, x)$ using dataset where $W_i = 1$
  – Estimate $\mu(0, x)$ using dataset where $W_i = 0$
  – Do within-group cross-validation to choose tuning parameters
  – Construct prediction using $\hat{\mu}(1, x) - \hat{\mu}(0, x)$

• Approach 2: Estimate $\mu(w, x)$ using tree including both covariates
  – Choose tuning parameters as usual
  – Construct prediction using $\hat{\mu}(1, x) - \hat{\mu}(0, x)$
  – Estimate is zero for x where tree does not split on w

Observations
• Estimation and cross-validation not optimized for goal
• Lots of segments in Approach 1: combining two distinct ways to partition the data

Problem 1: What is a candidate estimator for $\tau_i$?

Problem 2: How do you evaluate goodness of fit for tree splitting and cross-validation?
• $\tau_i = Y_i(1) - Y_i(0)$ is not observed and thus you don’t have ground truth for any unit

Causal Trees

• Directly estimate treatment effects within each leaf

• Modify splitting criterion to focus on treatment effect heterogeneity

• Cross-validation criterion must estimate ground truth
  – Build on statistical theory

• Honest trees: one sample to split, another to estimate effects, yields valid confidence intervals
  – Anticipating honesty changes algorithms

• Result: for any ratio of covariates to observations and without sparsity assumptions, can discover meaningful heterogeneity and produce valid confidence intervals
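The honest-estimation idea can be illustrated with a deliberately small sketch: a tree with a single split on one covariate. One half of the data picks the split; the other half estimates the leaf treatment effects and standard errors. The splitting score used here (leaf size times squared estimated effect, summed over leaves) is a simplified stand-in for a criterion that rewards treatment effect heterogeneity; it omits the adjustments that anticipate honesty. All function names, the even/odd sample split, and the `min_arm` parameter are assumptions.

```python
import math
from statistics import mean, variance


def leaf_effect(rows):
    """Treated-minus-control mean outcome and its standard error.
    rows: (x, w, y) tuples; needs at least two units per arm."""
    y1 = [y for _, w, y in rows if w == 1]
    y0 = [y for _, w, y in rows if w == 0]
    tau = mean(y1) - mean(y0)
    se = math.sqrt(variance(y1) / len(y1) + variance(y0) / len(y0))
    return tau, se


def best_split(rows, min_arm=2):
    """Threshold on x maximizing sum over leaves of n_leaf * tau_hat**2."""
    best_c, best_score = None, -math.inf
    for c in sorted({x for x, _, _ in rows})[1:]:
        leaves = ([r for r in rows if r[0] < c], [r for r in rows if r[0] >= c])
        if any(sum(1 for r in leaf if r[1] == arm) < min_arm
               for leaf in leaves for arm in (0, 1)):
            continue  # skip splits that leave an arm too small in a leaf
        score = sum(len(leaf) * leaf_effect(leaf)[0] ** 2 for leaf in leaves)
        if score > best_score:
            best_c, best_score = c, score
    return best_c


def honest_stump(rows):
    """Choose the split on one half, estimate leaf effects on the other half."""
    split_sample, est_sample = rows[0::2], rows[1::2]
    c = best_split(split_sample)
    if c is None:
        raise ValueError("no admissible split found")
    left = [r for r in est_sample if r[0] < c]
    right = [r for r in est_sample if r[0] >= c]
    return c, leaf_effect(left), leaf_effect(right)
```

Because the estimation half played no role in choosing the split, the leaf estimates are not biased by the search over splits, which is the reason to report segment means and standard errors from a separate sample.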

Swapping Positions of Algo Links: Basic Results

Click-through rate (CTR) of top link moved to lower position (US All Non-Navigational)

Position               CTR
Position 1 (natural)   25.4%
Position 3             11.8%
Position 5             7.5%
Position 10            3.8%

Moving a link from position 1 to position 3 decreases CTR by 13.6 percentage points
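The 13.6-point figure is the difference between the position 1 and position 3 click-through rates. A small worked calculation, using only the four CTR values on the chart (the function name is hypothetical), gives the same drop for every demotion, both in percentage points and relative to the natural position:

```python
# CTR (%) of the natural top link when shown at each position, from the chart.
ctr = {1: 25.4, 3: 11.8, 5: 7.5, 10: 3.8}


def demotion_effect(position):
    """Drop in CTR versus position 1: (percentage points, percent of baseline)."""
    drop_pp = round(ctr[1] - ctr[position], 1)
    relative = round(100 * drop_pp / ctr[1], 1)
    return drop_pp, relative
```

Moving the link to position 3 costs 13.6 percentage points, which is roughly half of the baseline CTR.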

Search Experiment Tree: Effect of Demoting Top Link (Test Sample Effects)

Some data excluded with prob p(x): proportions do not match population

Highly navigational queries excluded

Use Test Sample for Segment Means & Std Errors to Avoid Bias

Variance of estimated treatment effects in training sample 2.5 times that in test sample

Test Sample                         Training Sample
Treatment  Standard                 Treatment  Standard
Effect     Error      Proportion    Effect     Error      Proportion
-0.124     0.004      0.202         -0.124     0.004      0.202
-0.134     0.010      0.025         -0.135     0.010      0.024
-0.010     0.004      0.013         -0.007     0.004      0.013
-0.215     0.013      0.021         -0.247     0.013      0.022
-0.145     0.003      0.305         -0.148     0.003      0.304
-0.111     0.006      0.063         -0.110     0.006      0.064
-0.230     0.028      0.004         -0.268     0.028      0.004
-0.058     0.010      0.017         -0.032     0.010      0.017
-0.087     0.031      0.003         -0.056     0.029      0.003
-0.151     0.005      0.119         -0.169     0.005      0.119
-0.174     0.024      0.005         -0.168     0.024      0.005
0.026      0.127      0.000         0.286      0.124      0.000
-0.030     0.026      0.002         -0.009     0.025      0.002
-0.135     0.014      0.011         -0.114     0.015      0.010
-0.159     0.055      0.001         -0.143     0.053      0.001
-0.014     0.026      0.001         0.008      0.050      0.000
-0.081     0.012      0.013         -0.050     0.012      0.013
-0.045     0.023      0.001         -0.045     0.021      0.001
-0.169     0.016      0.011         -0.200     0.016      0.011
-0.207     0.030      0.003         -0.279     0.031      0.003
-0.096     0.011      0.023         -0.083     0.011      0.022
-0.096     0.005      0.069         -0.096     0.005      0.070
-0.139     0.013      0.013         -0.159     0.013      0.013
-0.131     0.006      0.078         -0.128     0.006      0.078

What if we want personalized predictions?


From Trees to Random Forests (Breiman, 2001)

“Adaptive” nearest neighbors algorithm

Random forest

• Subsampling to create alternative trees
  – Plus a lower bound on the probability that each feature is sampled

• Causal tree: splitting based on treatment effects, estimate treatment effects in leaves

• Honest: two subsamples, one for tree construction, one for estimating treatment effects at leaves
  – Alternative for observational data: construct tree based on propensity for assignment to treatment (outcome is W)

• Output: predictions for the treatment effect $\tau(x)$ at covariate value x

Main results (Wager & Athey, 2015)

• First asymptotic normality result for random forests (prediction), extends to causal inference & observational setting

• Confidence intervals for causal effects

Causal Forests

Applying Method to Long-Term Prediction

• Estimate a model relating short-term metrics to long-term behavior, incorporating advertiser characteristics

• Estimate heterogeneous effects of treatment in short-term test

• Map effects to long-term impact on actors (advertisers)

• Predict long-run responses based on responsiveness of the affected advertisers

• This method has difficulty with full equilibrium response


Approach 2: Build A Structural Model

• Athey & Nekipelov (2012):
  – Assume profit maximization, estimate values

• Athey & Nekipelov (2014, in progress):
  – Specify a set of objectives
  – Estimate Bayesian model of objective type and parameters
    • Use experiments and algorithmic releases to identify objectives: different objectives predict different reactions to change
  – Model the decision to change bid

• Both: use model to predict how advertisers respond to short term metrics
  – Do short-term experiment
  – Calculate new equilibrium based on changes
  – Assume small interactions among advertisers are zero to approximate numerically


Summary: A Structural Model

• Findings
  – Substantial set of bidders predictably non-responsive in medium term (high implied cost of bid changing)
  – Exact match advertisers optimize for position, while broad match optimize for ROI from clicks
  – Short-term experiments and long-term counterfactuals can go in opposite directions
  – Examples: switch to Vickrey auction, improving accuracy of click predictor


Conclusions

• Power of A/B Testing leads to a culture of relying heavily on experiments

• Standard experiments are not appropriate for many problems

• Expensive to use correct experimentation approach

• Layering analytics and structural models on top of experiments is cheaper

• But a culture of short-term experiments leads to resistance to non-experimental analytics, leading to short term focus for innovation

• Advice:

– Use rules of thumb to trigger when more costly long-term approaches are brought to bear

– Take every opportunity to study long-term effects, e.g. use staggered rollouts

– Study heterogeneity and build insight

