

A Causal Framework for Digital Attribution

Joseph Kelly, Jon Vaver, Jim Koehler

Google LLC

Abstract

In this paper we tackle the marketing problem of assigning credit for a successful outcome to events that occur prior to the success, otherwise known as the attribution problem. In the world of digital advertising, attribution is widely used to formulate and evaluate marketing, but often without a clear specification of the measurement objective and the decision-making needs. We formalize the problem of attribution under a causal framework, note its shortcomings, and suggest an attribution algorithm that is evaluated via simulation.

1 Introduction

Advertisers have a primary need to know how best to allocate their advertising resources to maximize the return on their investment. In digital advertising, this often translates into estimating how effective advertising campaigns are at increasing the number of purchases made by consumers. This measurement is frequently estimated using digital attribution (Analytics, 2018).

Figure 1: Consumer Path Example

Digital attribution is the process of assigning credit for a successful outcome (known as a conversion) to observed digital consumer engagements that occurred prior to the converting event. For example, in Figure 1, the advertiser would like to know how much credit for the conversion should be shared between the Display Ad Impression, Search Ad Impression, and a Direct Navigation to the website. This event-level attribution credit may be aggregated across paths or other event-level features, such as a keyword or geographic location, and used by advertisers to make decisions about allocating funds.

Advertisers have many choices when it comes to attribution modeling. The simplest are the rules-based models: the last event model, which gives all credit to the event just prior to the conversion; the first event model, which gives all credit to the first observed event; and the linear model, which assigns equal credit across all observed events (Analytics, 2017). There are other more complex models, including various forms of data-driven attribution, such as the one we describe in this paper.
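The three rules-based models can be sketched in a few lines. This is an illustrative sketch, not any product's implementation; the event names and helper names are ours.

```python
def last_event(path):
    """All credit to the event immediately preceding the conversion."""
    return {path[-1]: 1.0}

def first_event(path):
    """All credit to the first observed event."""
    return {path[0]: 1.0}

def linear(path):
    """Equal credit across all observed events; repeated events accumulate."""
    credit = {}
    for event in path:
        credit[event] = credit.get(event, 0.0) + 1.0 / len(path)
    return credit

# The path mirrors Figure 1 (event labels are illustrative).
path = ["Display Ad Impression", "Direct Navigation", "Search Ad Impression"]
print(last_event(path))   # all credit to the Search Ad Impression
print(linear(path))       # one third of the credit to each event
```

Each model returns a map from event to credit share summing to one, which is exactly the "credit must add to the total number of reported conversions" constraint discussed in the next section.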

2 Issues With Current Attribution Practices

Although popular and easy to use in practice, there are significant drawbacks to the rules-based models and current attribution practices. The first is that the concept of credit is never defined in a principled way, so it is unclear what each algorithm is actually trying to estimate. This leads advertisers to use attribution results to solve a myriad of problems. For example, an advertiser may use last-click attribution from a set of data to evaluate the value of an entire ad campaign, decide how to allocate spend across multiple advertising campaigns, or even figure out what to bid in an ad auction. By


not having a strict definition of what an attribution algorithm is trying to achieve, we are left with an algorithm that is used to solve multiple problems, while not being suitable to solve any one of them.

In addition to this ambiguity in the measurement objective, there are additional problems related to entrenched reporting practices. Advertisers want an automated and continuous method of measurement with unequivocal results that are easy to interpret. These expectations have led to the establishment of requirements for attribution that imply underlying assumptions that do not hold in practice and, in fact, undermine the task of measuring ad effectiveness. These include the requirements:

• credit must add to the total number of reported conversions, both at the path level and in aggregate,

• credit is non-negative at the event level, and therefore also at the channel level,

• credit is additive and no credit is explicitly assigned to ad interactions,

• credit is not assigned to events in non-converting paths.

These restrictions reflect the desire of advertisers to have a simple measure of how their advertising works, but accommodating this desire leads to implicit assumptions that are limiting and unlikely to hold in practice. For example, by only considering non-negative credits, attribution models implicitly assume that advertising cannot have a negative effect. This is reasonable since the goal of advertising is to drive conversions. However, there can be instances in which an ad may cause a user not to buy a product, and if this is the case, it is a very strong and useful piece of information that is lost under existing attribution measurement. This non-negative constraint can also cause issues in the estimation process when advertising is completely, or close to being, ineffective. Due to the probabilistic nature of user behavior, having some negative credits is appropriate and expected, even if the overall ad campaign had a positive effect in expectation. Prohibiting negative credit from occurring at all places an unreasonable constraint on the estimation and can lead to bias.

Continuing, the fact that attribution methods attribute a credit to each individual event, rather than assigning credits to combinations or sequences of events, also indicates that an independence assumption is being made. That is, the sum of marginal effects is guaranteed to equal the joint effect of all events only if the underlying events are independent. This is quite unlikely in practice, especially if the notion of an advertising funnel holds true.

Lastly, the notion that it is necessary to account for every conversion, both at the path and aggregate level, can cause serious issues. Namely, it implies that all relevant events and pieces of information that went into the user's decision to convert are completely observed and present in the user's path. That is, if unpaid (non-ad) events are present in the path, then the assumption is that credit that can't be attributed to ads can be explained by these observed unpaid events. If only paid events are present, then the assumption is that no external factors or user actions caused the user to convert other than the observed advertising interventions.

This requirement also ignores the user's propensity to convert. As an extreme example, consider the situation in which only paid ad events are in the path and these have no impact on user behavior. That is, a user converts for reasons not related to the ad, and thus by definition the events preceding a conversion are independent of the conversion status. Reporting requirements dictate that the credit for the conversions will be divvied up among these events even when we know they had no effect. Data-driven methods should utilize all information in the data and, as is the case in Sapp and Vaver (2016), take the approach of considering non-converting paths in assigning credit. While this is a positive step for attribution, the goal of accounting for every conversion remains, along with the problems associated with this expectation.

Beyond these issues with reporting, advertisers often choose an attribution model based


on preconceived notions about how attribution credit should be allocated, which can lead to poor decision-making. A better approach is to choose an attribution model based on its demonstrated ability to answer a specific decision-related question. The approach we take in this paper is to choose a single measurement goal, which is to estimate the number of additional conversions generated by a single ad channel. Here an ad channel is defined as a grouping of similar campaigns that the attribution algorithm treats as equivalent. This is inherently a causal question, so answering it requires defining core concepts such as credit in terms of a causal effect.

3 Attribution Through Experimentation

We take the viewpoint that attribution is actually trying to estimate the effectiveness of different types of advertising. Namely, to estimate the effect that an intervention (an ad) has on an outcome of interest (the conversion), and as such it is inherently a causal inference problem. The causal framework we outline is extremely important as it reflects the goal of attribution: to comment on the effects of actions made by advertisers, not merely to find which events are correlated with conversions.

Experimentation is the gold standard in causal inference, so attribution is best understood when viewed as a broken randomized experiment. That is, let us imagine what experiment we would do to estimate the effect of advertising, and then examine how the observed data from attribution products fit into that experiment. This approach makes it clear when, and how, observational methods of measurement, such as attribution models, are appropriate. It requires the clear definition of an estimand and makes it possible to conduct simulation studies, as described in Section 7, that can be used to evaluate attribution models with experiments that are run on simulated path data. Most importantly, this causal motivation guides the development of new attribution methods that explicitly resolve discrepancies between the assumptions of the experiment and the attribution model.

Advertising is often organized such that multiple ad channels are concurrently active. This proposal tackles one of the goals of attribution: determining how valuable each channel is in affecting a user's propensity to convert. It is hypothesized that for each ad channel there is a corresponding experiment of interest, where in one arm of the experiment the ad channel is active and in the other arm it is inactive. The attributable value of the ad channel is defined as the difference between the number of conversions in the two experimental arms. Note that this marginal effect of an ad channel is still relevant when there are multiple ad channels. Each ad channel can be active or inactive and therefore has its own marginal effect even in the presence of other active channels. Additional effects can be defined with a more complex experiment in which multiple ad channels are concurrently active or inactive. This would allow for measures of interactions between ad channels.
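As a rough illustration of this two-arm channel experiment, the following simulation compares conversion counts with the channel active versus inactive. All rates (organic conversion probability, channel reach, additive lift) are invented for the sketch and are not from the paper.

```python
import random

def simulate_arm(channel_active, n_users=100_000, seed=0):
    """Count conversions in one experimental arm (rates are made up)."""
    rng = random.Random(seed)
    conversions = 0
    for _ in range(n_users):
        # Short-circuit: no exposure draw happens when the channel is off.
        exposed = channel_active and rng.random() < 0.5   # channel reach
        p = 0.02 + (0.01 if exposed else 0.0)             # organic rate + ad lift
        conversions += rng.random() < p
    return conversions

delta = simulate_arm(True) - simulate_arm(False)
print(delta)  # roughly n_users * 0.5 * 0.01 = 500 incremental conversions
```

The difference between the arms is the attributable value of the channel in this toy world; the rest of the paper is about recovering this quantity without actually running the experiment.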

One such experiment would be a full factorial experiment with every combination of the ad channels being active or inactive. In Appendix A, we present a proposal that describes a set of contrasts that measure the relative importance of each channel across the full factorial experiment by combining main and interaction effects. This approach is very much in line with that proposed in Shapley (1953) and, although it partitions credit for the aggregate incremental conversions generated by the ad channels, it provides a different set of information to the advertiser that is arguably less actionable. This is because the incremental effect of a channel is directly tied to an advertiser action of turning a channel on or off. A Shapley value, or the set of contrasts from a full factorial embedding multiple effects, is, by definition, a measure under many different advertiser actions. Each combination of advertiser actions represents a particular arm of the full factorial experiment and, as only one arm of the experiment or set of advertiser actions can be taken at any one time, this makes the information much less actionable. On the other hand, this full factorial approach does accomplish the goal of providing a systematic objective for allocating conversion credit across paid ad channels along with a no-advertising conversion baseline.

We tackle one use case for attribution: the effect of turning a whole channel on or off. This use case is in line with estimating the return on ad spend (ROAS) at the channel level, which can advise the advertiser regarding the worth of an entire channel. Unlike existing attribution methods that assign credit to every event, the proposal addressed in this paper does not estimate the effect of each ad instance at the ad serving level. Identifying a credit for each ad event would be more useful for practices such as bidding on ad serving opportunities in an auction environment. However, these credits may not accurately estimate the ROAS of an entire channel. The fact that current attribution methods often try to answer these two questions simultaneously with one algorithm is a concern, as equivalence exists only under some strict assumptions. Avoiding this pitfall is the first step in the systematic design of an attribution model. Although in this paper we only consider the ROAS case, both questions can be formulated and answered within the causal framework proposed.

4 Causal Framework

As we indicated previously, attribution estimates the impact that an intervention has on an outcome, so it is natural to view it under the Rubin Causal Model (RCM) framework (Rubin, 1974). Under the RCM framework we can quantify the causal effect (the estimand), describe how to estimate this quantity (the estimator), and list assumptions under which the estimator has desirable properties such as unbiasedness.

An RCM aims to embed an observational study into a hypothetical randomized experiment that happens to be broken, in that the exact assignment mechanism of the treatment, in this case ad exposure, is unknown to the researcher. This framework also allows us to imagine what value the outcome metric would have taken for each experiment arm. While this framework can be used to describe a multi-arm experiment with each arm having a single channel turned off, for exposition purposes we focus on turning a single channel on and off in the presence of other channels that always remain on. This experiment aims to replicate the behavior and interests of advertisers who have multiple channels running on an ongoing basis and would like to know the marginal effect of a single channel in the presence of others. Knowing how a channel contributes to the number of conversions can inform the marginal return on ad spend (ROAS) for that channel.

For the attribution problem, the experimental units are individual users who have the potential to be served ads. In order to make the problem tractable, and for our inferences to be causal, it is necessary to make assumptions about the underlying data generation process. The first assumption we make is related to how the users and their corresponding digital histories are sampled:

Assumption 1 (Simple Random Sample from a Superpopulation). N users are independently and identically sampled from an infinite superpopulation. This is equivalent to setting an observation window and observing all the users with events in the window, where the start and end dates of the window are independent of the distribution of the collected data.

The second assumption is about the ad serving mechanism, and it is made by the majority of existing attribution models (often implicitly), including the rules-based ones:

Assumption 2 (Stable Unit Treatment Value Assumption - SUTVA). There are two components to this assumption. The first is that there is no interference between users, and the potential outcomes of one user are not affected by the treatment assignment of another. That is, whether or not one user sees an ad has no influence on whether a different user converts. Secondly, we assume that there are no hidden variations of treatment, and that neither the label of treatment nor the assignment has an effect on the potential outcomes (Imbens and Rubin, 2015).


Under these first two assumptions we can equate the advertiser-level experiment of turning off an entire channel with the individual effect of an ad at the user level. One consequence is that an advertising channel can only affect a user if that user actually sees an ad. This means that a user will have the same number of conversions if they do not see the ad when the channel is active as they would have if the channel was inactive. This identification allows us to estimate the effect of turning off an ad channel solely by observing those users who did and did not see ads from the channel when it was active.

While it is likely that the first two assumptions are required by most attribution models, our third assumption could be relaxed and replaced by an alternate modeling assumption.

Assumption 3 (Conditionally, the Passage of Time has No Effect). Given the sequence of events, whether they be ads or organic actions by the user, the time between events is independent of conversion status. That is, the actual calendar time of the events does not matter, as all of the relevant information is contained in the sequence of events.

The consequence of Assumption 3 is that we only need to consider the sequence of events when doing attribution, and not the timestamps of those events. We assume this largely for expositional reasons, and the theory outlined here can be expanded to incorporate temporal information.

Our fourth assumption is made for mathematical reasons and is largely inconsequential, as it holds for any problem in practice.

Assumption 4 (There are at Most J Events in a User's Path). A user can see an ad for the first time at any position in the sequence of events in their path, which is collected across a common observation window. We assume that each user path contains a finite number of events that is less than some integer, J.

From Assumption 4 we see that every user has the potential to see an ad for the first time at any position in their path or, alternatively, not at all. Thus, depending on when a user sees the ad, they may have a different number of conversions at the end of the experiment or observation window. These are known as potential outcomes under the RCM framework. Namely, for each user there are at most J + 1 potential outcomes (numbers of conversions), depending on whether the user saw the ad and in what position they first saw it. Note that here we are implicitly assuming that each user has a fixed organic path that only changes through the intervention of an ad and does not change through repeated experiments. This allows us to treat the organic path as a covariate that is only partially observed, depending on the position in which the ad was served. We do, however, allow for the effect of the advertising to be stochastic, and hence the J + 1 potential outcomes are random variables, with each realization representing the number of conversions resulting from a realized path.

For every user (and ignoring any user index for now) we let {C_j | j = 0, 1, . . . , J} be the set of these potential outcomes (Splawa-Neyman et al., 1990; Rubin, 2005), where C_0 is defined as the number of organic conversions (when the user does not see the ad at all). We also let Z ∈ {0, 1, . . . , J} be the treatment index that identifies the position in a user's path in which they saw the ad for the first time. Finally, we let {U_j | j = 0, 1, . . . , J} be the subsequence of events that occurred upstream of, but not including, the first ad event at position j.
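Under these definitions, Z and the upstream path U_Z can be read directly off an observed path. The sketch below assumes search is the channel under study (so other channels' events count as upstream events); the event labels and function name are ours, for illustration only.

```python
# Events of the channel under study; only these count as "treatment".
AD_EVENTS = {"Search Ad Impression"}  # assumed label, search channel

def first_exposure(path):
    """Return (Z, U_Z): position of the first ad event (0 if none)
    and the subsequence of events strictly upstream of it."""
    for pos, event in enumerate(path, start=1):
        if event in AD_EVENTS:
            return pos, tuple(path[:pos - 1])
    return 0, tuple(path)  # never treated: Z = 0, whole path is organic

# The path from Figure 1: display and direct navigation are upstream.
path = ["Direct Navigation", "Display Ad Impression", "Search Ad Impression"]
print(first_exposure(path))  # (3, ('Direct Navigation', 'Display Ad Impression'))
```

This matches the worked example in Section 6, where the user has Z = 3 with upstream path (Direct Navigation, Display Impression).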

5 Estimand

Under Assumption 2, if a user does not see an ad when the channel is active, then this user should have the same number of conversions as if the channel had been inactive. Hence, we can write


our estimand as

∆ = E(# conversions when channel is active) − E(# conversions when channel is inactive)

  = E( Σ_{j=1}^{J} 1_Z(j) (C_j − C_0) )

  = Σ_{j=1}^{J} E(C_j − C_0 | Z = j) P(Z = j)

  = Σ_{j=1}^{J} ∆_j p_j,    (1)

where ∆_j = E(C_j − C_0 | Z = j) and p_j = P(Z = j). Note that in the simple binary treatment case, ∆_j = E(C_j − C_0 | Z = j) is a standard estimand known as the Average Treatment Effect on the Treated (ATT). However, in this case there is an ATT for each event number, {∆_j | j = 1, . . . , J}, at which a user could see the ad for the first time. Our estimand, ∆, is then just the average of these ATTs weighted by the probability of seeing the ad for the first time at each event number.
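For concreteness, the weighted average in (1) can be spelled out with made-up numbers. The position-level ATTs and exposure probabilities below are invented solely to illustrate the arithmetic.

```python
# Illustrative arithmetic only: J = 3 with invented per-position ATTs
# and first-exposure probabilities.
att = {1: 0.05, 2: 0.03, 3: 0.01}   # ∆_j = E(C_j − C_0 | Z = j)
p   = {1: 0.2, 2: 0.1, 3: 0.05}     # p_j = P(Z = j); P(Z = 0) = 0.65

# Estimand (1): probability-weighted average of the position-level ATTs.
delta = sum(att[j] * p[j] for j in att)
print(round(delta, 4))  # 0.05*0.2 + 0.03*0.1 + 0.01*0.05 = 0.0135
```

Note that positions where the ad is never the first exposure (Z = 0) contribute nothing, since the estimand is defined over the treated population.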

The issue with any experiment is that only one potential outcome can ever be observed for each user. That is, it is not possible to apply treatment (i.e., show an ad for the first time) at two different points in the path for any given user, let alone apply treatment for the first time at every point in the path for every user. This is known as the fundamental problem of causal inference (Holland, 1986), and it means that, for each user in our experiment, we will be missing all but one of {C_j | j = 0, 1, . . . , J}. Note we defined Z as the variable denoting the point in the path at which the ad was served in the realization of the experiment. The definition of Z, along with the following identifiability assumption, formally links our potential outcomes under seeing the ad at any point in the event sequence and the actual outcome that was observed in the experiment that took place,

Corollary 1 (Identifiability)

C^obs = Σ_{j=0}^{J} C_j 1_Z(j).

A consequence of SUTVA (Assumption 2) is that the potential outcomes are statistically identifiable, because any change in their distribution will naturally change the distribution of the observed data.

Our final assumption, Assumption 5, relates to the ad serving mechanism, otherwise known as the treatment assignment. Here we assume that by event number j we have observed all of the relevant information related to whether a user may see an ad or not at that point, so that the treatment status is independent of the value of the potential outcomes.

Assumption 5 (Strongly Ignorable Treatment Assignment). For j = 1, . . . , J assume

(C_0, C_j) ⊥⊥ 1_Z(j) | U_j

and

0 < P(Z = j | U_j = u) < 1

for all u and j = 1, . . . , J.

A consequence of this assumption is that there are no unobserved confounders. That is, there are no missing variables that are related both to treatment and to the potential number of conversions. This assumption is required in order to correctly identify and estimate the causal effect of advertising. If there were such a confounder, any differences in conversion rates between users who saw ads and those who did not could, perhaps, be partially or fully explained by the missing confounder.

For our experiment, this assumption equates to saying that the sequence of events upstream of the ad serving opportunity is all that is required to estimate the effect of the ad at that point in the event sequence. In general, this assumption does not hold due to the existence of ad targeting. It obviously ignores important covariate information, such as age and location, which is very often used in ad targeting. In order to align with common practices, and for our comparison to be on an even keel with rules-based attribution algorithms, we restrict ourselves to only having the sequence of events available.


However, making covariate information available is the biggest opportunity for improving attribution, and the algorithm we describe is easily extended to a use case in which additional covariate information is available.

6 Estimation

Our estimator is directly informed by the estimand in (1), as we can estimate ∆ via a plug-in estimator by estimating each ∆_j and p_j separately for all j. This is desirable, as we have reduced the rather complicated problem of attribution to estimating the average treatment effect on the treated (ATT) for a binary treatment, which is a standard problem.

∆ is the average effect for a randomly sampled user, and hence we do not index by user in defining ∆. In our estimation, however, we utilize all experimental units (users) in our experiment, and we use i to index these users.

To estimate p_j we simply use the sample mean in each group,

p̂_j = N_j / N,

where N_j = Σ_{i=1}^{N} 1_{Z_i}(j), and N is the number of users sampled, as defined in Assumption 1.

In order to estimate ∆_j we need to further link our potential outcomes to the observed data. By the law of total expectation we see that

∆_j = E( E(C_j − C_0 | Z = j, U_j) | Z = j )

    = E( ∆_(j,U_j) | Z = j ),

where ∆_(j,u) = E(C_j − C_0 | Z = j, U_j = u). Then, using Corollary 1 and Assumption 5, we find that

∆_(j,u) = E(C^obs | Z = j, U_j = u) − E(C^obs | Z = 0, U_j = u).    (2)

Equation (2) shows that ∆_(j,u) can be expressed as the difference in the mean number of conversions for users with upstream path u who saw the ad at event j and those users with the same upstream path u who did not see the ad at event j. This immediately leads to the following estimator,

∆̂_(j,u) = C(j, u)/N(j, u) − C(0, u)/N(0, u),    (3)

where C(j, u) = Σ_{i=1}^{N} 1_{Z_i}(j) 1_{U_{ij}}(u) C_i^obs represents the number of observed conversions and N(j, u) = Σ_{i=1}^{N} 1_{Z_i}(j) 1_{U_{ij}}(u) the number of users who saw an ad in position j with upstream path u.

We can interpret ∆̂_(j,u) as the difference between sample conversion rates for users with upstream path u who saw the ad at event j and those who did not. In short, it is an estimate of the incremental effect of advertising at event j on users who have already taken path u. For the jth event, there are many different possible upstream paths. We can average over these upstream paths to estimate the effect of advertising for this event number,

∆̂(j) = Σ_u ∆̂_(j,u) N(j, u)/N_j.    (4)

We now have an estimate of the effect of advertising for each value of j, and we can average over the distribution of when users saw the ad in order to estimate the overall average effect of advertising in the treated population,

∆̂ = Σ_{j=1}^{J} ∆̂(j) p̂_j.    (5)

Note that this model is equivalent to the onedescribed in Sapp and Vaver (2016).
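A minimal sketch of the plug-in estimator in (3)-(5), under some simplifying assumptions of ours: each treated user is summarized as (j, upstream path, observed conversions), each never-treated user as (organic path, observed conversions), and matching controls by path prefix stands in for the indicator 1_{U_ij}(u). The function and field names are ours.

```python
from collections import defaultdict

def estimate_delta(treated, control):
    """Plug-in estimator sketch.

    treated: list of (j, u, c) -- first ad at position j >= 1, upstream
             path u (a tuple of events), c observed conversions.
    control: list of (path, c) -- users who never saw the ad.
    """
    n = len(treated) + len(control)
    cells = defaultdict(list)              # (j, u) -> conversion counts
    for j, u, c in treated:
        cells[(j, u)].append(c)
    n_by_j = defaultdict(int)              # N_j, used in p_hat_j and (4)
    for (j, u), cs in cells.items():
        n_by_j[j] += len(cs)
    delta_hat = 0.0
    for (j, u), cs in cells.items():
        # Matched controls: never-treated users whose path starts with u.
        matched = [c for path, c in control if tuple(path[: len(u)]) == u]
        if not matched:
            continue                       # no counterfactual for this cell
        d_ju = sum(cs) / len(cs) - sum(matched) / len(matched)  # (3)
        w_u = len(cs) / n_by_j[j]          # weight N(j, u)/N_j in (4)
        p_j = n_by_j[j] / n                # p_hat_j in (5)
        delta_hat += d_ju * w_u * p_j
    return delta_hat

# The Figure 2 numbers: 100 treated users (10 conversions) and 500
# matched controls (20 conversions), so delta_hat = 0.06 * (100/600).
u = ("Direct Navigation", "Display Impression")
treated = [(3, u, 1)] * 10 + [(3, u, 0)] * 90
control = [(u + ("Other Event",), 1)] * 20 + [(u, 0)] * 480
print(round(estimate_delta(treated, control), 3))  # 0.01
```

Cells with no matched controls are skipped here for simplicity; in practice they would need the missing-data treatment discussed in Section 6.1.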

To demonstrate how this estimator works, we revisit the path from Figure 1. In this path, we note that the user saw the Search Ad Impression¹ in the third position in the path, Z = 3, and had upstream path U = (Direct Navigation, Display Impression). In Figure 2, we have

¹ A search ad impression represents the intervention taken by the advertiser. We are interested in the causal effect of this impression, but attribution models often only have access to search clicks. As such, under some assumptions about the click-through rate, search clicks are used in practice as proxies for impressions, even though they are outcomes and are caused by the intervention.


aggregated all those paths that have this same upstream path, and then separated them into groups that saw an ad in position 3 and those that did not see an ad at all. From (3) we construct the estimator,

∆̂_(3,U) = 10/100 − 20/500 = 0.1 − 0.04 = 0.06.

To estimate the effect of the entire ad channel for search, ∆, this process is repeated for all uniquely observed upstream paths to estimate ∆_3, and then this process is repeated to find each ∆_j.
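The cell-level arithmetic, spelled out. The counts are from the paper's Figure 2 illustration; the variable names are ours.

```python
# Plugging the Figure 2 counts into (3).
treated_rate = 10 / 100   # 100 matched users saw the ad at position 3; 10 converted
control_rate = 20 / 500   # 500 matched users saw no ad; 20 converted
print(round(treated_rate - control_rate, 2))  # 0.06
```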

Figure 2: An Example of Estimation

6.1 Missing Data Bias Adjustment

The estimator in (5) would be sufficient if it weren't for a data collection issue that plagues attribution products. By definition, the data collection system only collects user paths that have at least one observable event within the observation window. Some events are not detectable by the data collection process, and so users only enter the database based on whether they have an observable event in the observation window. This issue is described in Sapp and Vaver (2016) in the section Systematically Censored Users, where the authors note the bias of upstream data-driven attribution in the presence of this censoring.

This censoring would not be an issue if it affected both treated (those who see an ad) and control (those who do not see an ad) users the same. However, given that seeing an ad is captured by the data collection mechanism, we see that all treated users are observed, while the control users who do not have an observable event will be missing. Hence, using the estimators described in (4) and (5) naively, without adjusting for this censoring, will result in a biased estimate of the causal effect.

This problem only arises in the estimation of ∆_1. This is because, for users who do not immediately see an ad, i.e., users with j ≥ 2, we know that the first event in the path must have been an observable non-ad event. In the estimation of ∆_j the data collection process affects both treated and control users equally, as all units entered the sample due to the first observable event. On the other hand, estimating ∆_1 is problematic because there is no prior observable event to match on when constructing a counterfactual control group. All of the relevant treated users are observed by virtue of seeing the ad; however, not all users who could have seen the ad, but didn't, are observed, as some may not have any observable event in the observation window.

Figure 3: Missing Data Example

This issue is demonstrated in Figure 3, where both User 1 and User 4 have identical upstream paths leading up to the start of the observation window. User 4 should be used in the estimation of the counterfactual control conversion rate for User 1, but as our observation window did not encompass any observable events for User 4, they were censored and did not enter the dataset. The same issue does not exist in estimating the counterfactual conversion rate for User 2, however, as there is an observable first event and hence the counterfactual control path (User 3) will also have an observable first event and enter the dataset.

Since we assumed that the observation window is selected at random, and given our assumption that only the sequence of events is informative,



rather than the passage of time (Assumption 3), the distribution of upstream paths leading to an ad should be the same just prior to the start of the observation window as it is in the observation window. Hence, the group of users who immediately see an ad can be thought of as a mixture of treated users, each with varying upstream paths or, equivalently, as a random sample of treated users who happened to have their upstream path censored. This is illustrated in

Figure 4: Mixture Example

Figure 4, where User A1 and User B1 are indistinguishable in the observation window. If the window had begun earlier, we would be able to distinguish them, as is possible with User A2 and User B2. In this example we see that the distribution of the missing upstream paths is the same as that of the observed upstream paths in the observation window.

This property is implied by the assumption that the passage of time is uninformative given the event sequence. Under this assumption, and the fact that the missing upstream paths would have the same distribution as the observed ones in the observation window, we can construct the estimator for ∆1,

∆̂1 = [∑_{j=2}^{J} ∆̂(j) p̂j] / [∑_{j=2}^{J} p̂j],   (6)

which can be used to compute the first term in (5). Note that (6) is simply the average effect of seeing the ad at each position, weighted by the estimated probability of seeing the ad in that position. Or, equivalently, it is just the estimated average treatment effect for those who saw an ad after the first event. This adjustment is a fix to the censoring issue for the UDDA model described in Sapp and Vaver (2016).
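As a concrete illustration, the weighted average in (6) can be computed directly from position-level estimates. The sketch below is ours, not from the paper, and all input values are made up; in practice ∆̂(j) and p̂j would come from the matching procedure described above.

```python
# Hypothetical illustration of the mixture estimator in (6): the effect of an
# immediately seen ad, delta_hat_1, is the p_hat-weighted average of the
# position-specific effect estimates delta_hat(j) for j >= 2.

def mixture_delta_1(delta_hat, p_hat):
    """Weighted average of delta_hat[j] with weights p_hat[j], j = 2..J.

    delta_hat, p_hat: dicts keyed by ad position j (j >= 2), where p_hat[j]
    is the estimated probability of seeing the ad in position j.
    """
    num = sum(delta_hat[j] * p_hat[j] for j in delta_hat)
    den = sum(p_hat[j] for j in delta_hat)
    return num / den

# Made-up position-level estimates for j = 2, 3, 4.
delta_hat = {2: 0.010, 3: 0.008, 4: 0.005}
p_hat = {2: 0.05, 3: 0.03, 4: 0.02}

print(mixture_delta_1(delta_hat, p_hat))
```

Positions with larger p̂j pull the estimate toward their ∆̂(j), which is exactly the mixture interpretation given above.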

Note that we can test Assumption 3, which underpins the theory behind our calculation in (6). Although we cannot construct a counterfactual for the group of users who immediately see an ad, we can compare the observed conversion rate under seeing the ad with the rate implied by the mixture. That is, we have two estimates of the conversion rate under seeing the ad, r1, where rj = E(Cj |Z = j). The first estimate uses the sample rate for all users who happen to see the ad immediately, r̂1, where

r̂j = [∑_{i=1}^{N} Ci^obs 1_{Zi}(j)] / Nj,

and the second uses the rate implied by the mixture, ∑_{j=2}^{J} r̂j p̂j. We can test whether these two estimates are truly estimating the same population quantity by performing a two-sample t-test using the point and standard error estimates of each method.
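A minimal sketch of this diagnostic, assuming the two point estimates and their standard errors have already been computed (the numbers below are made up). To keep the sketch dependency-free we use a normal approximation for the two-sided p-value; in practice a t reference distribution with appropriate degrees of freedom would be used.

```python
import math

# Compare the directly observed rate r_hat_1 with the mixture-implied rate
# sum_j r_hat_j * p_hat_j via a two-sample statistic built from the
# standard error of each estimate.

def two_sample_stat(est_a, se_a, est_b, se_b):
    """Test statistic for H0: both estimates target the same quantity."""
    return (est_a - est_b) / math.sqrt(se_a**2 + se_b**2)

def two_sided_p_normal(stat):
    """Two-sided p-value using a normal approximation."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(stat) / math.sqrt(2))))

# Made-up estimates: (direct rate, se) vs. (mixture-implied rate, se).
stat = two_sample_stat(0.031, 0.004, 0.027, 0.003)
print(stat, two_sided_p_normal(stat))
```

A large statistic (small p-value) suggests the two estimates disagree, i.e., evidence against Assumption 3.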

7 Evaluation

One way to evaluate the method described in Section 6 is to see how closely the attribution algorithm can estimate the true effect of advertising. This is accomplished by comparing estimates from the attribution algorithm with results from corresponding experiments undertaken on real users. However, this is typically a costly procedure, and advertisers are often reluctant to turn off entire ad channels due to the missed opportunity cost of not serving ads to potential converting users. Hence, we strive to provide some level of validation through a simulation study in which we simulate the generation of user paths and the ways in which these paths are affected by ads.

To accomplish this we utilize the Digital Advertising System Simulation (DASS) (Sapp et al., 2016). DASS models the way in which users traverse the internet via a non-stationary Markov process. It generates user-level path data and has the flexibility to do so under many different assumptions about how advertising affects user behavior. We use DASS to generate sets of path data for which we can find the true effect



of an ad channel by allowing simulated users to experience both arms of an experiment (ad channel active/inactive). The difference between the number of conversions in these two arms is, by definition, ∆, as defined in (1). This experimental ground truth is used to determine how well an algorithm fares in estimating ∆ when it only observes path data from the experimental arm in which the ad channel is active, as is the case in practice.

We compare the algorithm described above, UDDA Bias Adjusted, to the existing algorithm, UDDA, as described in Sapp and Vaver (2016), where UDDA stands for Upstream Data-Driven Attribution. Note that, apart from the bias adjustment step described in Section 6.1, UDDA Bias Adjusted is functionally identical to UDDA.

It is important to evaluate an attribution algorithm under many different conditions (e.g., different types of ad impact on user behavior, different magnitudes of ad impact, and different mixes of ad channels) to help ensure that it has the flexibility to handle situations that are encountered in practice. See Singh and Vaver (2017) for a more in-depth discussion of model evaluation and the development of evaluation scenarios. Here we focus on the evaluation scenarios previously considered by Sapp and Vaver (2016).

In Figure 5 we examine how UDDA Bias Adjusted compares with the UDDA algorithm introduced in Sapp and Vaver (2016). This scenario examines the ability of an attribution algorithm to identify changes in the effectiveness of Display advertising. In this scenario, each time a user sees a display ad impression there is a chance that the user's subsequent browsing behavior will be altered. Moving from left to right on the x-axis, the effectiveness of the ad increases in the sense that users are more likely to favor activities that lead to a conversion (e.g., do an advertiser-related search or visit the advertiser's website). The Truth line indicates the true difference in the number of conversions between the two experimental arms in which the Display ad channel is active versus inactive. The error bars on the ground truth represent a 95% interval constructed through repeated experiments.

Figure 5: This plot is analogous to the first plot from Figure 6 in Sapp and Vaver (2016), in which display impressions affect downstream browsing behavior.

Figure 6: This plot is analogous to the second plot in Figure 6 from Sapp and Vaver (2016), in which display impressions affect the probability of conversion.

Figure 6 has a similar setup to Figure 5, with the exception that in this scenario the display advertising only affects a user's propensity to convert and does not otherwise alter the user's downstream browsing behavior. Again, in this



scenario we see that UDDA Bias Adjusted closely matches the Truth.

Figure 7: Here we visit a scenario in which the ad serving mechanism targets users differently depending on covariate information not available to the attribution algorithm. Moving along the x-axis from left to right, the effect of ad targeting diminishes to zero as the users become more homogeneous and hence are more similarly targeted.

Figure 7 corresponds to a scenario in which both UDDA and UDDA Bias Adjusted should have trouble. Poor performance is expected because covariate information that informs both a user's propensity to see an ad and their propensity to convert is not available to the attribution algorithms. This is a violation of Assumption 5. The result is that counterfactual comparisons may be comparing groups with different distributions of important covariates that explain some of the difference in conversion rates.

As expected, UDDA Bias Adjusted fares worse when ad targeting is more prevalent and is unbiased when there is no ad targeting. This is a problem that can't be fixed by changing the attribution algorithm. It can only be fixed by providing the attribution algorithm with covariate data related to ad targeting.

8 Conclusions

In digital advertising, attribution is widely used to inform a variety of marketing decisions. In this paper we highlight the need to identify a singular objective (Section 2) for attribution measurement and propose that advertisers are really interested in the marginal effect of an ad channel.

In viewing attribution through a scientific lens, we propose a hypothetical experiment in Section 3 that attribution products might aim to approximate. This experimental viewpoint allows for the creation of an easily interpretable estimand of interest (Section 5), which is based on a replicable action by the advertiser (turning off an ad channel).

In Section 4 we describe the attribution problem in a causal framework using the Rubin Causal Model (RCM). The benefit is that this approach requires us to state the assumptions needed to identify and estimate the estimand (causal effect) of interest. These assumptions make it easier to identify situations in which an attribution algorithm will do well or fall short in estimating ground truth, and to identify avenues for algorithm or data source improvement. The specific description provided was largely made for the purpose of demonstrating how a simple attribution algorithm can fit into a causal framework. In practice, and depending on the particular attribution environment, some of these assumptions may not hold and can be removed, relaxed or extended.

In Section 6.1, we propose a solution to the problem of working with path data that systematically censors users. This issue has plagued previous attribution algorithms and left them unusable in practice (Sapp and Vaver, 2016).

In Section 7 we conclude with a simulation study highlighting the efficacy of the UDDA Bias Adjusted algorithm. We use the DASS simulator to compare the estimates of UDDA and UDDA Bias Adjusted to the Truth and note the vast improvement over UDDA due to the missing data adjustment. In Figures 5 and 6, we see that UDDA Bias Adjusted improves on UDDA, and now all estimates fall within the 95% interval. Additionally, in Figure 7 we see improvements even when ad targeting is present and estimation is difficult due to a large degree of bias. We conclude that UDDA Bias Adjusted is a viable solution to the Systematically Censored Users problem described in Sapp and Vaver (2016).

In short, we have highlighted issues with current attribution practices and algorithms and described the attribution problem under a causal framework. This framework provides a useful guide for understanding and addressing deficiencies in attribution modeling. It should be the basis for all future attribution model development efforts.

9 Acknowledgments

We would like to thank many of our colleagues at Google for their comments and suggestions. In particular, we acknowledge the attribution team in Advanced Measurement Technologies for their many discussions, and Tony Fagan and Amy Richardson for providing comments and feedback.

A Proposal for Embedding Synergistic Effects Into One Measure Per Channel

A.1 Model for two channels

Consider the case of two channels, A and B, where we can run a full-factorial experiment as shown in Figure A1. One model formulation is as follows:2

Yij = β0 + β1 1i(1) + β2 1j(1) + β12 1i(1)1j(1) + εij,

where the i subscript is for channel A, indicating off or on by 0 or 1, respectively, and the j subscript is for channel B, similarly indicating off and

2 This deviates from the classic analysis-of-experiments model, where the intercept represents all channels half on and half off and the X-variables are coded as −1 and +1. However, the proposed formulation provides an easier understanding of the de-duping of credit and how to adjust to fit the ROAS reporting requirements.

        A   B  |  X0  X1  X2  X12
Y00     0   0  |   1   0   0    0
Y10     1   0  |   1   1   0    0
Y01     0   1  |   1   0   1    0
Y11     1   1  |   1   1   1    1

Table A1: Full factorial experiment for two channels along with the corresponding design matrix X (last four columns)

on by 0 and 1. This can be written in matrix notation as Y = Xβ + ε, where Y is the column vector given in the first column of Table A1, X is the matrix given in the last four columns of Table A1, β = (β0, β1, β2, β12)^T, and ε is a vector of error terms.3

Figure A1: Full factorial experiment for twochannels A and B with resulting outcomes Yij

We can solve for β using least squares, β̂ = (X^T X)^{-1} X^T Y, where the matrix Q = (X^T X)^{-1} X^T gives the linear combinations of the data used to estimate the parameters of the model. For the two channel case this is given in Table A2. So, for example, β̂0 = Y00 and β̂12 = Y00 − Y10 − Y01 + Y11.
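This computation is easy to reproduce numerically. The sketch below (with made-up conversion counts Y) builds X from Table A1 and recovers Q and β̂; since X is square and full rank here, Q reduces to X⁻¹.

```python
import numpy as np

# Two-channel full factorial: design matrix X from Table A1,
# Q = (X^T X)^{-1} X^T, and beta_hat = Q Y.

X = np.array([
    [1, 0, 0, 0],   # Y00: both channels off
    [1, 1, 0, 0],   # Y10: A on, B off
    [1, 0, 1, 0],   # Y01: A off, B on
    [1, 1, 1, 1],   # Y11: both on (interaction column active)
], dtype=float)

Q = np.linalg.inv(X.T @ X) @ X.T            # rows estimate beta_0..beta_12
Y = np.array([100.0, 130.0, 120.0, 170.0])  # made-up conversion counts
beta = Q @ Y

print(Q)
print(beta)  # beta_0 = Y00, beta_12 = Y00 - Y10 - Y01 + Y11
```

The rows of the printed Q match Table A2, and β̂ recovers the main effects and the interaction exactly because the design is saturated.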

Now consider our estimate for the total number of incremental conversions:

∆C = Y11 − Y00 = β1 + β2 + β12

3 The error terms aren't particularly interesting here, as the rank of the model equals the number of observations, so estimates of the errors can't be derived from the model, although they could be obtained from external sources (i.e., estimated from a simulator directly). We will assume that the errors are negligible.



        Y00  Y10  Y01  Y11
β0        1    0    0    0
β1       -1    1    0    0
β2       -1    0    1    0
β12       1   -1   -1    1

Table A2: Q matrix for two channels.

and the estimates of the marginal impact of each channel (e.g., the impact of the target channel when the other channel is on):

∆C(A; B on) = Y11 − Y01 = β1 + β12

∆C(B; A on) = Y11 − Y10 = β2 + β12.

Clearly, the source of the inconsistency between the marginal impact estimates and the desire that these estimates sum to the total ad-driven conversions is the double counting of the synergistic, or interaction, effects between the two channels A and B (i.e., β12). A reasonable solution is to give each channel equal credit for the interaction effects. Hence,

∆A = β1 + β12/2

∆B = β2 + β12/2

which gives the desired property that ∆A + ∆B = ∆C. These adjusted credits can be easily calculated as linear combinations of the Y's:

∆A ≈ (Q2. + Q4./2)Y
   = (−0.5, +0.5, −0.5, +0.5)Y
   = 0.5(Y10 − Y00) + 0.5(Y11 − Y01)

∆B ≈ (Q3. + Q4./2)Y
   = (−0.5, −0.5, +0.5, +0.5)Y
   = 0.5(Y01 − Y00) + 0.5(Y11 − Y10).

So now the estimates are the average of the effect of turning a channel on when the other channel is on and when it is off. Referring to Figure A1, for channel A this is the average effect of moving from the left side of the square to the right side, and for channel B, from bottom to top.
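The split-credit contrasts can be checked numerically. The sketch below uses made-up outcomes Y = (Y00, Y10, Y01, Y11) and verifies that the two adjusted credits sum to the total incremental conversions ∆C = Y11 − Y00.

```python
import numpy as np

# Split-credit adjustment for two channels: each channel's credit is its
# main effect plus half the interaction, expressed as a contrast on Y.

Y = np.array([100.0, 130.0, 120.0, 170.0])  # (Y00, Y10, Y01, Y11), made up

c_A = np.array([-0.5, 0.5, -0.5, 0.5])   # contrast Q2. + Q4./2
c_B = np.array([-0.5, -0.5, 0.5, 0.5])   # contrast Q3. + Q4./2

delta_A = c_A @ Y
delta_B = c_B @ Y
delta_C = Y[3] - Y[0]                    # total incremental conversions

print(delta_A, delta_B, delta_C)         # the two credits sum to the total
```

With these numbers the interaction β12 = 20 is split evenly, so each channel receives its main effect plus 10 extra conversions of credit.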

A.2 Model for Three Channels

          β0  β1  β2  β3  β12  β13  β23  β123
Y000       1   0   0   0    0    0    0     0
Y100       1   1   0   0    0    0    0     0
Y010       1   0   1   0    0    0    0     0
Y110       1   1   1   0    1    0    0     0
Y001       1   0   0   1    0    0    0     0
Y101       1   1   0   1    0    1    0     0
Y011       1   0   1   1    0    0    1     0
Y111       1   1   1   1    1    1    1     1

Table A3: Full factorial design matrix, X, for three channels.

The model for two channels extends easily to three channels. Here we denote the three channels as A, B, and D4 and now use subscript k for channel D, with 0 again indicating the channel is off and 1 indicating it is on. The model is

Yijk = β0 + β1 1i(1) + β2 1j(1) + β3 1k(1)
     + β12 1i(1)1j(1) + β13 1i(1)1k(1)
     + β23 1j(1)1k(1) + β123 1i(1)1j(1)1k(1)
     + εijk   (7)

Again, this can be written in matrix notation as Y = Xβ, where the X matrix is given in Table A3. A visualization of the design is given in Figure A2, and, as in the two channel case, β̂ = (X^T X)^{-1} X^T Y = QY, where the Q matrix is given in Table A4.

Figure A2: Full factorial experiment for three channels A, B and D with resulting outcomes Yijk

4 We use D so as not to confuse C with conversions.

          Y000  Y100  Y010  Y110  Y001  Y101  Y011  Y111
β0           1     0     0     0     0     0     0     0
β1          -1     1     0     0     0     0     0     0
β2          -1     0     1     0     0     0     0     0
β3          -1     0     0     0     1     0     0     0
β12          1    -1    -1     1     0     0     0     0
β13          1    -1     0     0    -1     1     0     0
β23          1     0    -1     0    -1     0     1     0
β123        -1     1     1    -1     1    -1    -1     1

Table A4: Q matrix for three channels.

The estimate for the total number of conversions is

∆C = Y111 − Y000 = β1 + β2 + β3 + β12 + β13 + β23 + β123

and for the marginal impact of each channel (i.e., the impact of the target channel when the other channels are on) it is

∆C(A; B and D on) = Y111 − Y011 = β1 + β12 + β13 + β123

∆C(B; A and D on) = Y111 − Y101 = β2 + β12 + β23 + β123

∆C(D; A and B on) = Y111 − Y110 = β3 + β13 + β23 + β123.

The two-way interaction terms are double counted, while the three-way interaction term is triple counted. Splitting credit across interacting channels gives our adjusted credits

∆A = β1 + β12/2 + β13/2 + β123/3 ≈ (Q2. + Q5./2 + Q6./2 + Q8./3)Y

∆B = β2 + β12/2 + β23/2 + β123/3 ≈ (Q3. + Q5./2 + Q7./2 + Q8./3)Y

∆D = β3 + β13/2 + β23/2 + β123/3 ≈ (Q4. + Q6./2 + Q7./2 + Q8./3)Y.

We get the desired property that ∆A + ∆B + ∆D = ∆C. The detailed estimating coefficients are given in Table A5. All of these estimates are weighted contrasts between when a channel is on vs. off. For the three channel case, turning on a channel when all other channels are off, or when all are on, is weighted twice as much as when only one other channel is on.

       Y000  Y100  Y010  Y110  Y001  Y101  Y011  Y111
∆A     -1/3   1/3  -1/6   1/6  -1/6   1/6  -1/3   1/3
∆B     -1/3  -1/6   1/3   1/6  -1/6  -1/3   1/6   1/3
∆D     -1/3  -1/6  -1/6  -1/3   1/3   1/6   1/6   1/3

Table A5: Estimating coefficients for estimating adjusted (additive) credit for three channels.
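The coefficients in Table A5 can be reproduced by combining rows of Q as described above. A sketch (no outcome data needed, since the contrasts depend only on the design):

```python
import numpy as np

# Reproduce Table A5: build the full factorial design matrix X of model (7)
# in the row order of Table A3, take Q = X^{-1} (X is square and full rank),
# and combine rows of Q with the de-duping weights,
# e.g. credit_A = Q2. + Q5./2 + Q6./2 + Q8./3.

# Row order matches Table A3: Y000, Y100, Y010, Y110, Y001, Y101, Y011, Y111.
order = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
         (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

X = np.array([[1, i, j, k, i*j, i*k, j*k, i*j*k] for (i, j, k) in order],
             dtype=float)

Q = np.linalg.inv(X)  # rows correspond to beta_0, ..., beta_123 (Table A4)

credit_A = Q[1] + Q[4]/2 + Q[5]/2 + Q[7]/3
credit_B = Q[2] + Q[4]/2 + Q[6]/2 + Q[7]/3
credit_D = Q[3] + Q[5]/2 + Q[6]/2 + Q[7]/3

print(credit_A)  # the Delta_A row of Table A5
```

Summing the three contrast vectors gives (−1, 0, …, 0, 1), confirming that the adjusted credits always add up to ∆C = Y111 − Y000.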

A.3 Model for General Number of Channels

The model is extended to the general case with p channels. Now the indices denote channels rather than levels of channels, as in the examples above. The full factorial design can be represented by the vertices of a p-dimensional hypercube, each vertex representing a combination of channels on and off. Hence, we have n = 2^p simulation results. Let the n-length column vector xi represent the on/off settings (0/1) for channel i. Further, define the n-length column vectors xij = xi xj, xijk = xi xj xk, . . ., x12...p = x1 x2 . . . xp. The model,

Yij...p = β0 + ∑_i βi xi + ∑_{i<j} βij xij + ∑_i ∑_{j<i} ∑_{k<j} βijk xijk + . . . + β12...p x12...p + ε12...p,

can again be represented as Y = Xβ, where we can compute Q = (X^T X)^{-1} X^T to get the linear combinations of the Y's that estimate the β's. The estimate of the total number of incremental conversions is

∆C = ∑_i βi + ∑_{i<j} βij + ∑_i ∑_{j<i} ∑_{k<j} βijk + . . . + β12...p,

while the marginal impact of channel i (turning the ith channel on when all others are already on) is

∆(i; rest on) = βi + ∑_{j≠i} βij + ∑_{k<j, k≠i} βijk + . . . + β12...p.



All two-way interactions are double counted, three-way interactions triple counted, . . ., and the p-way interaction is counted p times. Hence the adjusted credit for channel i is

∆(i) = βi + ∑_{j≠i} βij/2 + ∑_{k<j; k≠i} βijk/3 + . . . + β12...p/p.

These can be easily computed by taking the appropriate rows of the Q matrix, dividing each row by the appropriate de-duping factor, and then adding to get the contrast that calculates ∆(i).
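The general computation can be sketched as follows. The function name and interface are our own, not from the paper, and it assumes all 2^p on/off outcomes Y are available (e.g., from a simulator); it builds the full factorial design with every interaction column, forms Q, and sums the rows of Q for each interaction containing channel i, divided by the interaction's order (the de-duping factor).

```python
import numpy as np
from itertools import combinations, product

def adjusted_credits(Y, p):
    """Adjusted credit Delta(i) for each of p channels.

    Y is ordered to match `grid`: all on/off settings of the p channels in
    lexicographic order, e.g. for p = 2: (0,0), (0,1), (1,0), (1,1).
    """
    grid = list(product([0, 1], repeat=p))
    terms = [()]                                  # intercept term
    for r in range(1, p + 1):
        terms += list(combinations(range(p), r))  # all interaction terms
    X = np.array([[np.prod([row[i] for i in t]) if t else 1.0 for t in terms]
                  for row in grid])
    Q = np.linalg.inv(X)                          # X is square, full rank
    credits = []
    for i in range(p):
        contrast = sum(Q[idx] / len(t) for idx, t in enumerate(terms)
                       if t and i in t)           # de-dupe by order of t
        credits.append(contrast @ Y)
    return np.array(credits)

# Made-up outcomes for p = 2, ordered (Y00, Y01, Y10, Y11) per `grid`.
Y = np.array([100.0, 120.0, 130.0, 170.0])
print(adjusted_credits(Y, 2))
```

By construction, summing the per-channel contrasts counts each interaction row exactly once, so the credits always add up to the all-on minus all-off difference in conversions.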

References

Analytics, G. (2017). Google Analytics help center: About the default attribution models. Technical report, Google Inc. https://support.google.com/analytics/answer/1665189.

Analytics, G. (2018). Google attribution capabilities. Technical report, Google Inc. https://www.google.com/analytics/attribution/capabilities/.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331.

Sapp, S. and Vaver, J. (2016). Toward improving digital attribution model accuracy. Technical report, Google Inc. https://research.google.com/pubs/pub45766.html.

Sapp, S., Vaver, J., Shi, M., and Bathia, N. (2016). DASS: Digital advertising system simulation. Technical report, Google Inc. https://research.google.com/pubs/pub45331.html.

Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28):307–317.

Singh, K. and Vaver, J. (2017). Attribution model evaluation. Technical report, Google Inc.

Splawa-Neyman, J., Dabrowska, D., Speed, T., et al. (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–472.
