HOW DOES MOBILE APP FAILURE AFFECT PURCHASES IN ONLINE AND OFFLINE CHANNELS?
Unnati Narang Venkatesh Shankar Sridhar Narayanan
December 2020
* Unnati Narang ([email protected]) is Assistant Professor of Marketing, University of Illinois, Urbana-Champaign; Venkatesh Shankar ([email protected]) is Professor of Marketing and Coleman Chair in Marketing and Director of Research, Center for Retailing Studies at the Mays Business School, Texas A&M University; and Sridhar Narayanan ([email protected]) is Associate Professor of Marketing at the Graduate School of Business, Stanford University. We thank the participants at the ISMS Marketing Science conference, the UTDFORMS conference, and research seminar participants at the University of California, Davis, the University of Toronto, the University of Illinois, Urbana-Champaign, and the University of Texas at Austin for valuable comments.
Abstract Mobile devices account for a majority of transactions between shoppers and marketers. Branded retailer mobile apps have been shown to significantly increase purchases across channels. However, app service failures can lead to decreases in app usage, making app failure prevention and recovery critical for retailers. Does an app failure influence purchases in general and within the online channel in particular? Does it have any spillover effects across other channels? What potential mechanisms explain and what factors moderate these effects? We examine these questions empirically, employing a unique dataset from an omnichannel retailer. We leverage a natural experiment of exogenous systemwide failure shocks in this retailer’s mobile app and related data to examine the causal impact of app failures on purchases in all channels using a difference-in-differences approach. We investigate two potential mechanisms behind these effects – channel substitution and brand preference dilution. We also analyze shopper heterogeneity in the effects using a theoretically-driven moderator approach as well as a data-driven machine learning method. Our analysis reveals that although an app failure has a significant overall negative effect on shoppers’ frequency, quantity, and monetary value of purchases across channels, the effects are heterogeneous across channels and shoppers. Interestingly, the decreases in purchases across channels are driven by purchase reductions in brick-and-mortar stores and not in the online channel. A significant decrease in app engagement post failure explains the overall drop in purchases. Brand preference dilution after app failure explains the fall in store purchases, while channel substitution post failure explains the preservation of purchases in the online channel. Surprisingly, purchases rise for a small group of shoppers who were close to the retailer’s store at the time of app failure. 
Furthermore, shoppers with a higher monetary value of past purchases and less recent purchases are less sensitive to app failures. The results suggest that app failures lead to an annual revenue loss of about $2.4-$3.4 million for the retailer in our data. About 47% of shoppers contribute about 70% of the loss. We outline targeted failure prevention and service recovery strategies that retailers could employ.
Keywords: service failure, mobile marketing, mobile app, retailing, omnichannel, difference-in-differences, natural experiment, causal effects
INTRODUCTION
Mobile commerce has seen tremendous growth with mobile devices accounting for a majority of
interactions between shoppers and marketers. This growth has accelerated through the rapid
increase of smartphone penetration – about 3.2 billion people (41.5% of the global population)
used smartphones in 2019.1 Mobile applications (henceforth, apps) have emerged as an important
channel for retailers as they have been found to increase engagement and purchases across
channels (e.g., Kim et al. 2015; Narang and Shankar 2019; Xu et al. 2016).
While retailers have widely embraced mobile apps, there is little understanding about how
service failures in this channel affect shopper behavior. This issue is important because unlike
other channels, the mobile channel is highly vulnerable to failures. The diversity of mobile
operating systems (e.g., iOS, Android), devices (e.g., mobile phone and tablet), and versions of
hardware and software and their constant use across a variety of mobile networks often result in
app failures. Failures in a retailer’s mobile app have the potential to negatively affect shoppers’
engagement with the app and their shopping outcomes within the mobile channel. In addition,
app failures may have spillover effects across other channels due to both substitution of
purchases across channels and dilution of preference for the retailer brand. Understanding how
and why failures impact shoppers’ behavior across channels is important for retailers.
Preventing and recovering from app failures is critical for managers because more than 60%
of shoppers abandon an app after experiencing failure(s) (Dimensional Research 2015). In 2016,
app crashes were the leading cause of system failures, contributing 65% to all iOS failures
(Blancco 2016). About 2.6% of all app sessions result in a crash, suggesting about 1.5 billion app
failures across 60 billion app sessions annually (Computerworld 2014). Given the extent of these
1 Source: Statista report on smartphone penetration (https://tinyurl.com/hy2skfk) last accessed 18 November 2020.
app failures and their potential damage to firms’ relationships with customers, determining the
impact of app failures is important for formulating preventive and recovery strategies.
Despite the importance of app failures, not much is known about their impact on purchases.
While app crashes in a shopper’s mobile device have been shown to negatively influence app
engagement (e.g., restart time, browsing duration, and activity level, Shi et al. 2017), the
relationship between app failures and subsequent purchases has not been studied. Furthermore, a
large proportion of shoppers use both online (desktop website, mobile website, and mobile app)
and offline (brick-and-mortar) retail channels. However, we do not know much about the impact
of app failures on shopping outcomes across channels (spillover effects).
From a theoretical standpoint, the potential mechanisms behind such effects within and
across channels are important. How much of these effects arise due to channel substitution post
failure? What portion of the effects can be attributed to dilution of preference for the retailer’s
brand? Prior research has not addressed these interesting questions.
The effects of app failure may also differ across shoppers. Shoppers may be more or less
negatively impacted by failures depending on factors such as shoppers’ relationship with the firm
(Chandrashekaran et al. 2007; Goodman et al. 1995, Hess et al. 2003; Knox and van Oest 2014;
Ma et al. 2015) and shoppers’ prior use of the firm’s digital channels (Cleeren et al. 2013; Liu
and Shankar 2015; Shi et al. 2017). It is important for managers to better understand how the
effects of failure vary across shoppers so that they can devise targeted preventive and recovery
strategies. Yet not much is known about heterogeneity in the effects of app failure.
Our study fills these crucial gaps in the literature. We quantify and explain the impact of app
failures on managerially important outcomes, such as the frequency, quantity, and monetary
value of purchases in online and offline channels. We address four research questions:
• What are the effects of a service failure in a retailer’s mobile app on the frequency, quantity, and monetary value of subsequent purchases by the shoppers?
• What are the effects of a service failure in an app on purchases in the online and offline channels?
• What potential mechanisms explain the effects of an app service failure on purchases?
• How do these effects vary across shoppers, or what factors moderate these effects?
Estimation of the causal effects of app failures on shopping outcomes is challenging. It is
typically hard to do this using observational data due to the potential endogeneity of app failures.
This endogeneity may stem from an activity bias in that shoppers who use the app more
frequently are also more likely to experience failures than other shoppers. Therefore, failure-
experiencing shoppers may differ systematically from non-failure experiencers in their shopping
behavior, leading to potentially spurious correlations between failures and shopping behavior.
Panel data may not necessarily mitigate this issue because time-varying app usage/shopping
activity is potentially correlated with time-varying app failures for the same reason. That is,
shoppers are likely to engage more with the app when they are likely to purchase, potentially
leading to more failures than in periods when shoppers engage less with the app. Additionally,
the nature of activity on the app may be correlated with failures. For instance, a negative
correlation between failures and purchases may result from a greater incidence of failures on the
app’s purchase page than on other pages. Thus, it is hard to make the case that correlations
between app failures and shopping outcomes in observational data have a causal interpretation.
The gold standard among the methods available to uncover the causal impact of service
failures is a randomized field experiment. However, such an experiment would be impractical in
this context because a retailer will unlikely deliberately induce failures in an app even for a small
subset of its shoppers for ethical reasons. Alternatively, we can use an instrumental variable
approach to control for endogeneity. However, it is hard to come up with instrumental variables
that are valid and exhibit sufficient variation to address the endogeneity concerns in this context.
We overcome the estimation challenges and mitigate the potential endogeneity of app
failures using the novel features of a unique dataset from a large omnichannel retailer of video
games, consumer electronics and wireless service. We exploit a natural experiment of server
error-induced systemwide exogenous failures in the retailer’s mobile app to estimate the causal
effects of app failure. Conditional on signing in on the day of the failure, whether a user
experienced a failure or not was a function of whether they attempted to use the app during the
time window of the failure, which they could not have anticipated in advance. We take
advantage of the resulting quasi-randomness in incidences of failures to estimate the causal
effects of failures on the mobile app. We employ a difference-in-differences (DID) approach that
compares the pre- and post- failure outcomes for the failure experiencers with those of failure
non-experiencers to estimate the effects of the app failure. Through a series of robustness checks,
we confirm that failure non-experiencers act as a valid control for failure experiencers, providing
us the exogenous variation to find causal answers to our research questions.
We investigate the potential mechanisms and moderators of the effects of failures on
shopping behavior by exploiting the panel nature of our dataset. We test for the moderating
effects of factors such as relationship with the firm and prior digital channel use on the effects of
service failures. These factors have been explored for services in general (e.g., Hansen et al.
2018; Ma et al. 2015) but not in the digital or mobile app contexts. In addition, we recover the
heterogeneity of effects at the individual level using data-driven machine learning methods.
Our results show that app failures have a significant overall negative effect on shoppers’
frequency, quantity, and monetary value of purchases across channels, but the effects are
heterogeneous across channels and shoppers. A significant decrease in app engagement (e.g.,
number of app sessions, dwell time, and number of app features used) post failure explains the
overall drop in purchases. Interestingly, the overall decreases in purchases across channels are
driven by purchase reductions in stores, rather than in the online channel. The fall in store
purchases after app failure is consistent with brand preference dilution, while the preservation of
purchases in the online channel is consistent with channel substitution. Shoppers experiencing
the failure when they are farther away from purchase (e.g., browsing product information)
experience greater negative effects of a failure than those closer to purchase (e.g., checking out
in the app). Surprisingly, the basket size and value of purchases rise for a small group of
shoppers who were close to the retailer’s store at the time of app failure. Furthermore, shoppers
with a higher monetary value of past purchases and less recent purchases are less sensitive to app
failures. Finally, most shoppers (96%) react negatively to failures, but about 47% of these
shoppers contribute to about 70% of the losses in annual revenues that amount to $2.4-$3.4
million.
In the remainder of the paper, we first discuss the literature related to service failures, cross-
channel spillovers, and consumer interaction with mobile apps. Next, we discuss the data in
detail, summarizing them and highlighting their unique features. Subsequently, we describe our
empirical strategy, lay out and test the key identification strategy, and conduct our empirical
analysis of the effects of app failures. We explore the potential mechanisms behind the results.
We then conduct robustness checks to rule out alternative explanations. We conclude by
discussing the implications of our results for managers.
BACKGROUND AND RELATED LITERATURE
Services Marketing and Service Failures
The nature of services has evolved considerably since academics first started to study services
marketing. For long, the production and consumption of services remained inseparable primarily
because services were performed by humans. However, of late, technology-enabled services
have risen in importance, leading to two important shifts (Dotzel et al. 2013). First, services that
can be delivered without human or interpersonal interaction have grown tremendously. Online
and mobile retailing no longer require shoppers to interact with human associates to make
purchases. Second, closely related to this idea is the fact that services are increasingly powered
by technologies such as mobile apps that allow anytime-anywhere access and convenience.
With growing reliance on technologies for service delivery and the complexity of the
technology environment in which these services are delivered, service failures are attracting
greater attention. A service failure can be defined as service performance that falls below
customer expectations (Hoffman and Bateson 1997). Service failures are widespread and are
expensive to mend. Service failures resulting from deviations between expected and actual
performance damage customer satisfaction and brand preference (Smith and Bolton 1998). Post-
failure satisfaction tends to be lower even after a successful recovery and is further negatively
impacted by the severity of the initial failure (Andreassen 1999; McCollough et al. 2000). In
interpersonal service encounters, human interactions and employee behaviors influence both
failure effect and recovery (Bitner et al. 1990; Meuter et al. 2000). In technology-based
encounters, such as those in e-tailing and with self-service technologies (e.g., automated teller
machines [ATMs]), the opportunity for human interaction is typically small after experiencing
failure (Forbes et al. 2005; Forbes 2008). However, there may be significant heterogeneity in
how consumers react to service failures (Halbheer et al. 2018).
In the mobile context, specifically for mobile apps, it is difficult to predict the direction and
extent of the impact of a service failure on shopping outcomes. First, mobile apps are accessible
at any time and in any location through an individual’s mobile device. On the one hand, because
a shopper can tap, interact, engage, or transact multiple times at little additional cost on a mobile
app, the shopper may treat any one service failure as acceptable without significantly altering her
subsequent shopping outcomes. Such an experience differs from that with a self-service
technological device such as an ATM, which may need the shopper to travel to a specific
location or incur other hassle costs that may not exist in the mobile app context. On the other
hand, the costs of switching to a competitor are also much lower in the mobile app context,
where a typical shopper uses and compares multiple apps. Thus, a service failure in any one app
may aggravate the shopper’s frustration with the app, leading to strong negative effects on
outcomes such as purchases from the relevant app provider.
Second, a mobile app is one of the many touchpoints available to shoppers in today’s
omnichannel shopping environment. Thus, a shopper who experiences a failure in the app could
move to the web-based channel or even the offline or store channel. In such cases, the impact of
a failure on the app could be zero or even positive (if the switch to the other channel leads to
greater engagement of the shopper with the retailer). By contrast, if the channels act as
complements (e.g., if the shopper uses one channel for researching products and another for
purchasing) or if the failure impacts the preference for retailer brand, a failure in one channel
could impede the shopper’s engagement in other channels. Thus, it is difficult to predict the
effects of app failure, in particular, about how they might spill over to other channels.
Channel Choice and Channel Migration
A shopper’s experience in one channel can influence her behavior in other channels. Prior
research on cross-channel effects is mixed, showing both substitution and complementarity
effects, leading to positive and negative synergies between channels (e.g., Avery et al. 2012;
Pauwels and Neslin 2015). The relative benefits of channels determine whether shoppers
continue using existing channels or switch to a new channel (Ansari et al. 2008; Chintagunta et
al. 2012). When a bricks-and-clicks retailer opens an offline store or an online-first retailer opens
an offline showroom, its offline presence drives sales in online stores (Bell et al. 2018; Wang and
Goldfarb 2017).2 This is particularly true for shoppers in areas with low brand presence prior to
store opening and for shoppers with an acute need for the product. However, the local shoppers
may switch from purchasing online to offline after an offline store opens, even becoming less
sensitive to online discounts (Forman et al. 2009). In the long run, the store channel shares a
complementary relationship with the Internet and catalog channels (Avery et al. 2012).
While the relative benefits of one channel may lead shoppers to buy more in other channels,
the costs associated with one channel may also have implications for purchases beyond that
channel. In a truly integrated omnichannel retailing environment, the distinctions between
physical and online channels blur, with the online channel representing a showroom without
walls (Brynjolfsson et al. 2013). Mobile technologies are at the forefront of these shifts. More
than 80% of shoppers use a mobile device while shopping even inside a store (Google M/A/R/C
Study 2013). As a result, if there are substantial costs associated with using a mobile channel
(e.g., those induced by app failures), such costs may spill over to other channels. If shoppers use
the different channels in complementary ways, the disruption of one of those channels could
negatively impact their engagement with the other channels as well. However, if shoppers treat
the channels as substitutes, failures in one channel may drive the shoppers to purchase in another
channel. If an app failure dilutes shoppers’ preference for the retailer brand, it may lead to
negative consequences across channels. Overall, the direction of the effect of app failures on
2 A bricks-and-clicks retailer is a retailer with both offline (“bricks”) and online (“clicks”) presence.
outcomes in other channels such as in brick-and-mortar stores and online channels depends on
which of these competing and potentially co-existing mechanisms is dominant.
Mobile Apps
The nascent but evolving research in mobile apps shows positive effects of mobile app channel
introduction and use on engagement and purchases in other channels (Kim et al. 2015; Narang
and Shankar 2019; Xu et al. 2016) and for coupon redemptions (Andrews et al. 2015; Fong et al.
2015; Ghose et al. 2019) under different contingencies.
To our knowledge, only one study has examined crashes in a mobile app on shoppers’ app
use. Shi et al. (2017) find that while crashes have a negative impact on future engagement with
the app, this effect is lower for those with greater prior usage experience and for less persistent
crashes. However, while they look at subsequent engagement of the shoppers with the mobile
app, they do not examine purchases. Thus, our research adds to Shi et al. (2017) in several ways.
First, we focus on estimating the causal effects of failure. To this end, we exploit the random
variation in failures induced by systemwide failures. Second, we quantify the value of app
failure’s effects on subsequent purchases. The outcomes we study include the frequency,
quantity, and value of purchases, while the key outcome in that study is app engagement. Third,
we examine the cross-channel effects of mobile app failures, including in physical stores, while
Shi et al. (2017) study subsequent engagement with the app provider only within the app.
Finally, we explore the mechanisms behind the effects of failure, and examine the moderating
effects of relationship with the retailer and prior digital channel use, as well as heterogeneity in
shoppers' sensitivity to failures, using a machine learning approach.
To summarize, our study (1) focuses on the effect of app failure on purchases, (2) quantifies
the effects on multiple outcomes such as frequency, quantity, and monetary value of purchases,
(3) addresses the outcomes in each channel and across all channels (substitution and
complementary effects), and (4) uncovers the mechanisms behind and moderators of the effects
of app failure on shopping outcomes and heterogeneity in effects across shoppers. All these
characteristics are novel, contributing to the research streams on service marketing, channel
choice, and mobile apps.
RESEARCH SETTING AND DATA
Research Setting
We obtained the dataset for our empirical analysis from a large U.S.-based retailer. In the
following paragraphs, we describe the retailer, the mobile app, and the channel sales mix.
The retailer sells a variety of products, including software such as video games and hardware
such as video game consoles and controllers, downloadable content, consumer electronics, and
wireless services, and serves 32 million customers. The gaming industry is large ($99.6 billion in
annual revenues), and the retailer is a major player in this industry, offering us a rich setting. The
retailer has a large offline presence, and in this respect, is similar to Walmart, PetSmart, or any
other brick-and-mortar chain with an omnichannel strategy. The retailer’s primary channel is its
store network comprising 4,175 brick-and-mortar stores across the U.S. Additionally, it has a
large ecommerce website, and the mobile app that is the focus of our study.
The app allows shoppers to browse the retailer’s product catalog, get deals, order online
through a mobile browser, locate nearby stores, as well as make purchases through the app itself.
The app is typical of mobile apps of large retailers (e.g., PetSmart, Costco) in features and
consumer interactions. The growth in the adoption of the app has also been similar to that of
many large retailers. App adoption rate started small and grew over time. Figure 1 shows some
screenshots from the app.
Figure 1
APP SCREENSHOTS
The online and offline channel sales mix of the retailer in our data is typical of most large
retailers. About 76% of the total sales for the top 100 largest retailers in the U.S. are from similar
retailers with a store network of 1,000 or more stores (National Retail Federation 2018). Most
large retailers have a predominant brick-and-mortar presence. For these retailers, while most of
the transactions and revenues come from the offline channels, online sales exhibit rapid growth.
For example, Walmart’s online revenues constitute 3.8% of all revenues, 1.3% of all PetSmart’s
sales come from the online channel, Home Depot generates 6.8% of all revenues from
ecommerce, and 5.4% of Target’s sales are through the online channel.3 For the retailer in our
data, online sales comprised 10.2% of overall revenues, somewhat higher than that for similar
large retailers. Furthermore, about 26% of the shoppers bought online in the 12 months before
the failure event we study. The retailer’s online sales displayed a 13% annual average growth in
the last five years, similar to these retailers who also exhibited double digit growth (Barron’s
3 Source: eMarketer Retail, https://retail-index.emarketer.com/
2018). Its annual online sales revenues are also substantial at $1.1 billion. Therefore, our
research context offers a rich setting to examine cross-channel effects of a mobile app failure.
Data and Sample
We study the impact of a systemwide failure that occurred on April 11, 2018.4 The firm provided
us with mobile app use data and transactional data across all channels for all the app users who
logged into the app on the failure day. The online channel represents purchases at the retailer’s
website, including those using the mobile browser. Nested within the app use data are data on
events that shoppers experience, along with their timestamps. The mobile dataset recorded the
app failure event as ‘server error.’ Thus, this event represents an exogenous app breakdown, and
the data allow us to identify shoppers who logged in to experience the systemwide app failure.
Table 1 provides the descriptive statistics for the variables of interest. Over a period of 14
days pre- and post- failure, shoppers make an average of a little less than one purchase
comprising about 1.6 items for a value of about $43. In the 12 months before failure, shoppers
make purchases worth $623 and on average, buy .66 times in the online channel. Overall, 52% of
the shoppers experience the failure during our focal failure event.
Table 1
SUMMARY STATISTICS

Variable                                Mean      Std. dev.
Frequency of purchases                   .82        1.34
Quantity of purchases                   1.61        3.32
Value of purchases ($)                 43.31       96.42
App failure/Failure experiencer          .52         .50
Recency of past purchases (in days)   -45.68       68.83
Value of past purchases ($)           629.60      699.38
Frequency of past online purchases       .66        1.97

Notes: These statistics are computed over the 14 days pre- and post-failure. Past purchases are computed over a one-year period. N = 273,378.
4 We verified that this failure was systemwide and exogenous through our conversations with company executives.
EMPIRICAL STRATEGY
Overall Empirical Strategy
As outlined earlier, we leverage the exogenous systemwide shock to estimate the causal effect of
app failure on shopping outcomes. The main idea behind our empirical approach is that
conditional on the attempted usage of the app on the day of the failure, the experience of a failure
by a specific shopper is random. We examine this assumption in the data by testing for balance
between shoppers who experience a failure and those who do not, using a set of pre-failure
variables. We find no systematic difference in these variables between shoppers who
experienced failures and those who did not, supporting our identification strategy. To determine
the treatment effect of a failure, we conduct a DID analysis, comparing the post-failure behaviors
with the pre-failure behaviors of shoppers who logged in on the day of the failure and
experienced it (akin to a treatment group) relative to those who logged in on that day but did not
experience the failure (akin to a control group).
To analyze the treatment effects within and across channels, we repeat this analysis with the
same outcome variables separately for the offline and online channel. To understand the
underlying mechanisms for the effects, we examine two explanations, brand preference dilution
and channel substitution, using the data on shoppers’ app engagement, closeness to purchase,
location at the time of failure, time to next purchase, and shipping costs to check for consistency
with these mechanisms. To analyze heterogeneity in treatment effects, we first perform a
moderator analysis using a priori factors identified in the literature such as prior relationship
strength and digital channel use, followed by a data driven machine learning (causal forest)
approach to fully explore all sources of heterogeneity across shoppers. Finally, we carry out
multiple robustness checks.
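The moderator analysis can be read as running the same two-by-two DID comparison separately within subgroups defined by a moderator (e.g., high vs. low past purchase value) and comparing the resulting effects. A minimal standard-library sketch, using invented numbers rather than the paper's data:

```python
from statistics import mean
from collections import defaultdict

# Each record: (failure_experiencer, post_period, segment, purchase_value).
# Segments split shoppers by a hypothetical moderator, e.g. past purchase value.
records = [
    (1, 0, "high_value", 60), (1, 1, "high_value", 58),
    (0, 0, "high_value", 59), (0, 1, "high_value", 60),
    (1, 0, "low_value", 40),  (1, 1, "low_value", 28),
    (0, 0, "low_value", 41),  (0, 1, "low_value", 42),
]

def did_by_segment(recs):
    """Compute the 2x2 DID separately within each moderator segment."""
    cells = defaultdict(list)
    for f, p, seg, y in recs:
        cells[(seg, f, p)].append(y)
    effects = {}
    for seg in sorted({seg for seg, _, _ in cells}):
        treated_change = mean(cells[(seg, 1, 1)]) - mean(cells[(seg, 1, 0)])
        control_change = mean(cells[(seg, 0, 1)]) - mean(cells[(seg, 0, 0)])
        effects[seg] = treated_change - control_change
    return effects

print(did_by_segment(records))  # high-value: -3, low-value: -13
```

In the invented numbers above, the negative effect is smaller for the high-value segment, mirroring the pattern described later in the paper that shoppers with a higher monetary value of past purchases are less sensitive to failures.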
Exogeneity of Failure Shock
To verify that there is no systematic difference between shoppers who experience the failure
shock and those who do not, we examine two types of evidence. First, we present plots of the
behavioral trends in shopping for both failure-experiencers and non-experiencers for the failure
shock in the 14 days before the app failure. Figure 2 depicts the monetary value of daily
purchases by those who experienced the failure and those who did not. The purchase trends in
the pre-period are parallel for the two groups (p > .10), providing us assurance that these
shoppers do not systematically differ across the two groups. The trends are similar for the
frequency and quantity of purchases, and the proportion of online purchases (see Web Appendix
Figure D1).
Figure 2 COMPARISON OF FAILURE-EXPERIENCERS’ AND NON-EXPERIENCERS’ PURCHASES 14 DAYS
BEFORE FAILURE
Note: The red line represents failure experiencers, while the solid black line represents the failure non-experiencers.
Second, we compare the failure experiencers with non-experiencers across shopping
behaviors, such as recency of purchases and frequency of past online purchases (see Figure 3)
and past app usage sessions (see Figures 4 and 5). We also compare their observed demographic
variables, such as gender and membership in loyalty program. We do not find any significant
differences in these variables across the groups.
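Balance comparisons of this kind are commonly summarized with standardized mean differences (SMD) between the treated and control groups, where values near zero indicate balance. A minimal sketch using only Python's standard library, on hypothetical numbers (not the paper's data):

```python
from statistics import mean, stdev

def std_mean_diff(treated, control):
    """Standardized mean difference: (mean_t - mean_c) / pooled SD.
    |SMD| < 0.1 is a common rule of thumb for acceptable balance."""
    pooled_sd = ((stdev(treated) ** 2 + stdev(control) ** 2) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical pre-failure online purchase counts for two balanced groups.
treated = [0, 1, 0, 2, 1, 0, 1, 1]
control = [1, 0, 0, 2, 1, 1, 0, 1]
print(round(std_mean_diff(treated, control), 3))  # → 0.0, i.e., balanced
```

The same statistic computed on a clearly shifted pair of samples (e.g., one group purchasing a whole unit more on average) would return a large SMD, flagging imbalance.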
Figure 3 COMPARISON OF FAILURE-EXPERIENCERS AND NON-EXPERIENCERS
Note: Loyalty program level represents whether shoppers were enrolled (=1) or not (=0) in an advanced reward program.
Figure 4 PAST DAILY AVERAGE APP SESSION TRENDS OF FAILURE EXPERIENCERS VS. NON-
EXPERIENCERS
Figure 5
PAST DAILY AVERAGE NON-PURCHASE RELATED APP SESSION TRENDS OF FAILURE EXPERIENCERS VS. NON-EXPERIENCERS
Note: Non-purchase-related app sessions involve browsing pages whose actions are farther from purchase, such as browsing products or obtaining store-related information.

[Figure 3 values, treated vs. control: Gender (female) 35.17% vs. 34.49%; Loyalty program 86.49% vs. 84.39%; Recency of past purchase/30: 1.53 vs. 1.51; Past online purchase frequency: .62 vs. .70.]

To summarize, we find no systematic differences between the failure experiencers and those who
do not experience failures in either their trends of outcomes before the failure event, or in other
variables that we observe prior to the event. This pre-trend analysis gives us confidence in the
validity of our empirical strategy.
Econometric Model and Identification
As described in the previous section, we estimate the effects of app failure on shopping outcomes
by relying on a quasi-experimental research design with a DID approach (e.g., Angrist and
Pischke 2009). Specifically, we leverage a systemwide failure shock and compare app users who
experience this shock with those who do not, given that they accessed the app on the day of the
failure.
Our two-period linear DID regression takes the following form:
(1) Y_it = α_0 + α_1 F_i + α_2 P_t + α_3 F_i P_t + ϑ_it

where i indexes shoppers, t indexes the time period (pre- or post-failure), Y is the outcome variable (frequency, quantity, monetary value), F is a dummy variable denoting treatment (1 if shopper i experienced the app failure and 0 otherwise), P is a dummy variable denoting the period (1 for the period after the systemwide app failure and 0 otherwise), α_0 through α_3 are coefficients, and ϑ is an error term. We cluster standard errors at the shopper level, following Bertrand et al. (2004). The coefficient of F_i P_t, i.e., α_3, is the treatment effect of the app failure.5
The assumptions underlying the identification of this treatment effect are: (1) the failure is
random conditional on a shopper logging into the app during the time window of the failure
shock and (2) the change in outcomes for the non-failure experiencing app users is a valid
counterfactual for the change in outcomes that would have been observed for failure-
experiencing app users in the absence of the failure.
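In this two-period, two-group setting, the coefficients of equation (1) can be recovered directly from the four cell means; a minimal Python sketch (the means plugged in are the rounded value-of-purchases cell means reported in Table 2, so the interaction term only approximately matches the regression estimate):

```python
def did_from_means(t_pre, t_post, c_pre, c_post):
    """Two-period DID: the interaction coefficient alpha_3 equals the
    treated group's pre/post change minus the control group's change."""
    return (t_post - t_pre) - (c_post - c_pre)

def did_coefficients(t_pre, t_post, c_pre, c_post):
    """Map the four cell means to the coefficients of equation (1)."""
    return {
        "alpha_0 (intercept)": c_pre,             # control mean, pre-period
        "alpha_1 (F)": t_pre - c_pre,             # treated/control gap, pre
        "alpha_2 (P)": c_post - c_pre,            # control pre/post change
        "alpha_3 (FxP)": did_from_means(t_pre, t_post, c_pre, c_post),
    }

# Rounded value-of-purchases cell means from Table 2 ($):
effect = did_from_means(t_pre=30.41, t_post=55.28, c_pre=30.75, c_post=57.70)
print(round(effect, 2))  # -2.08, close to the -2.18 regression estimate
```

The small gap between this back-of-the-envelope figure and the regression coefficient reflects rounding of the reported means.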
5 Because we analyze the short-term effect of a service failure (14 and 30 days), we do not have an adequate number of observations per shopper post failure for us to estimate shopper fixed effects in our analysis.
EMPIRICAL ANALYSIS RESULTS
Relationship between App Failures and Purchases

We first examine the overall differences in post-failure behaviors between shoppers who
experienced failures and those who did not using model-free evidence 14 days pre and post
failure. We choose a 14-day window because this two-week period is close to the mean interpurchase time of 11 days in our dataset and equally includes any “day of the week” effects in shopping.6
Table 2 reports the raw comparisons of post-failure vs. pre-failure purchase outcome
variables for both failure experiencers (70,568 treated) and non-experiencers (66,121 control)
among the set of consumers who accessed the app on the day of the failure. We find that post-
failure, shoppers who experienced the systemwide failure had .04 (p < .001) lower purchase
frequency, .07 (p < .001) lower purchase quantity, and $2.42 (p < .001) lower monetary value
than shoppers who did not experience the failure. A simple comparison of shopping outcomes
across the two groups shows that the average monetary value of purchases increased by 81.8%
($30.41 to $55.28) for failure-experiencers, while it increased by 87.6% ($30.75 to $57.70) for
non-failure experiencers post failure relative to the pre period (p < .001).7 Given our
identification strategy, the diminished growth in the monetary value of purchases for failure
experiencers relative to non-experiencers comes from the exogenous failure shock.
6 We also estimated a model with dynamic treatment effects for a longer period of four weeks pre- and post- the failure shock and found similar effects (see Figure 6 and Table D1). 7 Increasing sales trend between the pre- and post- period for both the groups is partially due to the April 19 weekend in the post period that witnessed the release of a new game.
Table 2
MODEL-FREE EVIDENCE: MEANS OF OUTCOME VARIABLES FOR TREATED AND CONTROL GROUPS

Variable                            Treated      Treated      Control      Control
                                    pre period   post period  pre period   post period
Frequency of purchases                .74          .89          .75          .93
Quantity of purchases                1.52         1.69         1.52         1.76
Value of purchases ($)              30.41        55.28        30.75        57.70
Frequency of purchases – Online       .03          .04          .03          .04
Quantity of purchases – Online        .05          .06          .05          .07
Value of purchases – Online ($)      1.34         2.93         1.50         3.17
Frequency of purchases – Offline      .70          .85          .71          .88
Quantity of purchases – Offline      1.47         1.63         1.47         1.69
Value of purchases – Offline ($)    29.07        52.35        29.25        54.53

Notes: These statistics are based on the 14 days pre- and post-failure. N = 273,378.
Main Diff-in-Diff Model Results
The results from the DID model in Table 3 show a negative and significant effect of app failure on the frequency (α_3 = -.024, p < .01), quantity (α_3 = -.057, p < .01), and monetary value of purchases (α_3 = -2.181, p < .01) across channels. Relative to the pre-period for the control group, the treated group experiences a decline in frequency of 3.20% (p < .01), quantity of 3.74% (p < .01), and monetary value of 7.1% (p < .01).8
Table 3
DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES ACROSS CHANNELS

Variable                                 Frequency of      Quantity of       Value of
                                         purchases         purchases         purchases
Failure experiencer x Post shock (DID)   -.024** (.008)    -.057** (.020)    -2.181** (.681)
Failure experiencer                      -.021** (.007)    -.030 (.018)      -.694* (.302)
Post shock                               .178*** (.006)    .236*** (.014)    26.947*** (.497)
Intercept                                .750*** (.005)    1.523*** (.012)   30.755*** (.219)
R squared                                .004              .001              .018
Effect size                              -3.20%            -3.74%            -7.09%
Mean Y                                   .82               1.61              43.31

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
8 We calculate the percentage change by dividing the treatment coefficient by the intercept. For instance, the treatment coefficient for value of purchases (2.18) divided by intercept (30.76) amounts to a 7.1% change.
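The effect sizes reported with the DID estimates follow this footnote's rule (treatment coefficient divided by the intercept, i.e., the control group's pre-period mean); a quick Python check using the reported coefficients:

```python
def effect_size_pct(did_coef, intercept):
    """Treatment effect as a percentage of the control group's
    pre-period mean (the intercept), per footnote 8."""
    return 100 * did_coef / intercept

# Reported DID coefficients and intercepts for the three outcomes:
outcomes = {
    "frequency": (-0.024, 0.750),   # Table 3 reports -3.20%
    "quantity":  (-0.057, 1.523),   # Table 3 reports -3.74%
    "value":     (-2.181, 30.755),  # Table 3 reports -7.09%
}
for name, (coef, intercept) in outcomes.items():
    print(name, round(effect_size_pct(coef, intercept), 2))
```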
Next, we examine the channel spillover effects of app failures in greater depth. We split the
total value of purchases into offline and online purchases. Table 4 reports the results for these
alternative channel-based dependent variables. There is a negative and significant effect of app
failure on the frequency (α_3 = -.02, p < .01), quantity (α_3 = -.05, p < .01), and monetary value of purchases (α_3 = -2.09, p < .01) in the offline channel. Interestingly, we do not find a significant
(p > .10) effect of app failure on any of the purchase outcomes in the online channel. Because
there is no corresponding increase in the online channel and because the overall purchases drop,
we conclude that the decreases in overall purchases across channels are largely due to declines in
in-store purchases.
Table 4
DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES BY CHANNEL

                                         Offline                                               Online
Variable                                 Frequency of     Quantity of     Value of             Frequency of    Quantity of     Value of
                                         purchases        purchases       purchases            purchases       purchases       purchases
Failure experiencer x Post shock (DID)   -.022** (.008)   -.055** (.019)  -2.088** (.660)      -.002 (.002)    -.002 (.003)    -.093 (.154)
Failure experiencer                      -.018** (.006)   -.025 (.017)    -.527 (.293)         -.003* (.001)   -.005* (.002)   -.167** (.064)
Post shock                               .170*** (.005)   .221*** (.014)  25.275*** (.482)     .009*** (.001)  .015*** (.002)  1.672*** (.113)
Intercept                                .714*** (.005)   1.470*** (.012) 29.255*** (.213)     .036*** (.001)  .054*** (.002)  1.500*** (.048)
R squared                                .0038            .0001           .0169                .0016           .0002           .0003
Effect size                              -3.08%           -3.74%          -7.14%               -               -               -
Mean Y                                   .78              1.56            41.08                .04             .06             2.23

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
Mechanisms Behind the Effects of Failures on Shopping Outcomes
We now provide descriptive evidence for the potential mechanisms behind the results. The
overall negative effect of app failure on shopping outcomes across channels could be due to
decreases in intermediate outcomes such as shoppers’ engagement after failure. To explore this
possibility, we examine the effect of app failure on app engagement variables such as the number
of app sessions, the average dwell time per session, and the average number of app features used
in each session. The results of the corresponding DID model appear in Table 5. The treatment
effect of failure for each of the three variables is negative and significant (p < .001), suggesting
that app failure is associated with diminished app engagement.
Table 5
DID MODEL RESULTS FOR APP ENGAGEMENT VARIABLES

Variable                                 No. of app        Average dwell       Average no. of
                                         sessions          time per session    app features used
Failure experiencer x Post shock (DID)   -.689*** (.005)   -7.444*** (.072)    -4.678*** (.024)
Failure experiencer                      .651*** (.005)    7.041*** (.065)     4.508*** (.021)
Post shock                               -.624*** (.003)   -3.525*** (.047)    -2.558*** (.016)
Intercept                                .727*** (.003)    4.654*** (.040)     3.067*** (.013)
R squared                                .4211             .1833               .4843
Mean Y                                   .57               4.61                2.91

Notes: Robust standard errors clustered by shoppers are in parentheses; the app engagement variables are measured 5 hours pre- and post-failure; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
The differential effect of app failure across the channels could be explained by the co-
occurrence of two countervailing forces: channel substitution and brand preference dilution. The
channel substitution effect occurs when app failure experiencers move to the mobile web
browser, the desktop website, or the physical store to complete their intended purchase. If
shoppers switch channels to complete their intended purchase, we should not observe negative
effects of the failure in the channels of their subsequent purchases. We may even see positive effects if the switch to the other channel leads to greater purchases in that channel than would have occurred in the online channel. Brand preference dilution happens when app failure
experiencers get annoyed or dissatisfied with the retailer and lower their future purchases overall,
including in the store. It is possible that the channel substitution and brand preference dilution effects
operate when shoppers experience the app failure at different stages of the purchase funnel.
Shoppers who are close to purchase at the time of app failure may quickly switch channels and
complete their purchase through the mobile or desktop website forms of the online channel.
However, shoppers who are far from purchase when the app fails may prefer the retailer brand
less and buy less than what they had planned to in the future, perhaps because they switched to
competing retailers instead.
To explore the role of stage in the purchase funnel in explaining the differential effects of app
failure, we first examine the effects of app failure across shoppers based on whether they are
close to or far from purchase at the time of failure. For this analysis, we utilize information in the
data about the type of page on which the shopper was when the failure occurred. Table 6 reports
the DID model results when the app failure occurred on purchase related and non-purchase
related pages. Purchase-related pages in an app involve pages that are closer to purchase, such as
those relating to adding a product to the shopping cart, clicking checkout, or making payments. In contrast, non-purchase-related pages involve pages whose actions are farther from purchase, such as browsing products or obtaining store-related information. The effect of app failure is negative and significant (p < .001) on all the outcome variables for shoppers who experience failure on a non-purchase-related page, but not for shoppers who experience failure on a purchase-related page. Shoppers who already have a strong purchase intent and are on a purchase-related
page right before the failure are not as negatively affected as those without a strong purchase
intent or on a non-purchase related page.
Table 6
DID MODEL RESULTS FOR FAILURES OCCURRING ON PURCHASE AND NON-PURCHASE RELATED PAGES

                                         Failure on purchase related page                     Failure on non-purchase related page
Variable                                 Frequency of    Quantity of      Value of            Frequency of     Quantity of      Value of
                                         purchases       purchases        purchases           purchases        purchases        purchases
Failure experiencer x Post shock (DID)   .000 (.013)     -.016 (.038)     .907 (1.195)        -.053*** (.009)  -.108*** (.022)  -4.627*** (.763)
Failure experiencer                      -.019 (.011)    -.016 (.034)     -.246 (.52)         -.041*** (.007)  -.075*** (.019)  -1.208*** (.341)
Post shock                               .178*** (.006)  .236*** (.014)   26.947*** (.497)    .178*** (.006)   .236*** (.014)   26.947*** (.497)
Intercept                                .750*** (.005)  1.523*** (.012)  30.755*** (.219)    .750*** (.005)   1.523*** (.012)  30.755*** (.219)
R squared                                .004            .001             .019                .004             .001             .018
Mean Y                                   .836            1.637            44.270              .813             1.591            42.850

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences. N = 160,662 for failure on purchase related page. N = 217,418 for failure on non-purchase related page.
To further explore the role of the purchase funnel, we compare the change in the value of
purchases between the post and the pre app failure time periods for two groups of shoppers, close
to and far from purchase based on a median split of re-login attempts during the failure window.
The median number of attempts is three. The negative effect of failure for shoppers who make more re-login attempts is smaller (change in value of purchases, post minus pre: 28.03 for high-attempt shoppers vs. 26.95 for the control group, p > .01) than for shoppers who make fewer re-login attempts (20.33 for low-attempt shoppers vs. 26.95 for the control group, p < .001).
The group of shoppers who are close to purchase at the time of app failure are likely to
repeatedly attempt to re-login during the failure duration to complete their intended purchase.
Such shoppers may eventually make the purchase in another channel, resulting in channel
substitution. However, the group of shoppers who are far from purchase at the time of failure make fewer attempts to log back in during the failure time window. A greater negative effect of app
failure for such shoppers may be due to brand preference dilution.
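The re-login median split described above is straightforward to implement; a generic sketch in Python (the shopper IDs and attempt counts below are illustrative, not from the dataset):

```python
import statistics

def median_split(attempts_by_shopper):
    """Split failure experiencers into 'close to purchase' (re-login
    attempts above the median) and 'far from purchase' groups."""
    med = statistics.median(attempts_by_shopper.values())
    high = {s for s, a in attempts_by_shopper.items() if a > med}
    low = {s for s, a in attempts_by_shopper.items() if a <= med}
    return med, high, low

# Illustrative attempt counts; in the data the median is three.
attempts = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5}
med, high, low = median_split(attempts)
print(med, sorted(high), sorted(low))  # 3 ['s4', 's5'] ['s1', 's2', 's3']
```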
Failure-experiencers who were close to a purchase or had purchase intent would have had to
determine whether to complete the transaction, and if so, whether to do it online or offline. For
shoppers who typically buy online, the cost of going to the retailer’s website to complete a
purchase interrupted by the app failure is smaller than that of going to the store to complete the
purchase. Therefore, these shoppers will likely complete the transaction online and not exhibit
any significant decrease in shopping outcomes in the online channel post failure. Thus, the channel substitution effect likely explains the insignificant effects of app failure in the online channel. By
contrast, shoppers who typically buy in the retailer’s brick-and-mortar stores and who experience
the app failure, will likely have a diminished perception of the retailer with fewer incentives to
buy from the stores in the future. Thus, the brand preference dilution effect may prevail for these
shoppers after app failure. This effect is due to a negative spillover from the app channel to the
offline channel for shoppers experiencing the failure even if they are primarily offline shoppers.
Indeed, a negative message or experience can have an adverse spillover effect on attributes or
contexts outside the realm of the message or experience (Ahluwalia et al. 2001).
To further explore channel substitution toward the online channel, we examine the time
elapsed between the occurrence of the failure and subsequent purchase in the online channel.
Failure experiencers’ inter-purchase time online (Mean_treated = 162.8 hours) is much shorter than non-experiencers’ (Mean_control = 180.7 hours) (p = .003). This result further suggests that after an
app failure, shoppers look to complete their intended purchases in the online channel.
Next, to understand channel substitution toward the offline channel, we examine the effect of
app failure for shoppers who were geographically close to a physical store at the time of failure.
Shoppers who are closer to a store when they experience the app failure could more easily
complete their purchase in the store than shoppers farther from a store. Table 7 reports the DID
model for the subsample of shoppers located within two miles of the retailer’s store at the time of
failure. The results show that shoppers closer to the store are not negatively affected by the failure. Rather surprisingly, both the basket size and the monetary value of purchases for shoppers close to a store are significantly higher after the app failure (p < .05). This result
suggests that shoppers who experience a failure close to or at a physical store end up buying
additional items in the store. Thus, app failure has an unintended positive effect on such
shoppers. An implication is that channel substitution to a store leads to more purchases, but that
channel substitution is less likely for shoppers who are farther from the store. However, the
proportion of shoppers close to the store at the time of failure is very small (2.4%), so the overall
effects of app failure on offline purchases and all purchases are still negative.
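Constructing this subsample requires the distance between each shopper's location at failure time and the nearest store; a standard haversine sketch in Python (the coordinates are hypothetical, since the data are only described in terms of distance to the store):

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def near_store(shopper_loc, store_locs, threshold_miles=2.0):
    """True if any store is within the distance threshold."""
    return any(haversine_miles(*shopper_loc, *s) <= threshold_miles
               for s in store_locs)

# Hypothetical coordinates: a shopper about one mile from one of two stores.
stores = [(30.628, -96.334), (30.700, -96.400)]
print(near_store((30.6425, -96.334), stores))  # True
```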
Table 7
DID MODEL RESULTS FOR VALUE OF PURCHASES AND BASKET SIZE BY CHANNEL FOR SHOPPERS CLOSE TO A STORE (< 2 MILES) AT THE TIME OF FAILURE

                                         Offline                               Online
Variable                                 Value of           Basket size        Value of          Basket size
                                         purchases                             purchases
Failure experiencer x Post shock (DID)   13.542* (5.307)    .134* (.058)       .885 (1.178)      .023 (.020)
Failure experiencer                      2.419 (2.171)      .061 (.048)        -1.096* (.479)    -.012 (.013)
Post shock                               37.833*** (3.150)  .109** (.036)      2.382** (.882)    .019 (.012)
Intercept                                32.458*** (1.336)  .846*** (.031)     2.251*** (.390)   .059*** (.009)
R squared                                .0395              .0064              .0027             .0012
Mean Y                                   55.00              .95                3.18              .07

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 6,572. DID = Difference-in-Differences. Two miles is the median distance from the store at the time of failure.
To further analyze the role of distance to the store at the time of failure, we present the
contrast analysis between shoppers who were less than two miles and those who were more
than two miles from the nearest store at the time of failure in Table 8. The basket sizes of these
groups of shoppers do not differ post failure. However, shoppers closer to the store spend more
than those farther from the store post failure, suggesting that the app failure is associated with
channel substitution in purchases for shoppers closer to the store.
Table 8
CONTRAST ANALYSIS BASED ON DISTANCE TO STORE AT THE TIME OF FAILURE FOR FAILURE EXPERIENCERS

Variable                             Offline value of     Offline basket
                                     purchases            size
Close to store x Post shock (DID)    14.130* (5.707)      .083 (.065)
Close to store                       2.011 (2.357)        .051 (.051)
Post shock                           40.511*** (3.675)    .159** (.046)
Intercept                            34.021*** (1.593)    .855*** (.036)
R squared                            .0432                .0064
Mean Y                               54.96                .98

Note: N = 5,650. Closeness to store is defined using the median distance of 2 miles. There are 1,298 failure-experiencers within 2 miles of the store at the time of failure and 1,527 failure-experiencers who are 2 miles or farther from the store among those who opted in for location sharing. *** p < .001, ** p < .01, * p < .05.
Finally, to better understand how channel substitution and brand preference dilution effects
may act on different failure-experiencing shopper groups purchasing in different channels, we
compare the effects of app failure on orders above the free shipping cost threshold value ($35) in
the online and offline channels. Failure experiencers who intended to order items valued above
the threshold can quickly substitute the app channel with the Web channel without additional
cost, so we expect the app failure to have little effect on their frequency of online purchases. The
results of a DID model for the frequency of online and offline purchases above the free shipping
threshold appear in Table 9. Indeed, app failure has no significant (p > .10) effect in the online
channel but a negative and significant (p < .05) effect in the offline channel on the frequency of purchases, suggesting that offline shoppers significantly lower their preference and purchases after
the app failure. Thus, channel substitution appears to explain the null effect of app failure online,
while brand preference dilution seems to account for the negative effect of failure offline.
Table 9
DID MODEL RESULTS FOR THE AVERAGE NUMBER OF ORDERS ABOVE FREE SHIPPING ORDER VALUE THRESHOLD

Variable                                 Average number of      Average number of
                                         online orders          offline orders
                                         above $35              above $35
Failure experiencer x Post shock (DID)   .021 (.021)            -.010* (.005)
Failure experiencer                      -.008 (.016)           .007 (.004)
Post shock                               .118*** (.015)         .103*** (.004)
Intercept                                .414*** (.011)         .444*** (.003)
R-squared                                .017                   .013
N                                        8,178                  109,836
Mean Y                                   .50                    .48

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
We realize that much of the evidence for the mechanisms is descriptive and suggestive in
nature. Nevertheless, overall, the evidence is consistent with the asymmetry in the effect of app
failure on shopping outcomes across the two channels.
Next, we examine the heterogeneity in the sensitivity of shoppers to app failures in two ways.
We use a theory-based moderator approach as well as a data-driven machine learning approach.
Moderators: Relationship Strength and Prior Digital Use
The literatures on relationship marketing and service recovery suggest that two factors may moderate the impact of app failures on outcomes: relationship strength and prior digital channel use.
Relationship Strength. The service marketing literature offers mixed evidence on the
moderating role of the strength of customer relationship with the firm in the effect of service
failure on shopping outcomes. Some studies suggest that a stronger relationship may aggravate the effect of failures on product evaluation, satisfaction, and purchases (Chandrashekaran et al. 2007; Gijsenberg et al. 2015; Goodman et al. 1995). Other studies show that a stronger relationship attenuates the negative effect of service failures (Hess et al. 2003; Knox and van
Oest 2014). Consistent with the direct marketing literature (Bolton 1998; Schmittlein et al.
1987), we operationalize customer relationship using RFM (recency, frequency, and monetary
value) dimensions. Because of the high correlation between the interaction of frequency with (failure experiencer x post shock) and that of value of purchases with (failure experiencer x post shock) (r = .90, p < .001), and because value of purchases is more important for the retailer, we drop frequency of past purchases.
Prior Digital Channel/Online Use/Experience. The moderating effect of a shopper’s prior
digital channel/online use or experience with the retailer on app failure’s impact on shopping
outcomes could be positive or negative. On the one hand, more digitally experienced app users
may be less susceptible to the negative impact of an app crash on subsequent engagement with
the app than less digitally experienced app users (Shi et al. 2017) because they are conditioned to
expect some level of technology failures, consistent with the product harm crises literature
(Cleeren et al. 2013; Liu and Shankar 2015) and the expectation-confirmation theory (Cleeren et
al. 2008; Oliver 1980; Tax et al. 1998). On the other hand, prior digital exposure and experience
with the firm may heighten shopper expectations and make them less tolerant of failures. We
operationalize this variable as the cumulative number of purchases that the shopper made from
the retailer’s website prior to experiencing a failure.
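Both moderators can be built from a raw transaction log; a sketch in Python (the log layout, field names, and dates are assumptions for illustration, not the retailer's schema):

```python
from datetime import date

def rfm_and_online_count(transactions, as_of):
    """Per-shopper recency (days), frequency, monetary value, and
    cumulative online purchase count, computed before a cutoff date."""
    profiles = {}
    for shopper, tx_date, amount, channel in transactions:
        if tx_date >= as_of:
            continue  # only purchases before the failure
        p = profiles.setdefault(shopper,
                                {"recency": None, "frequency": 0,
                                 "monetary": 0.0, "online": 0})
        p["frequency"] += 1
        p["monetary"] += amount
        days_ago = (as_of - tx_date).days
        if p["recency"] is None or days_ago < p["recency"]:
            p["recency"] = days_ago  # days since most recent purchase
        if channel == "online":
            p["online"] += 1
    return profiles

# Hypothetical three-row log: (shopper, date, amount, channel).
log = [("s1", date(2018, 3, 1), 25.0, "offline"),
       ("s1", date(2018, 3, 20), 40.0, "online"),
       ("s2", date(2018, 2, 10), 15.0, "offline")]
print(rfm_and_online_count(log, as_of=date(2018, 4, 1)))
```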
The results of the model with relationship strength and past digital channel use as moderators
appear in Table 10. Consistent with our expectation, the monetary value of past purchases has
positive and significant interaction coefficients with the DID model variable across all the
outcome variables (p < .001). Thus, app failures have a smaller effect on shoppers with stronger
relationship with the retailer, consistent with the results of Ahluwalia et al. (2001). Recency has
negative coefficients (p < .001), suggesting that the more recent shoppers are less tolerant of
failure. A failure shock also has a greater effect on the frequency, quantity, and value of purchases (p < .001) of shoppers with more digital channel or online purchase experience with the retailer.
Table 10
DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES ACROSS CHANNELS: MODERATING EFFECTS OF RELATIONSHIP WITH RETAILER AND PAST ONLINE PURCHASE FREQUENCY

Variable                                 Frequency of      Quantity of       Value of
                                         purchases         purchases         purchases
Failure experiencer x Post shock (DID)   -.193*** (.012)   -.389*** (.03)    -12.935*** (.879)
DID x Past value of purchases            .000*** (.000)    .000*** (.000)    .017*** (.001)
DID x Recency of purchases               -.001*** (.000)   -.003*** (.000)   -.022*** (.006)
DID x Past online purchase frequency     -.019*** (.003)   -.029*** (.007)   -1.344*** (.220)
Past value of purchases                  .000*** (.000)    .001*** (.000)    .024*** (.000)
Recency                                  .005*** (.000)    .009*** (.000)    .206*** (.003)
Past online purchase frequency           .002 (.001)       .007 (.004)       -.638*** (.105)
Failure experiencer                      .007 (.007)       .037* (.017)      .589 (.497)
Post shock                               .178*** (.007)    .236*** (.017)    26.947*** (.505)
Intercept                                .640*** (.006)    1.141*** (.015)   25.034*** (.446)
R squared                                .159              .122              .093
Mean Y                                   .839              1.643             44.070

Notes: DID = Difference-in-Differences. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378.

Heterogeneity in Shoppers’ Sensitivity to App Failures (Treatment Effect)
In addition to the moderator variables from the service marketing literature examined earlier, we also explore heterogeneity in treatment effects relating to additional managerially useful observed variables (e.g., gender, loyalty level) not fully examined by prior research.
Unfortunately, including these variables as additional moderators in the DID analysis explodes
the number of main and interaction effects.
Recent methods of causal inference using machine learning such as “causal forest” allow us to
recover individual-level conditional average treatment effects (CATE) (Athey et al. 2017; Wager
and Athey 2018). The causal forest is an ensemble of causal trees that averages the predictions of
treatment effects produced by each tree for thousands of trees.9 It has been applied in marketing
to model customer churn and information disclosure (Ascarza 2018; Guo et al. 2018).
The estimates from the causal forest using 1,000 trees appear in Web Appendix Table A1. About
96% of the shoppers have a negative value of CATE with an average of -1.739. The distribution
of CATE across shoppers appears in Web Appendix Figure A1. The shopper quintiles based on CATE levels reflect this distribution in Web Appendix Figure A2, which shows that Segment 1, comprising the most sensitive shoppers, exhibits higher variance than the rest.
Next, we regress the CATE estimate on the covariate space to identify the covariates that best
explain treatment heterogeneity. The results appear in Web Appendix Table A2. They show that
all the covariates, including gender and loyalty, are significant (p < .001). Shoppers with higher
value of past purchases and more frequent online purchases are less sensitive to an app failure
than others. Shoppers who bought more recently in the past are less tolerant of an app failure.
Some of these results complement those from the moderator analysis.
The causal forest-derived CATE regression differs from the moderator DID regression in
important ways. First, the moderator regression uses the entire sample for estimation, while the
causal forest, the basis for the CATE regression, uses a subset of the data (the training sample)
for estimation. Second, the causal forest underlying the CATE regression splits the training data
further to estimate an honest tree, estimating from an even smaller subset of the moderator
regression sample. Third, relative to the linear moderator regression, the CATE regression can
handle a much larger number of covariates. Because of these differences, the results of the
CATE regression model may not exactly mirror those of the moderator regression model.
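To make the honest-estimation idea behind the causal forest concrete, the following toy Python sketch fits a single "stump" causal tree on simulated data: one half of the sample chooses the split that maximizes the difference in leaf-level treatment effects, and the held-out half estimates the leaf CATEs. A real causal forest averages thousands of such trees over random subsamples; this is an illustration, not the estimator used in the paper.

```python
import random
import statistics as st

def leaf_effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for x, d, y in rows if d == 1]
    control = [y for x, d, y in rows if d == 0]
    return st.mean(treated) - st.mean(control)

def honest_causal_stump(data, thresholds):
    """Single honest causal split: half the data picks the threshold,
    the held-out half estimates the leaf CATEs (Athey-Imbens honesty)."""
    random.shuffle(data)
    half = len(data) // 2
    train, est = data[:half], data[half:]

    def spread(t):  # how different are the two leaves' effects?
        left = [r for r in train if r[0] <= t]
        right = [r for r in train if r[0] > t]
        if min(len(left), len(right)) < 20:
            return float("-inf")
        return abs(leaf_effect(left) - leaf_effect(right))

    best = max(thresholds, key=spread)
    left_cate = leaf_effect([r for r in est if r[0] <= best])
    right_cate = leaf_effect([r for r in est if r[0] > best])
    return best, left_cate, right_cate

# Simulated shoppers: treatment hurts only when the covariate x <= 0.5.
random.seed(7)
data = []
for _ in range(4000):
    x, d = random.random(), random.randint(0, 1)
    effect = -3.0 if x <= 0.5 else 0.0
    y = 10.0 + d * effect + random.gauss(0, 1)
    data.append((x, d, y))

best, left_cate, right_cate = honest_causal_stump(
    data, thresholds=[0.3, 0.4, 0.5, 0.6, 0.7])
print(best, round(left_cate, 1), round(right_cate, 1))
```

With this data-generating process, the chosen threshold should be near 0.5, with a strongly negative CATE in the left leaf and a CATE near zero in the right leaf.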
9 In Web Appendix A, we provide an overview of causal trees and describe the algorithm for estimating a single causal tree followed by bagging a large number of causal trees into a forest.
ROBUSTNESS CHECKS AND RULING OUT ALTERNATIVE EXPLANATIONS
We perform several robustness checks and tests to rule out alternative explanations for the effect
of app failure on purchases.
Alternative model specifications. Although the failure in our data is exogenous, as an additional check, we also estimate models with shopper covariates, beyond our proposed DID model, to estimate the treatment effect of interest. Additionally, we estimate Poisson count data models for
the frequency and quantity variables. The results from these models replicate the findings from
Tables 3 and 4 and appear in the Web Appendix Tables B1-B2 and C1-C2, respectively. The
coefficients of the treatment effect from Tables B1 and C1 represent changes in outcomes due to
app failures, conditioned on covariates. These results are substantively similar to those in Tables
3 and 4. The insensitivity of the results to control variables suggests that the effect of
unobservables relative to these observed covariates would have to be very large to significantly
change our results (Altonji et al. 2005). Similarly, the results are robust to a Poisson
specification, reported in Tables B2 and C2.
Outliers. We re-estimate the models by removing outliers (extremely heavy spenders who are
greater than three standard deviations away from the mean in monetary value of purchases in the
pre-period) from our data. Web Appendix Tables B3 and C3 report these results. We find the
results to be consistent with and even stronger than those reported earlier.
Existing shoppers. Another possible explanation for the effect of app failures could be that only new or dormant shoppers are sensitive to failures, perhaps due to low switching costs. Therefore, we
remove those with no purchases in the last 12 months to see if their behavior is similar to that of
the existing shoppers. Indeed, Web Appendix Tables B4 and C4 report substantively similar
results after excluding the new or dormant shoppers.
Alternative measures of digital channel use moderators. In lieu of past online purchase frequency as a measure of prior digital channel use, we use measures based on a median split of the number and share of online purchases, and of prior app usage in the time between app launch and the server failure in the app. The results for the alternative online purchase measures are almost the same
as our proposed model results, except for prior app usage. Shoppers who use the app more
frequently appear to be less sensitive to failures as shown in Web Appendix Tables B5 and C5.
Regression discontinuity analysis. To ensure that there are no unobservable differences
between failure experiencers and non-experiencers based on the time of login, we carry out a
‘regression discontinuity’ (RD) style analysis in the one hour before the start time of the service
failure. For the RD analysis, we consider only app users in the neighborhood of this time, using
as control group those users who logged in one hour before and after the failure period and as
treated the users who logged in during the failure period. The results are substantively similar to
our main model results and are reported in Web Appendix Tables B6 and C6.
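The RD-style sample construction amounts to filtering app users by login timestamp relative to the failure window; a schematic Python sketch (the timestamps and window bounds are illustrative, not the actual failure times):

```python
from datetime import datetime, timedelta

def rd_groups(logins, fail_start, fail_end):
    """Assign users to treated/control by login time: treated users
    logged in during the failure window; controls logged in within
    one hour before or after it. Others are dropped."""
    treated, control = set(), set()
    hour = timedelta(hours=1)
    for user, ts in logins:
        if fail_start <= ts <= fail_end:
            treated.add(user)
        elif fail_start - hour <= ts < fail_start or fail_end < ts <= fail_end + hour:
            control.add(user)
    return treated, control

# Illustrative failure window and login times.
start = datetime(2018, 4, 7, 12, 0)
end = datetime(2018, 4, 7, 13, 30)
logins = [("u1", datetime(2018, 4, 7, 12, 30)),   # during failure: treated
          ("u2", datetime(2018, 4, 7, 11, 20)),   # 40 min before: control
          ("u3", datetime(2018, 4, 7, 9, 0))]     # outside window: dropped
print(rd_groups(logins, start, end))  # ({'u1'}, {'u2'})
```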
Longer-term effect of failures. Our main analysis shows 14-day effects of app failures. To
explore if these effects continue over longer periods of time, we examined the outcomes four
weeks pre- and post- the failure event. There is a steep fall in the period immediately after the failure. Purchases climb back over the next three weeks, but they settle at levels lower than the pre-period average. Thus, a diminished negative impact of the failure persists over time. These patterns appear in Figure 6 and Web Appendix Table D1. The table
shows the coefficients of the interactions of weekly dummies with TREAT for a DID regression.
Because an app failure occurs every 7-8 weeks, we estimate the effects four weeks pre and post
so as to avoid our pre- or post- periods overlapping with other failures.
Figure 6 APP FAILURE EFFECTS ON VALUE OF PURCHASES OVER FOUR WEEKS
Note: The effects for all but one of the pre-failure weeks are insignificant. The horizontal line represents the average treatment effect.
Stacked model for channel effects. The results for online and offline purchases in Table 4 do
not show the relative sizes of the effects across the two channels. To examine these relative
effects, we estimate a stacked model of online and offline outcomes that includes a channel
dummy. The results for this model appear in Web Appendix Table D2. We interpret the effects
as a proportion of the purchases within the channel and conclude that the effects in the offline
channel are more negative than those in the online channel (p < .01). We also estimated a DID
regression model with value of purchases in the offline channel as a proportion of total purchases
and found negative and significant effects of failure (p < .01).
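A minimal sketch of the stacked specification: the outcomes of both channels are stacked into one sample with a channel dummy, and the DID x OFFLINE interaction tests whether the offline effect is more negative than the online one. The data below are simulated with hypothetical effect sizes, not the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stacked data: each shopper-period contributes one offline and
# one online row. True DID effects (illustrative): -2.0 offline, -0.1 online.
n = 40_000
treat = rng.integers(0, 2, n).astype(float)
post = rng.integers(0, 2, n).astype(float)

y_off = 40 + 2 * post + treat * post * -2.0 + rng.normal(0, 4, n)
y_on = 3 + 2 * post + treat * post * -0.1 + rng.normal(0, 4, n)

y = np.concatenate([y_off, y_on])
off = np.concatenate([np.ones(n), np.zeros(n)])   # channel dummy: 1 = offline
t2, p2 = np.tile(treat, 2), np.tile(post, 2)
did = t2 * p2

# Stacked DID: intercept, TREAT, POST, OFFLINE, DID, and DID x OFFLINE.
X = np.column_stack([np.ones(2 * n), t2, p2, off, did, did * off])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_online = beta[4]          # DID effect in the online channel
did_offline_extra = beta[5]   # how much more negative the offline effect is
```

A significantly negative `did_offline_extra` corresponds to the paper's conclusion that offline effects are more negative than online effects.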
DISCUSSION, MANAGERIAL IMPLICATIONS, AND LIMITATIONS
Summary
In this paper, we addressed novel research questions: What is the effect of a service failure in a
retailer’s mobile app on the frequency, quantity, and monetary value of purchases in online and
offline channels? What possible mechanisms may explain these effects? How do shoppers’
relationship strength and prior digital channel use moderate these effects? How heterogeneous is
shoppers’ sensitivity to failures? By answering these questions, our research fills an important
gap at the crossroads of three disparate streams of research in different stages of development:
the mature stream of service failures, the growing stream of omnichannel marketing, and the
nascent stream of mobile marketing. We leveraged a random systemwide failure in the app to
measure the causal effect of app failure. To our knowledge, this is the first study to causally
estimate the effects of digital service failure using real world data. Using unique data spanning
online and offline retail channels, we examined the spillover effects of such failures across
channels and examined heterogeneity in these effects based on channels and shoppers.
Our results reveal that app failures have a significant negative effect on shoppers’ frequency,
quantity, and monetary value of purchases across channels. These effects are heterogeneous
across channels and shoppers. Interestingly, the overall decreases in purchases across channels
are driven by reductions in store purchases and not in digital channels. Furthermore, we find that shoppers with higher monetary value of past purchases are less sensitive to app failures.
Overall, our nuanced analyses of the mechanisms by which an app failure affects purchases
offer new and insightful explanations in a cross-channel context. Our findings are consistent with
the view that some customers may be tolerant of technological failures (Meuter et al. 2000).
Finally, our study offers novel insights into the cross-channel implications of app failures.
Economic Significance
The economic effects of failures are sizeable enough for any retailer to reconsider its service failure prevention and recovery strategies. Based on our estimates, the economic impact of an app failure is a revenue loss of about $.48 million.10 The retailer experiences about 5-7 failures each year, resulting in an annual loss of $2.4-$3.4 million. This loss may not amount to a sizeable portion of the retailer's annual revenues. However, given the low retail margins and the retailer's vulnerable financial condition, it is a substantial amount for this retailer.
10 We compute this figure by using the weekly effect coefficients in Table D1, i.e., $(.82 + 1.70 + .58 + .53)*N for the first four weeks and $(.64)*5*N, assuming that the fifth week's effect remains for another five weeks until the next failure, for N = 70,568 failure experiencers, totaling $.48 million.
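The arithmetic in footnote 10 can be verified directly; the sketch below uses only figures reported in the text:

```python
# Back-of-envelope reproduction of the revenue-loss computation in footnote 10.
weekly_effects = [0.82, 1.70, 0.58, 0.53]    # $ per experiencer, post weeks 1-4 (Table D1)
week5_effect = 0.64                          # assumed to persist for five more weeks
n_experiencers = 70_568

loss_per_shopper = sum(weekly_effects) + week5_effect * 5
total_loss = loss_per_shopper * n_experiencers        # per-failure loss, ~$.48 million

annual_loss = (5 * total_loss, 7 * total_loss)        # 5-7 failures per year
```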
The economic effect is meaningful for several reasons. First, because retailers operate on thin margins (2-3% in many categories) and are cost-conscious, such an economic loss is impactful. Second,
the effect size of 7.1% from our results is consistent with and even higher than those from other
similar causal studies. For example, exposure to banner advertising has been shown to lift
purchase intention by .473% worth 42 cents/click to the firm (Goldfarb and Tucker 2011).
Goldfarb and Tucker (2011) argue, “although the coefficient may seem small, it suggests an
economically important impact of online advertising.” Third, in the mobile context, the effect of
being in a crowd (of five people relative to two per square meter when receiving a mobile
promotion) results in an economically meaningful 2.2% more clicks (Andrews et al. 2015).
Fourth, Akca and Rao (2020) argue that a revenue drop of $5.32 million is economically
significant for a large company such as Orbitz. Fifth, as sales through the mobile app and online
sales are growing rapidly, this effect is only getting larger. Sixth, our estimates are for one two-
hour app failure in a year. Finally, the effects continue over a longer five-week period.
Managerial Implications
Service failures and low-quality service likely lead customers to terminate their relationship with a firm (Sriram et al. 2015). The
insights from our research better inform executives in managing their mobile app and channels
and offer practitioner implications for service failure preventive and recovery strategies.
Preventive Strategies. Managers can use the estimate that an app failure results in a 7.1%
decrease in monetary value of purchases to budget resources for their efforts to prevent or reduce
app failures. The result that the adverse effect of failure is lower for shoppers closer to purchase
and purchasing less recently offers interesting pointers for retailers to prevent damage to their
brands and revenues. In general, managers should encourage shoppers to use the app more, get
closer to purchase, and purchase more through the app. Managers could offer limited-time
incentives to shoppers who have not clicked the checkout or purchase tabs in the app.
By identifying failure-sensitive shoppers based on relationship strength, prior digital use, and
individual-level CATE estimates, managers can take proactive actions to prevent these shoppers
from reducing their shopping intensity with the firm. Figure 7 represents the loss of revenues
(spending) from each percentile of shoppers at different levels of failure sensitivity.
Figure 7 RETAILER’S REVENUE LOSS BY PERCENTILE OF SHOPPERS EXPERIENCING APP FAILURE
Note: CATE = Conditional Average Treatment Effect.
About 70% of the losses in revenues due to failure arise from just 47% of the shoppers.
Managers can manage these shoppers’ expectations through email and app notification
messaging channels. Warning shoppers of the typical number of disruptions in the app can preempt
negative attributions and attitudes, and limit potential brand dilution and drop in revenues due to
app failure.
Recovery Strategies. The finding that app failures result in reduced purchases across channels
suggests that managers should develop interventions and recovery strategies to mitigate the
negative effects of app failures not just in the mobile channel, but also in other channels, in
particular, the offline channel. Thus, seamlessly integrating data from a mobile app with data
from its stores and websites can help an omnichannel retailer build continuity in shoppers’
experiences.
Immediately after a shopper experiences an app failure, the manager of the app should
provide gentle nudges and even incentives for the shopper to complete an abandoned transaction
on the app. Typically, a manager may need to provide these nudges and incentives through other
communication channels such as email, phone call, or face-to-face chat. These nudges are similar
in spirit and execution to those from firms such as Fitbit and Amazon, which remind customers
through email to reconnect when they disconnect their watch and smart speaker, respectively. If
the store is a dominant channel for the retailer, the retailer should use its store associates to
reassure or incentivize shoppers. In some cases, managers can even offer incentives in other
channels to complete a transaction disrupted by an app failure.
Because diminished purchases after failure result from reduced engagement, managers should aim to enhance engagement after a systemwide failure. Once service is restored, managers could induce shoppers to use the app more through gamification features in the app or by providing enhanced loyalty points for logging back into the app.
The finding that app failure can enhance spending for shoppers who experience the failure close to the store offers useful cross-selling opportunities for the retailer. After a systemwide failure is resolved, retailers can proactively promote products, selected from each failure-experiencing shopper's purchase history, in the store nearest to that shopper.
Managers should mitigate the negative effects of app failures for the most sensitive shoppers
first. They should proactively identify failure-sensitive shoppers and design preemptive
strategies to mitigate any adverse effects. We find that shoppers with weaker relationships with the provider are more sensitive to failures. Thus, firms should address such shoppers for recovery
after a careful cost-benefit analysis. This is important because apps serve as a gateway for future
purchases for these shoppers.
Finally, our analysis of heterogeneity in shoppers’ sensitivity to app failures suggests that
managers should satisfy first the shoppers with the highest values of CATE. Interventions
targeted at the 47% of the shoppers who contribute to 70% of losses could lead to higher returns.
Limitations
Our study has limitations that future research can address. First, we have data on a limited
number of failures, so we could not fully explore all the failures with varying durations. Second,
our results are most informative for similar retailers that have a large brick-and-mortar presence
but growing online and in-app purchases. If data are available, future research could study app
failures for primarily online retailers with an expanding offline presence (e.g., Bonobos, Warby
Parker). Third, we do not have data on competing apps that shoppers may use. Additional
research could study shoppers’ switching behavior if data on competing apps are available.
Fourth, our data contain a relatively low number of purchases in the mobile channel. For better
generalizability of the extent of spillover across channels, our analysis could be extended to
contexts in which a substantial portion of purchases are made within the app. Fifth, we do not
have data on purchases made through the app vs. mobile browser. Studying differences between
these two mobile sub-channels is a fruitful future research avenue. Finally, mobile apps may be
an effective way to recover from the adverse effects of service failures (Tucker and Yu 2018).
Our approach also provides a way to identify app-failure sensitive shoppers, but we do not have
data on shoppers’ responses to service recovery to recommend the best mitigation strategy. The
strategies we do recommend could be tested in ethically permissible field studies.
REFERENCES
Ahluwalia, Rohini, H. Rao Unnava, and Robert E. Burnkrant (2001), “The Moderating Role of Commitment on the Spillover Effect of Marketing Communications,” Journal of Marketing Research, 38 (4), 458–70.
Akca, Selin and Anita Rao (2020), “Value of Aggregators,” Marketing Science, 39 (5), 893–922.
Altonji, Joseph G., Todd E. Elder, and Christopher R. Taber (2005), “Selection on Observed and
Unobserved Variables: Assessing the Effectiveness of Catholic Schools,” Journal of Political Economy, 113 (1), 151–84.
Andreassen, Tor Wallin (1999), “What Drives Customer Loyalty with Complaint Resolution?” Journal of Service Research, 1 (4), 324-32.
Andrews, Michelle, Xueming Luo, Zheng Fang, and Anindya Ghose (2015), “Mobile Ad Effectiveness: Hyper-Contextual Targeting with Crowdedness,” Marketing Science, 35 (2), 218–33.
Angrist, Joshua D. and Jörn-Steffen Pischke (2009), Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton: Princeton University Press.
Ansari, Asim, Carl F. Mela, and Scott A. Neslin (2008), “Customer Channel Migration,” Journal of Marketing Research, 45 (1), 60-76.
Athey, Susan and Guido Imbens (2016), “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences, 113 (27), 7353-7360.
Athey, Susan, Guido Imbens, Thai Pham, and Stefan Wager (2017), “Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges,” American Economic Review, 107 (5), 278–81.
Avery, Jill, Thomas J. Steenburgh, John Deighton, and Mary Caravella (2012), “Adding Bricks to Clicks: Predicting the Patterns of Cross-Channel Elasticities Over Time,” Journal of Marketing, 76 (3), 96–111.
Barron’s (2018), “Walmart: Can It Meet Its Digital Sales Growth Targets?,” (accessed November 5, 2020), [available at https://www.barrons.com/articles/walmart-can-it-meet-its-digital-sales-growth-targets-1519681783].
Bell, David R., Santiago Gallino, and Antonio Moreno (2018), “Offline Showrooms in Omni-channel Retail: Demand and Operational Benefits,” Management Science, 64 (4), 1629-51.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004), “How Much Should We Trust Differences-In-Differences Estimates?” The Quarterly Journal of Economics, 119 (1), 249–75.
Bitner, Mary Jo, Bernard H. Booms, and Mary Stanfield Tetreault (1990), “The Service Encounter: Diagnosing Favorable and Unfavorable Incidents,” Journal of Marketing, 54 (1), 71–84.
Blancco (2016), “The State of Mobile Device Performance and Health: Q2,” (accessed November 5, 2020), [available at https://www2.blancco.com/en/research-study/state-of-mobile-device-performance-and-health-trend-report-q2-2016].
Bolton, Ruth N. (1998), “A Dynamic Model of the Duration of the Customer’s Relationship with a Continuous Service Provider: The Role of Satisfaction,” Marketing Science, 17 (1), 45–65.
Brynjolfsson, Erik, Yu Jeffery Hu, and Mohammad S. Rahman (2013), “Competing in the Age of Omnichannel Retailing,” MIT Sloan Management Review, (accessed November 5, 2020), [available at https://sloanreview.mit.edu/article/competing-in-the-age-of-omnichannel-retailing/].
Chandrashekaran, Murali, Kristin Rotte, Stephen S. Tax, and Rajdeep Grewal (2007), “Satisfaction Strength and Customer Loyalty,” Journal of Marketing Research, 44 (1), 153–63.
Chintagunta, Pradeep K., Junhong Chu, and Javier Cebollada (2011), “Quantifying Transaction Costs in Online/Off-line Grocery Channel Choice,” Marketing Science, 31 (1), 96–114.
Cleeren, Kathleen, Marnik G. Dekimpe, and Kristiaan Helsen (2008), “Weathering Product-harm Crises,” Journal of the Academy of Marketing Science, 36 (2), 262–70.
Cleeren, Kathleen, Harald J. van Heerde, and Marnik G. Dekimpe (2013), “Rising from the Ashes: How Brands and Categories can Overcome Product-Harm Crises,” Journal of Marketing, 77 (2), 58-77.
Computerworld (2014), “iOS 8 app crash rate falls 25% since release,” Computerworld, (accessed November 5, 2020), [available at https://www.computerworld.com/article/2841794/ios-8-app-crash-rate-falls-25-since-release.html].
Dimensional Research (2015), “Mobile User Survey: Failing to Meet User Expectations,” TechBeacon, (accessed November 5, 2020), [available at https://techbeacon.com/resources/survey-mobile-app-users-report-failing-meet-user-expectations].
Dotzel, Thomas, Venkatesh Shankar, and Leonard L. Berry (2013), “Service Innovativeness and Firm Value,” Journal of Marketing Research, 50 (2), 259-76.
Fong, Nathan M., Zheng Fang, and Xueming Luo (2015), “Geo-Conquesting: Competitive Locational Targeting of Mobile Promotions,” Journal of Marketing Research, 52 (5), 726–35.
Forbes, Lukas P. (2008), “When Something Goes Wrong and No One is Around: Non‐internet Self‐service Technology Failure and Recovery,” Journal of Services Marketing, 22 (4), 316–27.
Forbes, Lukas P., Scott W. Kelley, and K. Douglas Hoffman (2005), “Typologies of E-commerce Retail Failures and Recovery Strategies,” Journal of Services Marketing, 19 (5), 280–92.
Forman, Chris, Anindya Ghose, and Avi Goldfarb (2009), “Competition between Local and Electronic Markets: How the Benefit of Buying Online Depends on Where You Live,” Management Science, 55 (1), 47–57.
Ghose, Anindya, Hyeokkoo Eric Kwon, Dongwon Lee, and Wonseok Oh (2018), “Seizing the Commuting Moment: Contextual Targeting Based on Mobile Transportation Apps,” Information Systems Research, 30 (1), 154-74.
Gijsenberg, Maarten J., Harald J. Van Heerde, and Peter C. Verhoef (2015), “Losses Loom Longer than Gains: Modeling the Impact of Service Crises on Perceived Service Quality over Time,” Journal of Marketing Research, 52 (5), 642-56.
Goldfarb, Avi and Catherine Tucker (2011), “Online Display Advertising: Targeting and Obtrusiveness,” Marketing Science, 30 (3), 389–404.
Google (2020), “Find Out How You Stack Up to New Industry Benchmarks for Mobile Page Speed,” Think with Google, (accessed November 5, 2020), [available at https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/mobile-page-speed-new-industry-benchmarks/].
Google M/A/R/C Study (2013), “Mobile in-store Research: How In-store Shoppers are Using Mobile Devices,” Google M/A/R/C.
Guo, Tong, S. Sriram, and Puneet Manchanda (2017), “The Effect of Information Disclosure on Industry Payments to Physicians,” SSRN Scholarly Paper, Rochester, NY: Social Science Research Network.
Halbheer, Daniel, Dennis L. Gärtner, Eitan Gerstner, and Oded Koenigsberg (2018), “Optimizing Service Failure and Damage Control,” International Journal of Research in Marketing, 35 (1), 100–15.
Hansen, Nele, Ann-Kristin Kupfer, and Thorsten Hennig-Thurau (2018), “Brand Crises in the Digital Age: The Short- and Long-term Effects of Social Media Firestorms on Consumers and Brands,” International Journal of Research in Marketing, 35 (4), 557–74.
Hess, Ronald L., Shankar Ganesan, and Noreen M. Klein (2003), “Service Failure and Recovery: The Impact of Relationship Factors on Customer Satisfaction,” Journal of the Academy of Marketing Science, 31 (2), 127–45.
Hoffman, K. Douglas and John E. G. Bateson (2001), Essentials of Services Marketing: Concepts, Strategies and Cases, Fort Worth: South-Western College Pub.
Kim, Su Jung, Rebecca Jen-Hui Wang, and Edward C. Malthouse (2015), “The Effects of Adopting and Using a Brand’s Mobile Application on Customers’ Subsequent Purchase Behavior,” Journal of Interactive Marketing, 31, 28–41.
Knox, George and Rutger van Oest (2014), “Customer Complaints and Recovery Effectiveness: A Customer Base Approach,” Journal of Marketing, 78 (5), 42-57.
Liu, Yan and Venkatesh Shankar (2015), “The Dynamic Impact of Product-Harm Crises on Brand Preference and Advertising Effectiveness: An Empirical Analysis of the Automobile Industry,” Management Science, 61 (10), 2514–35.
Ma, Liye, Baohong Sun, and Sunder Kekre (2015), “The Squeaky Wheel Gets the Grease—An Empirical Analysis of Customer Voice and Firm Intervention on Twitter,” Marketing Science, 34 (5), 627–45.
McCollough, Michael A., Leonard L. Berry, and Manjit S. Yadav (2000), “An Empirical Investigation of Customer Satisfaction after Service Failure and Recovery,” Journal of Service Research, 3 (2), 121-37.
Meuter, Matthew L., Amy L. Ostrom, Robert I. Roundtree, and Mary Jo Bitner (2000), “Self-Service Technologies: Understanding Customer Satisfaction with Technology-Based Service Encounters,” Journal of Marketing, 64 (3), 50-64.
Narang, Unnati and Venkatesh Shankar (2019), “Mobile App Introduction and Online and Offline Purchases and Product Returns,” Marketing Science, 38 (5), 756–72.
National Retail Federation (2018), “Top 100 Retailers 2018,” NRF, (accessed November 5, 2020), [available at https://nrf.com/resources/top-retailers/top-100-retailers/top-100-retailers-2018].
Neumann, Nico, Catherine E Tucker, and Timothy Whitfield (2019), “Frontiers: How Effective Is Third-Party Consumer Profiling? Evidence from Field Studies,” Marketing Science, 38 (6), 918-26.
Oliver, Richard L. (1980), “A Cognitive Model of the Antecedents and Consequences of Satisfaction Decisions,” Journal of Marketing Research, 17 (4), 460–69.
Pauwels, Koen and Scott A. Neslin (2015), “Building With Bricks and Mortar: The Revenue Impact of Opening Physical Stores in a Multichannel Environment,” Journal of Retailing, 91 (2), 182–97.
Schmittlein, David C., Donald G. Morrison, and Richard Colombo (1987), “Counting Your Customers: Who Are They and What Will They Do Next?” Management Science, 33 (1), 1–24.
Shi, Savannah Wei, Kirthi Kalyanam, and Michel Wedel (2017), “What Does Agile and Lean Mean for Customers? An Analysis of Mobile App Crashes,” Working Paper, Santa Clara University.
Smith, Amy K. and Ruth N. Bolton (1998), “An Experimental Investigation of Customer Reactions to Service Failure and Recovery Encounters: Paradox or Peril?,” Journal of Service Research, 1 (1), 65–81.
Sriram, S., Pradeep K. Chintagunta, and Puneet Manchanda (2015), “Service Quality Variability and Termination Behavior,” Management Science, 61 (11), 2739–59.
Tax, Stephen S., Stephen W. Brown, and Murali Chandrashekaran (1998), “Customer Evaluations of Service Complaint Experiences: Implications for Relationship Marketing,” Journal of Marketing, 62 (2), 60–76.
Tucker, Catherine E, and Shuyi Yu (2019), “Does IT lead to More Equal treatment? An Empirical Study of the Effect of Smartphone use on Customer Complaint Resolution,” Working Paper, Massachusetts Institute of Technology.
Wager, Stefan and Susan Athey (2018), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 113 (523), 1228–42.
Wang, Kitty and Avi Goldfarb (2017), “Can Offline Stores Drive Online Sales?” Journal of Marketing Research, 54 (5), 706-19.
Xu, Kaiquan, Jason Chan, Anindya Ghose, and Sang Pil Han (2016), “Battle of the Channels: The Impact of Tablets on Digital Commerce,” Management Science, 63 (5), 1469–92.
WEB APPENDIX A CAUSAL FOREST
Causal Trees: Overview
A causal tree is similar to a regression tree. The typical objective of a regression tree is to build
accurate predictions of the outcome variable by recursively splitting the data into subgroups that
differ the most on the outcome variable based on covariates. A regression tree has
decision/internal/split nodes characterized by binary conditions on covariates and leaf or terminal
nodes at the bottom of the tree. The regression tree algorithm continuously partitions the data,
evaluating and re-evaluating at each node to determine (a) whether further splits would improve
prediction, and (b) the covariate and the value of the covariate on which to split. The goodness-
of-fit criterion used to evaluate the splitting decision at each node is the mean squared error
(MSE) computed as the deviation of the observed outcome from the predicted outcome. The tree
algorithm continues making further splits as long as the MSE decreases by more than a specified
threshold.
The causal tree model adapts the regression tree algorithm in several ways to make it
amenable to causal inference. First, it explicitly shifts the goodness-of-fit criterion to treatment
effects rather than the MSE of the outcome measure. Second, it employs “honest” estimates, that
is, the data on which the tree is built (splitting data) are separate from the data on which it is
tested for prediction of heterogeneity (estimating data). Thus, the tree is honest if, for a unit i in the training sample, it uses the response Y_i either to estimate the within-leaf treatment effect or to decide where to place the splits, but not both (Athey and Imbens 2016; Athey et al. 2017). To
avoid overfitting, we use cross-validation approaches in the tree-building stage.
Importantly, the goodness-of-fit criterion for causal trees is the difference between the
estimated and the actual treatment effect at each node. While this criterion ensures that all the
degrees of freedom are used well, it is challenging because we never observe the true effect.
Causal Tree: Goodness-of-fit Criterion
Following Wager and Athey (2018), if we have n independent and identically distributed training examples labeled i = 1, ..., n, each of which consists of a feature vector X_i ∈ [0, 1]^d, a response Y_i ∈ ℝ, and a treatment indicator W_i ∈ {0, 1}, the CATE at x is:
(2) τ(x) = E[Y_i(1) − Y_i(0) | X_i = x]
We assume unconfoundedness, i.e., conditional on X_i, the treatment W_i is independent of the potential outcomes. Because the true treatment effect is not observed, we cannot directly compute the goodness-of-fit criterion for creating splits in a tree. This goodness-of-fit criterion is as follows.
(3) Q_infeasible = E[(τ_i(X_i) − τ̂(X_i))²]
Because τ_i(X_i) is not observed, we follow Athey and Imbens’s (2016) approach to create a transformed outcome Y_i* that represents the true treatment effect. Assume that the treatment indicator W_i is a random variable. Suppose there is a 50% probability for a unit i to be in the treated or the control group; an unbiased estimate of the true treatment effect can then be obtained for that unit using only its outcome Y in the following way. Let
(4) Y_i* = 2Y_i if W_i = 1 and Y_i* = −2Y_i if W_i = 0
It follows that:
(5) E[Y_i*] = 2 · (½E[Y_i(1)] − ½E[Y_i(0)]) = E[τ_i]
Therefore, we can compute the goodness-of-fit criterion for deciding node splits in a causal
tree using the expectation of the transformed outcome (Athey and Imbens 2016). Once we
generate causal trees, we can compute the treatment effect within each leaf because it has a finite
number of observations and standard asymptotics apply within a leaf. The differences in the
treated and control units’ outcomes within each leaf produces the treatment effect in that leaf.
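A minimal numpy simulation, with invented numbers, illustrates why the transformed outcome recovers the average treatment effect even though no individual effect is ever observed:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulation of the transformed-outcome construction (Athey and Imbens 2016)
# under 50% treatment probability. All numbers here are illustrative.
n = 500_000
W = rng.integers(0, 2, n)                    # treatment indicator
tau = rng.normal(-1.5, 1.0, n)               # heterogeneous unit-level effects
Y0 = rng.normal(40, 5, n)                    # potential outcome under control
Y = Y0 + W * tau                             # observed outcome

# Equation (4): Y* = 2Y if treated, -2Y if control.
Y_star = np.where(W == 1, 2 * Y, -2 * Y)

# Equation (5): the mean of Y* is an unbiased estimate of the average
# treatment effect E[tau], even though no tau_i is observed.
print(Y_star.mean(), tau.mean())
```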
Causal Forest Ensemble
In the final step, we create an ensemble of trees using ideas from model averaging and bagging.
Specifically, we take predictions from thousands of trees and average over them (Guo et al.
2018). This step retains the unbiased, honest nature of tree-based estimates but reduces the
variance. The forest averages over the estimates from B trees in the following manner.
(6) τ̂(x) = (1/B) Σ_{b=1}^B τ̂_b(x)
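As a toy sketch of the honest splitting and bagging ideas above, the following code builds an ensemble of depth-1 'honest' causal trees (stumps) on simulated data with a single covariate, then averages the trees' predictions as in equation (6). This illustrates the mechanics only; the data-generating process and effect sizes are hypothetical, and the paper's analysis uses full causal trees on the retailer's data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: one covariate drives effect heterogeneity.
n = 20_000
x = rng.uniform(0, 1, n)
W = rng.integers(0, 2, n)
tau_true = np.where(x < 0.5, -2.0, -0.5)     # hypothetical heterogeneous effect
Y = 40 + 5 * x + W * tau_true + rng.normal(0, 1, n)

def diff_in_means(ids, W, Y):
    # Within-leaf treatment effect: treated mean minus control mean.
    return Y[ids][W[ids] == 1].mean() - Y[ids][W[ids] == 0].mean()

def honest_causal_stump(x, W, Y, rng):
    """Pick the split on one half of the sample; estimate the within-leaf
    treatment effects on the other half (honesty)."""
    idx = rng.permutation(len(x))
    split_half, est_half = idx[: len(x) // 2], idx[len(x) // 2 :]
    best_t, best_gain = 0.5, -np.inf
    for t in np.linspace(0.1, 0.9, 17):
        left = split_half[x[split_half] < t]
        right = split_half[x[split_half] >= t]
        if len(left) < 100 or len(right) < 100:
            continue
        # Split criterion: squared difference in estimated leaf effects.
        gain = (diff_in_means(left, W, Y) - diff_in_means(right, W, Y)) ** 2
        if gain > best_gain:
            best_gain, best_t = gain, t
    # Honest step: leaf effects come from the held-out estimation half.
    left = est_half[x[est_half] < best_t]
    right = est_half[x[est_half] >= best_t]
    return best_t, diff_in_means(left, W, Y), diff_in_means(right, W, Y)

# Bagging: each tree sees a bootstrap resample; predictions are averaged
# across the B trees as in equation (6).
B = 25
stumps = []
for _ in range(B):
    boot = rng.integers(0, n, n)
    stumps.append(honest_causal_stump(x[boot], W[boot], Y[boot], rng))

def forest_cate(x_new):
    preds = [np.where(x_new < t, eff_l, eff_r) for t, eff_l, eff_r in stumps]
    return np.mean(preds, axis=0)

cate = forest_cate(np.array([0.2, 0.8]))     # CATE estimates at two covariate values
```

The ensemble should recover a more negative effect at x = 0.2 than at x = 0.8, mirroring the way the causal forest surfaces shopper-level heterogeneity.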
Because monetary value of purchases is the key outcome variable of interest to the retailer, we
estimate individual level treatment effect on value of purchases for each failure experiencer
separately using the observed covariate data. These covariates include gender and loyalty
program in addition to the three theoretically-driven moderators, namely, value of past
purchases, recency of past purchases, and online buying/digital experience.11 These individual
attributes are important for identifying individual-level effects and for developing targeting
approaches (e.g., Neumann et al. 2019). We use a random sample of two-thirds of our data as
training data and the remaining one-third as test data for predicting CATE. We use half of the
training data to maintain honest estimates and for cross-validation to avoid overfitting. The
results appear in Tables A1 and A2.
Table A1
CAUSAL FOREST RESULTS: SUMMARY OF INDIVIDUAL SHOPPER TREATMENT EFFECTS FOR VALUE OF PURCHASES

               Ntest     Mean      SD
τ̂              45,563    -1.660    1.136
τ̂ | τ̂ < 0      43,748    -1.739    1.089
τ̂ | τ̂ > 0       1,815      .239     .198

Note: τ̂ represents the estimated Conditional Average Treatment Effect (CATE) for each individual in the test data.
11 Age and zip code information were not available for all the shoppers in our data period because the retailer followed strict privacy guidelines.
Table A2
RESULTS OF CAUSAL FOREST POST-HOC CATE REGRESSION FOR VALUE OF PURCHASES

Variable                          Coefficient (Standard Error)
Intercept                         -.958*** (.012)
Past value of purchases            .000*** (.000)
Recency of purchases              -.005*** (.000)
Past online purchase frequency     .037*** (.002)
Gender (female)                   -.190*** (.008)
Loyalty program                   -.340*** (.011)
R squared                          .493

Note: *** p < .001. N = 45,563.
Figure A1 CAUSAL FOREST RESULTS: INDIVIDUAL CATE
Figure A2 CAUSAL FOREST RESULTS: QUINTILES BY CATE
Note: Segment 1 represents shoppers most adversely affected by failure while Segment 5 represents those who are least adversely affected.
WEB APPENDIX B ROBUSTNESS CHECK FOR TABLE 3 (MAIN TREATMENT EFFECT) RESULTS
In this section, we present the results for robustness checks for the main estimation in Table 3 relating to: (a) alternative models with covariates and using Poisson model (Tables B1-B2), (b) outliers (Table B3), (c) existing shoppers (Table B4), (d) alternative measures for prior use of digital channels (Table B5), and (e) regression-discontinuity style analysis (Table B6).
Table B1
ROBUSTNESS OF TABLE 3 RESULTS TO INCLUSION OF COVARIATES ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.025* (.011)            -.063* (.027)           -2.092** (.763)
Failure experiencer                      -.018* (.008)            -.024 (.019)            -.624 (.539)
Post shock                               .180*** (.008)           .238*** (.019)          27.182*** (.549)
Gender                                   -.050*** (.011)          -.112*** (.028)         -3.367*** (.809)
Loyalty program                          -.171*** (.006)          -.416*** (.015)         -8.733*** (.415)
Intercept                                .813*** (.006)           1.678*** (.014)         33.900*** (.395)
R squared                                .0114                    .0081                   .0221
Mean Y                                   .82                      1.61                    43.31

Notes: N = 273,378. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table B2
DID POISSON MODEL RESULTS ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases
Failure experiencer x Post shock (DID)   -.0209* (.0124)          -.0309*** (.0158)
Failure experiencer                      -.0282*** (.0089)        -.0197*** (.0117)
Post shock                               .2133*** (.0089)         .1440*** (.0111)
Intercept                                -.2872*** (.0063)        .4209*** (.0081)
Log pseudo-likelihood                    -378,710                 -711,963
Mean Y                                   .82                      1.61

Notes: Robust standard errors in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
Table B3
ROBUSTNESS OF TABLE 3 RESULTS TO OUTLIER SPENDERS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.023* (.010)            -.055* (.025)           -2.139** (.723)
Failure experiencer                      -.022** (.007)           -.031 (.018)            -.742 (.511)
Post shock                               .184*** (.007)           .256*** (.018)          27.439*** (.519)
Intercept                                .739*** (.005)           1.489*** (.013)         29.907*** (.367)
R squared                                .004                     .001                    .012
Mean Y                                   .81                      1.59                    42.69

Notes: N = 272,706. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table B4
ROBUSTNESS OF TABLE 3 RESULTS TO EXISTING SHOPPERS ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.025* (.01)             -.061* (.026)           -2.283** (.744)
Failure experiencer                      -.021** (.007)           -.029 (.018)            -.684 (.526)
Post shock                               .178*** (.007)           .233*** (.019)          27.191*** (.534)
Intercept                                .766*** (.005)           1.556*** (.013)         31.414*** (.378)
R squared                                .0039                    .0010                   .0181
Mean Y                                   .84                      1.64                    44.07

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001. DID = Difference-in-Differences.
Table B5
ROBUSTNESS OF TABLE 3 RESULTS TO ALTERNATIVE MEASURES OF DIGITAL CHANNEL USE BASED ON APP USE FREQUENCY BEFORE FAILURE

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.244*** (.024)          -.467*** (.064)         -22.284*** (1.837)
DID x Value of past purchases            .000*** (.000)           .000*** (.000)          .019*** (.001)
DID x Recency of purchases               -.003*** (.000)          -.007*** (.001)         -.137*** (.015)
DID x Past app use frequency             -.004 (.006)             -.034* (.017)           1.682*** (.476)
Value of past purchases                  .000*** (.000)           .001*** (.000)          .021*** (.000)
Recency of purchases                     .008*** (.000)           .016*** (.000)          .347*** (.007)
Past app use frequency                   .041*** (.001)           .080*** (.001)          1.535*** (.038)
Failure experiencer                      .088*** (.010)           .226*** (.027)          3.272*** (.788)
Post shock                               .182*** (.010)           .214*** (.027)          35.054*** (.766)
Intercept                                .616*** (.010)           1.059*** (.026)         22.264*** (.744)
R squared                                .2019                    .1508                   .1072
Mean Y                                   .84                      1.64                    44.07

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; each moderator interacts with the difference-in-differences (DID) term failure experiencer x post shock; *** p < .001. The observations include those of shoppers with at least one purchase in the past for computing recency.
Table B6
ROBUSTNESS OF TABLE 3 RESULTS TO REGRESSION DISCONTINUITY STYLE ANALYSIS

Variable                                   Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)     -.045** (.016)           -.09* (.04)             -3.169** (1.167)
Failure experiencer                        -.04** (.012)            -.071* (.029)           -1.261 (.825)
Post shock                                 .178*** (.015)           .231*** (.037)          26.385*** (1.075)
Intercept                                  .759*** (.011)           1.538*** (.026)         31.287*** (.760)
R squared                                  .0032                    .0008                   .0160
Mean Y                                     .80                      1.56                    42.07
Notes: N = 198,432. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
viii
WEB APPENDIX C
ROBUSTNESS CHECKS FOR TABLE 4 (BY CHANNEL) RESULTS
In this section, we present robustness checks for the by-channel estimation in Table 4 relating to (a) alternative models that include covariates or use a Poisson specification (Tables C1-C2), (b) outlier spenders (Table C3), (c) existing shoppers (Table C4), (d) alternative measures of prior digital channel use (Table C5), and (e) a regression discontinuity style analysis (Table C6).
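For reference, the baseline difference-in-differences specification underlying these tables can be written as follows; the notation here is ours (a sketch inferred from the table rows, not the paper's own equation):

```latex
Y_{it} = \beta_0 + \beta_1\,\mathrm{FailureExperiencer}_i
       + \beta_2\,\mathrm{PostShock}_t
       + \beta_3\,(\mathrm{FailureExperiencer}_i \times \mathrm{PostShock}_t)
       + \varepsilon_{it}
```

where $Y_{it}$ is the frequency, quantity, or value of purchases for shopper $i$ in period $t$, and $\beta_3$ is the DID coefficient reported in the first row of each table. The variants below add covariates or moderator interactions, or replace the linear model with a Poisson specification, while keeping this interaction structure.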
Table C1
ROBUSTNESS OF TABLE 4 RESULTS TO INCLUSION OF COVARIATES BY CHANNEL

                                           ----------------- Offline -----------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of          Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases         purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.023* (.010)    -.059* (.026)   -1.967** (.739)   -.002 (.002)    -.003 (.004)   -.125 (.165)
Failure experiencer                        -.015* (.007)    -.019 (.018)    -.431 (.523)      -.003* (.001)   -.006* (.003)  -.194 (.117)
Post shock                                 .171*** (.007)   .222*** (.019)  25.462*** (.532)  .009*** (.001)  .016*** (.003) 1.720*** (.119)
Gender                                     -.05*** (.011)   -.109*** (.028) -3.304*** (.784)  -.001 (.002)    -.002 (.004)   -.063 (.175)
Loyalty program                            -.165*** (.006)  -.404*** (.014) -8.306*** (.403)  -.006*** (.001) -.012*** (.002) -.427*** (.090)
Intercept                                  .775*** (.005)   1.620*** (.014) 32.260*** (.382)  .038*** (.001)  .058*** (.002) 1.640*** (.085)
R squared                                  .0112            .0079           .0208             .0006           .0005          .0018
Mean Y                                     .78              1.56            41.08             .04             .06            2.23
Notes: N = 273,378. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table C2
DID POISSON MODEL RESULTS BY CHANNEL

                                           ----------- Offline ------------     ------------ Online -------------
Variable                                   Frequency of      Quantity of        Frequency of      Quantity of
                                           purchases         purchases          purchases         purchases
Failure experiencer x Post shock (DID)     -.0209* (.0127)   -.0311*** (.016)   -.019 (.0482)     -.0184 (.0643)
Failure experiencer                        -.0253*** (.009)  -.0169*** (.0119)  -.0885** (.0358)  -.0995*** (.0469)
Post shock                                 .2133*** (.0091)  .1401*** (.0113)   .213*** (.0337)   .2453*** (.046)
Intercept                                  -.3363*** (.0065) .3851*** (.0083)   -3.3261*** (.025) -2.9257*** (.033)
Log pseudo-likelihood                      -368,831          -698,829           -46,197           -69,852
Mean Y                                     .78               1.56               .04               .06
Notes: Robust standard errors in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
Table C3
ROBUSTNESS OF TABLE 4 RESULTS TO OUTLIER SPENDERS

                                           ----------------- Offline -----------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of          Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases         purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.021* (.010)    -.053* (.024)   -2.068** (.701)   -.001 (.002)    -.002 (.004)   -.071 (.156)
Failure experiencer                        -.018** (.007)   -.026 (.017)    -.571 (.496)      -.003* (.001)   -.006* (.003)  -.171 (.110)
Post shock                                 .175*** (.007)   .24*** (.017)   25.727*** (.504)  .009*** (.001)  .016*** (.003) 1.712*** (.112)
Intercept                                  .704*** (.005)   1.437*** (.012) 28.479*** (.356)  .035*** (.001)  .052*** (.002) 1.428*** (.079)
R squared                                  .0042            .0012           .0180             .0000           .0000          .0020
Mean Y                                     .78              1.53            40.51             .04             .06            2.18
Notes: N = 272,706. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table C4
ROBUSTNESS OF TABLE 4 RESULTS TO EXISTING SHOPPERS BY CHANNEL

                                           ----------------- Offline -----------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of          Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases         purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.024* (.010)    -.059* (.025)   -2.166** (.721)   -.002 (.002)    -.003 (.004)   -.117 (.161)
Failure experiencer                        -.018* (.007)    -.024 (.018)    -.515 (.510)      -.003* (.001)   -.005 (.003)   -.169 (.114)
Post shock                                 .170*** (.007)   .218*** (.018)  25.499*** (.518)  .008*** (.001)  .015*** (.003) 1.693*** (.115)
Intercept                                  .730*** (.005)   1.501*** (.013) 29.882*** (.366)  .037*** (.001)  .055*** (.002) 1.532*** (.082)
R squared                                  .0037            .0009           .0169             .0003           .0002          .0016
Mean Y                                     .80              1.58            41.81             .04             .06            2.26
Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table C5
ROBUSTNESS OF TABLE 4 RESULTS TO ALTERNATIVE MEASURE OF DIGITAL CHANNEL USE BASED ON APP USAGE FREQUENCY BEFORE FAILURE

                                           ----------------- Offline ------------------    ------------------ Online ------------------
Variable                                   Frequency of     Quantity of     Value of           Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases          purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.216*** (.024)  -.415*** (.063) -19.428*** (1.784) -.029*** (.005) -.052*** (.010) -2.856*** (.425)
DID x Value of past purchases              .000*** (.000)   .000*** (.000)  .017*** (.001)     .000*** (.000)  .000*** (.000) .001*** (.000)
DID x Recency of purchases                 -.003*** (.000)  -.006*** (.001) -.118*** (.015)    .000*** (.000)  .000*** (.000) -.019*** (.004)
DID x Past app use frequency               -.008 (.006)     -.04* (.016)    1.146* (.462)      .005*** (.001)  .007* (.003)   .536*** (.110)
Value of past purchases                    .000*** (.000)   .001*** (.000)  .021*** (.000)     .000*** (.000)  .000*** (.000) .000*** (.000)
Recency of purchases                       .008*** (.000)   .016*** (.000)  .335*** (.007)     .000*** (.000)  .001*** (.000) .012*** (.002)
Past app use frequency                     .037*** (.000)   .073*** (.001)  1.358*** (.037)    .003*** (.000)  .007*** (.000) .177*** (.009)
Failure experiencer                        .087*** (.010)   .222*** (.027)  3.240*** (.765)    .001 (.002)     .004 (.004)    .032 (.182)
Post shock                                 .176*** (.010)   .202*** (.026)  32.959*** (.743)   .006** (.002)   .011* (.004)   2.096*** (.177)
Intercept                                  .583*** (.010)   1.028*** (.025) 21.406*** (.723)   .034*** (.002)  .031*** (.004) .859*** (.172)
R squared                                  .1944            .1450           .1026              .0151           .0146          .0080
Mean Y                                     .80              1.58            41.81              .04             .06            2.26
Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences. The observations include those of shoppers with at least one purchase in the past for computing recency.
Table C6
ROBUSTNESS OF TABLE 4 RESULTS TO REGRESSION DISCONTINUITY STYLE ANALYSIS

                                           ----------------- Offline ------------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of           Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases          purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.044** (.016)   -.087* (.040)   -3.105** (1.131)   -.001 (.003)    -.003 (.006)   -.064 (.254)
Failure experiencer                        -.034** (.011)   -.062* (.028)   -1.017 (.800)      -.006** (.002)  -.009* (.004)  -.244 (.179)
Post shock                                 .172*** (.015)   .218*** (.036)  24.835*** (1.042)  .006* (.003)    .013* (.005)   1.550*** (.234)
Intercept                                  .720*** (.010)   1.482*** (.026) 29.704*** (.736)   .039*** (.002)  .056*** (.004) 1.583*** (.165)
R squared                                  .0031            .0007           .0150              .0002           .0002          .0014
Mean Y                                     .76              1.51            39.94              .04             .05            2.12
Notes: N = 198,432. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
WEB APPENDIX D
OTHER ROBUSTNESS CHECKS

Table D1
EFFECTS OF APP FAILURE ON AVERAGE VALUE OF PURCHASES EACH WEEK

Variable            Estimate (Standard error)
Treat x Week -4     -.09 (.28)
Treat x Week -3     -.58* (.25)
Treat x Week -2     -.45 (.24)
Treat x Week -1     -.37 (.25)
Treat x Week 0      -.82* (.37)
Treat x Week 1      -1.70** (.56)
Treat x Week 2      -.58¹ (.31)
Treat x Week 3      -.53¹ (.31)
Treat x Week 4      -.64* (.29)
Intercept           13.19*** (.09)
Mean Y              15.92
Notes: Robust standard errors clustered by shoppers are in parentheses; week and individual fixed effects are included. *** p < .001, ** p < .01, * p < .05, ¹ p < .1. N = 1,366,890. Week 5 in the pre-failure period (Week -5) is the base week.
Table D2
RESULTS OF DID MODEL WITH STACKED ONLINE AND OFFLINE PURCHASES AND CHANNEL DUMMIES

Variable                                   Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)     -.001 (.002)             -.003 (.003)            -.093 (.154)
DID x Channel dummy                        -.021** (.008)           -.052** (.019)          -1.994** (.674)
Failure experiencer                        -.003* (.001)            -.005* (.002)           -.167** (.064)
Post shock                                 .009*** (.001)           .015*** (.002)          1.672*** (.113)
Channel dummy                              .678*** (.005)           1.416*** (.012)         27.755*** (.217)
Failure experiencer x Channel dummy        -.015* (.006)            -.020 (.017)            -.360 (.299)
Post shock x Channel dummy                 .161*** (.006)           .206*** (.014)          23.602*** (.493)
Intercept                                  .036*** (.001)           .054*** (.002)          1.500*** (.048)
R squared                                  .1398                    .0949                   .0911
Mean Y                                     .82                      1.61                    43.31
Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 546,756. DID = Difference-in-Differences. Channel dummy is 1 for offline purchases and 0 for online purchases.
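The stacked model in Table D2 corresponds to a triple-interaction specification with a channel dummy $D_c$ (1 = offline, 0 = online); the notation below is ours, a sketch inferred from the table rows rather than the paper's own equation:

```latex
Y_{ict} = \beta_0 + \beta_1\,\mathrm{FE}_i + \beta_2\,\mathrm{Post}_t
        + \beta_3\,(\mathrm{FE}_i \times \mathrm{Post}_t) + \beta_4\,D_c
        + \beta_5\,(\mathrm{FE}_i \times D_c) + \beta_6\,(\mathrm{Post}_t \times D_c)
        + \beta_7\,(\mathrm{FE}_i \times \mathrm{Post}_t \times D_c) + \varepsilon_{ict}
```

Under this mapping, $\beta_3$ (the DID row) gives the online effect of app failure and $\beta_3 + \beta_7$ (DID plus DID x Channel dummy) gives the offline effect, consistent with the insignificant DID estimates and the negative, significant DID x Channel dummy estimates in Table D2.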
Figure D1. PRE-PERIOD PURCHASE TRENDS FOR FAILURE EXPERIENCERS AND NON-EXPERIENCERS
(a) Past Frequency of Purchases
(b) Past Quantity of Purchases
(c) Past Proportion of Online Purchases
Note: The unit of the X axis is the number of days before the failure event.