HOW DOES MOBILE APP FAILURE AFFECT PURCHASES IN ONLINE AND OFFLINE CHANNELS?
Unnati Narang Venkatesh Shankar Sridhar Narayanan
December 2020
* Unnati Narang ([email protected]) is Assistant Professor of Marketing, University of Illinois, Urbana-Champaign; Venkatesh Shankar ([email protected]) is Professor of Marketing and Coleman Chair in Marketing and Director of Research, Center for Retailing Studies at the Mays Business School, Texas A&M University; and Sridhar Narayanan ([email protected]) is Associate Professor of Marketing at the Graduate School of Business, Stanford University. We thank the participants at the ISMS Marketing Science conference, the UTDFORMS conference, and research seminar participants at the University of California, Davis, the University of Toronto, the University of Illinois, Urbana-Champaign, and the University of Texas at Austin for valuable comments.
Abstract Mobile devices account for a majority of transactions between shoppers and marketers. Branded retailer mobile apps have been shown to significantly increase purchases across channels. However, app service failures can lead to decreases in app usage, making app failure prevention and recovery critical for retailers. Does an app failure influence purchases in general and within the online channel in particular? Does it have any spillover effects across other channels? What potential mechanisms explain and what factors moderate these effects? We examine these questions empirically, employing a unique dataset from an omnichannel retailer. We leverage a natural experiment of exogenous systemwide failure shocks in this retailer’s mobile app and related data to examine the causal impact of app failures on purchases in all channels using a difference-in-differences approach. We investigate two potential mechanisms behind these effects – channel substitution and brand preference dilution. We also analyze shopper heterogeneity in the effects using a theoretically-driven moderator approach as well as a data-driven machine learning method. Our analysis reveals that although an app failure has a significant overall negative effect on shoppers’ frequency, quantity, and monetary value of purchases across channels, the effects are heterogeneous across channels and shoppers. Interestingly, the decreases in purchases across channels are driven by purchase reductions in brick-and-mortar stores and not in the online channel. A significant decrease in app engagement post failure explains the overall drop in purchases. Brand preference dilution after app failure explains the fall in store purchases, while channel substitution post failure explains the preservation of purchases in the online channel. Surprisingly, purchases rise for a small group of shoppers who were close to the retailer’s store at the time of app failure. 
Furthermore, shoppers with a higher monetary value of past purchases and less recent purchases are less sensitive to app failures. The results suggest that app failures lead to an annual revenue loss of about $2.4-$3.4 million for the retailer in our data. About 47% of shoppers contribute about 70% of the loss. We outline targeted failure prevention and service recovery strategies that retailers could employ.
Keywords: service failure, mobile marketing, mobile app, retailing, omnichannel, difference-in-differences, natural experiment, causal effects
INTRODUCTION
Mobile commerce has seen tremendous growth with mobile devices accounting for a majority of
interactions between shoppers and marketers. This growth has accelerated through the rapid
increase of smartphone penetration – about 3.2 billion people (41.5% of the global population)
used smartphones in 2019.1 Mobile applications (henceforth, apps) have emerged as an important
channel for retailers as they have been found to increase engagement and purchases across
channels (e.g., Kim et al. 2015; Narang and Shankar 2019; Xu et al. 2016).
While retailers have widely embraced mobile apps, there is little understanding about how
service failures in this channel affect shopper behavior. This issue is important because unlike
other channels, the mobile channel is highly vulnerable to failures. The diversity of mobile
operating systems (e.g., iOS, Android), devices (e.g., mobile phone and tablet), and versions of
hardware and software and their constant use across a variety of mobile networks often result in
app failures. Failures in a retailer’s mobile app have the potential to negatively affect shoppers’
engagement with the app and their shopping outcomes within the mobile channel. In addition,
app failures may have spillover effects across other channels due to both substitution of
purchases across channels and dilution of preference for the retailer brand. Understanding how
and why failures impact shoppers’ behavior across channels is important for retailers.
Preventing and recovering from app failures is critical for managers because more than 60%
of shoppers abandon an app after experiencing failure(s) (Dimensional Research 2015). In 2016,
app crashes were the leading cause of system failures, contributing 65% to all iOS failures
(Blancco 2016). About 2.6% of all app sessions result in a crash, suggesting about 1.5 billion app
failures across 60 billion app sessions annually (Computerworld 2014). Given the extent of these
1 Source: Statista report on smartphone penetration (https://tinyurl.com/hy2skfk) last accessed 18 November 2020.
app failures and their potential damage to firms’ relationships with customers, determining the
impact of app failures is important for formulating preventive and recovery strategies.
Despite the importance of app failures, not much is known about their impact on purchases.
While app crashes in a shopper’s mobile device have been shown to negatively influence app
engagement (e.g., restart time, browsing duration, and activity level, Shi et al. 2017), the
relationship between app failures and subsequent purchases has not been studied. Furthermore, a
large proportion of shoppers use both online (desktop website, mobile website, and mobile app)
and offline (brick-and-mortar) retail channels. However, we do not know much about the impact
of app failures on shopping outcomes across channels (spillover effects).
From a theoretical standpoint, the potential mechanisms behind such effects within and
across channels are important. How much of these effects arise due to channel substitution post
failure? What portion of the effects can be attributed to dilution of preference for the retailer’s
brand? Prior research has not addressed these interesting questions.
The effects of app failure may also differ across shoppers. Shoppers may be more or less
negatively impacted by failures depending on factors such as shoppers’ relationship with the firm
(Chandrashekaran et al. 2007; Goodman et al. 1995, Hess et al. 2003; Knox and van Oest 2014;
Ma et al. 2015) and shoppers’ prior use of the firm’s digital channels (Cleeren et al. 2013; Liu
and Shankar 2015; Shi et al. 2017). It is important for managers to better understand how the
effects of failure vary across shoppers so that they can devise targeted preventive and recovery
strategies. Yet not much is known about heterogeneity in the effects of app failure.
Our study fills these crucial gaps in the literature. We quantify and explain the impact of app
failures on managerially important outcomes, such as the frequency, quantity, and monetary
value of purchases in online and offline channels. We address four research questions:
• What are the effects of a service failure in a retailer’s mobile app on the frequency, quantity, and monetary value of subsequent purchases by the shoppers?
• What are the effects of a service failure in an app on purchases in the online and offline channels?
• What potential mechanisms explain the effects of an app service failure on purchases?
• How do these effects vary across shoppers, or what factors moderate these effects?
Estimation of the causal effects of app failures on shopping outcomes is challenging. It is
typically hard to do this using observational data due to the potential endogeneity of app failures.
This endogeneity may stem from an activity bias in that shoppers who use the app more
frequently are also more likely to experience failures than other shoppers. Therefore, failure-
experiencing shoppers may differ systematically from non-failure experiencers in their shopping
behavior, leading to potentially spurious correlations between failures and shopping behavior.
Panel data may not necessarily mitigate this issue because time-varying app usage/shopping
activity is potentially correlated with time-varying app failures for the same reason. That is,
shoppers are likely to engage more with the app when they are likely to purchase, potentially
leading to more failures than in periods when shoppers engage less with the app. Additionally,
the nature of activity on the app may be correlated with failures. For instance, a negative
correlation between failures and purchases may result from a greater incidence of failures on the
app’s purchase page than on other pages. Thus, it is hard to make the case that correlations
between app failures and shopping outcomes in observational data have a causal interpretation.
The gold standard among the methods available to uncover the causal impact of service
failures is a randomized field experiment. However, such an experiment would be impractical in
this context because a retailer will unlikely deliberately induce failures in an app even for a small
subset of its shoppers for ethical reasons. Alternatively, we can use an instrumental variable
approach to control for endogeneity. However, it is hard to come up with instrumental variables
that are valid and exhibit sufficient variation to address the endogeneity concerns in this context.
We overcome the estimation challenges and mitigate the potential endogeneity of app
failures using the novel features of a unique dataset from a large omnichannel retailer of video
games, consumer electronics and wireless service. We exploit a natural experiment of server
error-induced systemwide exogenous failures in the retailer’s mobile app to estimate the causal
effects of app failure. Conditional on signing in on the day of the failure, whether a user
experienced a failure or not was a function of whether they attempted to use the app during the
time window of the failure, which they could not have anticipated in advance. We take
advantage of the resulting quasi-randomness in incidences of failures to estimate the causal
effects of failures on the mobile app. We employ a difference-in-differences (DID) approach that
compares the pre- and post- failure outcomes for the failure experiencers with those of failure
non-experiencers to estimate the effects of the app failure. Through a series of robustness checks,
we confirm that failure non-experiencers act as a valid control for failure experiencers, providing
us the exogenous variation to find causal answers to our research questions.
We investigate the potential mechanisms and moderators of the effects of failures on
shopping behavior by exploiting the panel nature of our dataset. We test for the moderating
effects of factors such as relationship with the firm and prior digital channel use on the effects of
service failures. These factors have been explored for services in general (e.g., Hansen et al.
2018; Ma et al. 2015) but not in the digital or mobile app contexts. In addition, we recover the
heterogeneity of effects at the individual level using data-driven machine learning methods.
Our results show that app failures have a significant overall negative effect on shoppers’
frequency, quantity, and monetary value of purchases across channels, but the effects are
heterogeneous across channels and shoppers. A significant decrease in app engagement (e.g.,
number of app sessions, dwell time, and number of app features used) post failure explains the
overall drop in purchases. Interestingly, the overall decreases in purchases across channels are
driven by purchase reductions in stores, rather than in the online channel. The fall in store
purchases after app failure is consistent with brand preference dilution, while the preservation of
purchases in the online channel is consistent with channel substitution. Shoppers experiencing
the failure when they are farther away from purchase (e.g., browsing product information)
experience greater negative effects of a failure than those closer to purchase (e.g., checking out
in the app). Surprisingly, the basket size and value of purchases rise for a small group of
shoppers who were close to the retailer’s store at the time of app failure. Furthermore, shoppers
with a higher monetary value of past purchases and less recent purchases are less sensitive to app
failures. Finally, most shoppers (96%) react negatively to failures, but about 47% of these
shoppers contribute to about 70% of the losses in annual revenues that amount to $2.4-$3.4
million.
In the remainder of the paper, we first discuss the literature related to service failures, cross-
channel spillovers, and consumer interaction with mobile apps. Next, we discuss the data in
detail, summarizing them and highlighting their unique features. Subsequently, we describe our
empirical strategy, lay out and test the key identification strategy, and conduct our empirical
analysis of the effects of app failures. We explore the potential mechanisms behind the results.
We then conduct robustness checks to rule out alternative explanations. We conclude by
discussing the implications of our results for managers.
BACKGROUND AND RELATED LITERATURE
Services Marketing and Service Failures
The nature of services has evolved considerably since academics first started to study services
marketing. For long, the production and consumption of services remained inseparable primarily
because services were performed by humans. However, of late, technology-enabled services
have risen in importance, leading to two important shifts (Dotzel et al. 2013). First, services that
can be delivered without human or interpersonal interaction have grown tremendously. Online
and mobile retailing no longer require shoppers to interact with human associates to make
purchases. Second, closely related to this idea is the fact that services are increasingly powered
by technologies such as mobile apps that allow anytime-anywhere access and convenience.
With growing reliance on technologies for service delivery and the complexity of the
technology environment in which these services are delivered, service failures are attracting
greater attention. A service failure can be defined as service performance that falls below
customer expectations (Hoffman and Bateson 1997). Service failures are widespread and are
expensive to mend. Service failures resulting from deviations between expected and actual
performance damage customer satisfaction and brand preference (Smith and Bolton 1998). Post-
failure satisfaction tends to be lower even after a successful recovery and is further negatively
impacted by the severity of the initial failure (Andreassen 1999; McCollough et al. 2000). In
interpersonal service encounters, human interactions and employee behaviors influence both
failure effect and recovery (Bitner et al. 1990; Meuter et al. 2000). In technology-based
encounters, such as those in e-tailing and with self-service technologies (e.g., automated teller
machines [ATMs]), the opportunity for human interaction is typically small after experiencing
failure (Forbes et al. 2005; Forbes 2008). However, there may be significant heterogeneity in
how consumers react to service failures (Halbheer et al. 2018).
In the mobile context, specifically for mobile apps, it is difficult to predict the direction and
extent of the impact of a service failure on shopping outcomes. First, mobile apps are accessible
at any time and in any location through an individual’s mobile device. On the one hand, because
a shopper can tap, interact, engage, or transact multiple times at little additional cost on a mobile
app, the shopper may treat any one service failure as acceptable without significantly altering her
subsequent shopping outcomes. Such an experience differs from that with a self-service
technological device such as an ATM, which may need the shopper to travel to a specific
location or incur other hassle costs that may not exist in the mobile app context. On the other
hand, the costs of switching to a competitor are also much lower in the mobile app context,
where a typical shopper uses and compares multiple apps. Thus, a service failure in any one app
may aggravate the shopper’s frustration with the app, leading to strong negative effects on
outcomes such as purchases from the relevant app provider.
Second, a mobile app is one of the many touchpoints available to shoppers in today’s
omnichannel shopping environment. Thus, a shopper who experiences a failure in the app could
move to the web-based channel or even the offline or store channel. In such cases, the impact of
a failure on the app could be zero or even positive (if the switch to the other channel leads to
greater engagement of the shopper with the retailer). By contrast, if the channels act as
complements (e.g., if the shopper uses one channel for researching products and another for
purchasing) or if the failure impacts the preference for retailer brand, a failure in one channel
could impede the shopper’s engagement in other channels. Thus, it is difficult to predict the
effects of app failure, in particular, about how they might spill over to other channels.
Channel Choice and Channel Migration
A shopper’s experience in one channel can influence her behavior in other channels. Prior
research on cross-channel effects is mixed, showing both substitution and complementarity
effects, leading to positive and negative synergies between channels (e.g., Avery et al. 2012;
Pauwels and Neslin 2015). The relative benefits of channels determine whether shoppers
continue using existing channels or switch to a new channel (Ansari et al. 2008; Chintagunta et
al. 2012). When a bricks-and-clicks retailer opens an offline store or an online-first retailer opens
an offline showroom, its offline presence drives sales in online stores (Bell et al. 2018; Wang and
Goldfarb 2017).2 This is particularly true for shoppers in areas with low brand presence prior to
store opening and for shoppers with an acute need for the product. However, the local shoppers
may switch from purchasing online to offline after an offline store opens, even becoming less
sensitive to online discounts (Forman et al. 2009). In the long run, the store channel shares a
complementary relationship with the Internet and catalog channels (Avery et al. 2012).
While the relative benefits of one channel may lead shoppers to buy more in other channels,
the costs associated with one channel may also have implications for purchases beyond that
channel. In a truly integrated omnichannel retailing environment, the distinctions between
physical and online channels blur, with the online channel representing a showroom without
walls (Brynjolfsson et al. 2013). Mobile technologies are at the forefront of these shifts. More
than 80% of shoppers use a mobile device while shopping even inside a store (Google M/A/R/C
Study 2013). As a result, if there are substantial costs associated with using a mobile channel
(e.g., those induced by app failures), such costs may spill over to other channels. If shoppers use
the different channels in complementary ways, the disruption of one of those channels could
negatively impact their engagement with the other channels as well. However, if shoppers treat
the channels as substitutes, failures in one channel may drive the shoppers to purchase in another
channel. If an app failure dilutes shoppers’ preference for the retailer brand, it may lead to
negative consequences across channels. Overall, the direction of the effect of app failures on
2 A bricks-and-clicks retailer is a retailer with both offline (“bricks”) and online (“clicks”) presence.
outcomes in other channels such as in brick-and-mortar stores and online channels depends on
which of these competing and potentially co-existing mechanisms is dominant.
Mobile Apps
The nascent but evolving research in mobile apps shows positive effects of mobile app channel
introduction and use on engagement and purchases in other channels (Kim et al. 2015; Narang
and Shankar 2019; Xu et al. 2016) and for coupon redemptions (Andrews et al. 2015; Fong et al.
2015; Ghose et al. 2019) under different contingencies.
To our knowledge, only one study has examined crashes in a mobile app on shoppers’ app
use. Shi et al. (2017) find that while crashes have a negative impact on future engagement with
the app, this effect is lower for those with greater prior usage experience and for less persistent
crashes. However, while they look at subsequent engagement of the shoppers with the mobile
app, they do not examine purchases. Thus, our research adds to Shi et al. (2017) in several ways.
First, we focus on estimating the causal effects of failure. To this end, we exploit the random
variation in failures induced by systemwide failures. Second, we quantify the value of app
failure’s effects on subsequent purchases. The outcomes we study include the frequency,
quantity, and value of purchases, while the key outcome in that study is app engagement. Third,
we examine the cross-channel effects of mobile app failures, including in physical stores, while
Shi et al. (2017) study subsequent engagement with the app provider only within the app.
Finally, we explore the mechanisms behind the effects of failure, and examine the moderating
effects of relationship with the retailer and prior digital channel use, as well as heterogeneity in
shoppers' sensitivity to failures, using a machine learning approach.
To summarize, our study (1) focuses on the effect of app failure on purchases, (2) quantifies
the effects on multiple outcomes such as frequency, quantity, and monetary value of purchases,
(3) addresses the outcomes in each channel and across all channels (substitution and
complementary effects), and (4) uncovers the mechanisms behind and moderators of the effects
of app failure on shopping outcomes and heterogeneity in effects across shoppers. All these
characteristics are novel, contributing to the research streams on service marketing, channel
choice, and mobile apps.
RESEARCH SETTING AND DATA
Research Setting
We obtained the dataset for our empirical analysis from a large U.S.-based retailer. In the
following paragraphs, we describe the retailer, the mobile app, and the channel sales mix.
The retailer sells a variety of products, including software such as video games and hardware
such as video game consoles and controllers, downloadable content, consumer electronics, and
wireless services, and serves 32 million customers. The gaming industry is large ($99.6 billion in
annual revenues), and the retailer is a major player in this industry, offering us a rich setting. The
retailer has a large offline presence, and in this respect, is similar to Walmart, PetSmart, or any
other brick-and-mortar chain with an omnichannel strategy. The retailer’s primary channel is its
store network comprising 4,175 brick-and-mortar stores across the U.S. Additionally, it has a
large ecommerce website, and the mobile app that is the focus of our study.
The app allows shoppers to browse the retailer’s product catalog, get deals, order online
through a mobile browser, locate nearby stores, as well as make purchases through the app itself.
The app is typical of mobile apps of large retailers (e.g., PetSmart, Costco) in features and
consumer interactions. The growth in the adoption of the app has also been similar to that of
many large retailers. App adoption rate started small and grew over time. Figure 1 shows some
screenshots from the app.
Figure 1
APP SCREENSHOTS
The online and offline channel sales mix of the retailer in our data is typical of most large
retailers. About 76% of the total sales for the top 100 largest retailers in the U.S. are from similar
retailers with a store network of 1,000 or more stores (National Retail Federation 2018). Most
large retailers have a predominant brick-and-mortar presence. For these retailers, while most of
the transactions and revenues come from the offline channels, online sales exhibit rapid growth.
For example, Walmart’s online revenues constitute 3.8% of all revenues, 1.3% of all PetSmart’s
sales come from the online channel, Home Depot generates 6.8% of all revenues from
ecommerce, and 5.4% of Target’s sales are through the online channel.3 For the retailer in our
data, online sales comprised 10.2% of overall revenues, somewhat higher than that for similar
large retailers. Furthermore, about 26% of the shoppers bought online in the 12 months before
the failure event we study. The retailer’s online sales displayed a 13% annual average growth in
the last five years, similar to these retailers who also exhibited double digit growth (Barron’s
3 Source: eMarketer Retail, https://retail-index.emarketer.com/
2018). Its annual online sales revenues are also substantial at $1.1 billion. Therefore, our
research context offers a rich setting to examine cross-channel effects of a mobile app failure.
Data and Sample
We study the impact of a systemwide failure that occurred on April 11, 2018.4 The firm provided
us with mobile app use data and transactional data across all channels for all the app users who
logged into the app on the failure day. The online channel represents purchases at the retailer’s
website, including those using the mobile browser. Nested within the app use data are data on
events that shoppers experience, along with their timestamps. The mobile dataset recorded the
app failure event as ‘server error.’ Thus, this event represents an exogenous app breakdown, and
the data allow us to identify shoppers who logged in to experience the systemwide app failure.
Table 1 provides the descriptive statistics for the variables of interest. Over a period of 14
days pre- and post- failure, shoppers make an average of a little less than one purchase
comprising about 1.6 items for a value of about $43. In the 12 months before failure, shoppers
make purchases worth $623 and on average, buy .66 times in the online channel. Overall, 52% of
the shoppers experience the failure during our focal failure event.
Table 1
SUMMARY STATISTICS

Variable                                Mean      Std. dev.
Frequency of purchases                   .82        1.34
Quantity of purchases                   1.61        3.32
Value of purchases ($)                 43.31       96.42
App failure/Failure experiencer          .52         .50
Recency of past purchases (in days)   -45.68       68.83
Value of past purchases ($)           629.60      699.38
Frequency of past online purchases       .66        1.97

Notes: These statistics are computed over the 14 days pre- and post-failure. Past purchases are computed over a one-year period. N = 273,378.
4 We verified that this failure was systemwide and exogenous through our conversations with company executives.
EMPIRICAL STRATEGY
Overall Empirical Strategy
As outlined earlier, we leverage the exogenous systemwide shock to estimate the causal effect of
app failure on shopping outcomes. The main idea behind our empirical approach is that
conditional on the attempted usage of the app on the day of the failure, the experience of a failure
by a specific shopper is random. We examine this assumption in the data by testing for balance
between shoppers who experience a failure and those who do not, using a set of pre-failure
variables. We find no systematic difference in these variables between shoppers who
experienced failures and those who did not, supporting our identification strategy. To determine
the treatment effect of a failure, we conduct a DID analysis, comparing the post-failure behaviors
with the pre-failure behaviors of shoppers who logged in on the day of the failure and
experienced it (akin to a treatment group) relative to those who logged in on that day but did not
experience the failure (akin to a control group).
To analyze the treatment effects within and across channels, we repeat this analysis with the
same outcome variables separately for the offline and online channel. To understand the
underlying mechanisms for the effects, we examine two explanations, brand preference dilution
and channel substitution, using the data on shoppers’ app engagement, closeness to purchase,
location at the time of failure, time to next purchase, and shipping costs to check for consistency
with these mechanisms. To analyze heterogeneity in treatment effects, we first perform a
moderator analysis using a priori factors identified in the literature such as prior relationship
strength and digital channel use, followed by a data driven machine learning (causal forest)
approach to fully explore all sources of heterogeneity across shoppers. Finally, we carry out
multiple robustness checks.
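The moderator analysis can be read as running the same two-by-two DID comparison separately within subgroups defined by a moderator (e.g., high vs. low past purchase value) and comparing the resulting effects. A minimal standard-library sketch, using invented numbers rather than the paper's data:

```python
from statistics import mean
from collections import defaultdict

# Each record: (failure_experiencer, post_period, segment, purchase_value).
# Segments split shoppers by a hypothetical moderator, e.g. past purchase value.
records = [
    (1, 0, "high_value", 60), (1, 1, "high_value", 58),
    (0, 0, "high_value", 59), (0, 1, "high_value", 60),
    (1, 0, "low_value", 40),  (1, 1, "low_value", 28),
    (0, 0, "low_value", 41),  (0, 1, "low_value", 42),
]

def did_by_segment(recs):
    """Compute the 2x2 DID separately within each moderator segment."""
    cells = defaultdict(list)
    for f, p, seg, y in recs:
        cells[(seg, f, p)].append(y)
    effects = {}
    for seg in sorted({seg for seg, _, _ in cells}):
        treated_change = mean(cells[(seg, 1, 1)]) - mean(cells[(seg, 1, 0)])
        control_change = mean(cells[(seg, 0, 1)]) - mean(cells[(seg, 0, 0)])
        effects[seg] = treated_change - control_change
    return effects

print(did_by_segment(records))  # high-value: -3, low-value: -13
```

In the invented numbers above, the negative effect is smaller for the high-value segment, mirroring the pattern described later in the paper that shoppers with a higher monetary value of past purchases are less sensitive to failures.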
Exogeneity of Failure Shock
To verify that there is no systematic difference between shoppers who experience the failure
shock and those who do not, we examine two types of evidence. First, we present plots of the
behavioral trends in shopping for both failure-experiencers and non-experiencers for the failure
shock in the 14 days before the app failure. Figure 2 depicts the monetary value of daily
purchases by those who experienced the failure and those who did not. The purchase trends in
the pre-period are parallel for the two groups (p > .10), providing us assurance that these
shoppers do not systematically differ across the two groups. The trends are similar for the
frequency and quantity of purchases, and the proportion of online purchases (see Web Appendix
Figure D1).
Figure 2 COMPARISON OF FAILURE-EXPERIENCERS’ AND NON-EXPERIENCERS’ PURCHASES 14 DAYS
BEFORE FAILURE
Note: The red line represents failure experiencers, while the solid black line represents the failure non-experiencers.
Second, we compare the failure experiencers with non-experiencers across shopping
behaviors, such as recency of purchases and frequency of past online purchases (see Figure 3)
and past app usage sessions (see Figures 4 and 5). We also compare their observed demographic
variables, such as gender and membership in loyalty program. We do not find any significant
differences in these variables across the groups.
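Balance comparisons of this kind are commonly summarized with standardized mean differences (SMD) between the treated and control groups, where values near zero indicate balance. A minimal sketch using only Python's standard library, on hypothetical numbers (not the paper's data):

```python
from statistics import mean, stdev

def std_mean_diff(treated, control):
    """Standardized mean difference: (mean_t - mean_c) / pooled SD.
    |SMD| < 0.1 is a common rule of thumb for acceptable balance."""
    pooled_sd = ((stdev(treated) ** 2 + stdev(control) ** 2) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical pre-failure online purchase counts for two balanced groups.
treated = [0, 1, 0, 2, 1, 0, 1, 1]
control = [1, 0, 0, 2, 1, 1, 0, 1]
print(round(std_mean_diff(treated, control), 3))  # → 0.0, i.e., balanced
```

The same statistic computed on a clearly shifted pair of samples (e.g., one group purchasing a whole unit more on average) would return a large SMD, flagging imbalance.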
Figure 3 COMPARISON OF FAILURE-EXPERIENCERS AND NON-EXPERIENCERS
Note: Loyalty program level represents whether shoppers were enrolled (=1) or not (=0) in an advanced reward program.
Figure 4 PAST DAILY AVERAGE APP SESSION TRENDS OF FAILURE EXPERIENCERS VS. NON-
EXPERIENCERS
Figure 5
PAST DAILY AVERAGE NON-PURCHASE RELATED APP SESSION TRENDS OF FAILURE EXPERIENCERS VS. NON-EXPERIENCERS
Note: Non-purchase-related app sessions involve browsing pages whose actions are farther from purchase, such as browsing products or obtaining store-related information.

[Figure 3 values, treated vs. control: Gender (female) 35.17% vs. 34.49%; Loyalty program 86.49% vs. 84.39%; Recency of past purchase/30: 1.53 vs. 1.51; Past online purchase frequency: .62 vs. .70.]

To summarize, we find no systematic differences between the failure experiencers and those who
do not experience failures in either their trends of outcomes before the failure event, or in other
variables that we observe prior to the event. This pre-trend analysis gives us confidence in the
validity of our empirical strategy.
Econometric Model and Identification
As described in the previous section, we estimate the effects of app failure on shopping outcomes
by relying on a quasi-experimental research design with a DID approach (e.g., Angrist and
Pischke 2009). Specifically, we leverage a systemwide failure shock and compare app users who
experience this shock with those who do not, given that they accessed the app on the day of the
failure.
Our two-period linear DID regression takes the following form:
(1) Y_it = α_0 + α_1 F_i + α_2 P_t + α_3 F_i P_t + ϑ_it

where i indexes shoppers, t indexes the time period (pre- or post-failure), Y is the outcome variable (frequency, quantity, monetary value), F is a dummy variable denoting treatment (1 if shopper i experienced the app failure and 0 otherwise), P is a dummy variable denoting the period (1 for the period after the systemwide app failure and 0 otherwise), α_0 through α_3 are coefficients, and ϑ is an error term. We cluster standard errors at the shopper level, following Bertrand et al. (2004). The coefficient of F_i P_t, i.e., α_3, is the treatment effect of the app failure.5
The assumptions underlying the identification of this treatment effect are: (1) the failure is
random conditional on a shopper logging into the app during the time window of the failure
shock and (2) the change in outcomes for the non-failure experiencing app users is a valid
counterfactual for the change in outcomes that would have been observed for failure-
experiencing app users in the absence of the failure.
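In this two-period, two-group setting, the coefficients of equation (1) can be recovered directly from the four cell means; a minimal Python sketch (the means plugged in are the rounded value-of-purchases cell means reported in Table 2, so the interaction term only approximately matches the regression estimate):

```python
def did_from_means(t_pre, t_post, c_pre, c_post):
    """Two-period DID: the interaction coefficient alpha_3 equals the
    treated group's pre/post change minus the control group's change."""
    return (t_post - t_pre) - (c_post - c_pre)

def did_coefficients(t_pre, t_post, c_pre, c_post):
    """Map the four cell means to the coefficients of equation (1)."""
    return {
        "alpha_0 (intercept)": c_pre,             # control mean, pre-period
        "alpha_1 (F)": t_pre - c_pre,             # treated/control gap, pre
        "alpha_2 (P)": c_post - c_pre,            # control pre/post change
        "alpha_3 (FxP)": did_from_means(t_pre, t_post, c_pre, c_post),
    }

# Rounded value-of-purchases cell means from Table 2 ($):
effect = did_from_means(t_pre=30.41, t_post=55.28, c_pre=30.75, c_post=57.70)
print(round(effect, 2))  # -2.08, close to the -2.18 regression estimate
```

The small gap between this back-of-the-envelope figure and the regression coefficient reflects rounding of the reported means.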
5 Because we analyze the short-term effect of a service failure (14 and 30 days), we do not have an adequate number of observations per shopper post failure for us to estimate shopper fixed effects in our analysis.
EMPIRICAL ANALYSIS RESULTS
Relationship between App Failures and Purchases

We first examine the overall differences in post-failure behaviors between shoppers who
experienced failures and those who did not using model-free evidence 14 days pre and post
failure. We choose a 14-day window because this two-week period is close to the mean interpurchase time of 11 days in our dataset and equally includes any “day of the week” effects in shopping.6
Table 2 reports the raw comparisons of post-failure vs. pre-failure purchase outcome
variables for both failure experiencers (70,568 treated) and non-experiencers (66,121 control)
among the set of consumers who accessed the app on the day of the failure. We find that post-
failure, shoppers who experienced the systemwide failure had .04 (p < .001) lower purchase
frequency, .07 (p < .001) lower purchase quantity, and $2.42 (p < .001) lower monetary value
than shoppers who did not experience the failure. A simple comparison of shopping outcomes
across the two groups shows that the average monetary value of purchases increased by 81.8%
($30.41 to $55.28) for failure-experiencers, while it increased by 87.6% ($30.75 to $57.70) for
non-failure experiencers post failure relative to the pre period (p < .001).7 Given our
identification strategy, the diminished growth in the monetary value of purchases for failure
experiencers relative to non-experiencers comes from the exogenous failure shock.
6 We also estimated a model with dynamic treatment effects for a longer period of four weeks pre- and post- the failure shock and found similar effects (see Figure 6 and Table D1). 7 Increasing sales trend between the pre- and post- period for both the groups is partially due to the April 19 weekend in the post period that witnessed the release of a new game.
Table 2
MODEL-FREE EVIDENCE: MEANS OF OUTCOME VARIABLES FOR TREATED AND CONTROL GROUPS

Variable                            Treated      Treated      Control      Control
                                    pre period   post period  pre period   post period
Frequency of purchases                .74          .89          .75          .93
Quantity of purchases                1.52         1.69         1.52         1.76
Value of purchases ($)              30.41        55.28        30.75        57.70
Frequency of purchases – Online       .03          .04          .03          .04
Quantity of purchases – Online        .05          .06          .05          .07
Value of purchases – Online ($)      1.34         2.93         1.50         3.17
Frequency of purchases – Offline      .70          .85          .71          .88
Quantity of purchases – Offline      1.47         1.63         1.47         1.69
Value of purchases – Offline ($)    29.07        52.35        29.25        54.53

Notes: These statistics are based on the 14 days pre- and post-failure. N = 273,378.
Main Diff-in-Diff Model Results
The results from the DID model in Table 3 show a negative and significant effect of app failure on the frequency (α_3 = -.024, p < .01), quantity (α_3 = -.057, p < .01), and monetary value of purchases (α_3 = -2.181, p < .01) across channels. Relative to the pre-period for the control group, the treated group experiences a decline in frequency of 3.20% (p < .01), quantity of 3.74% (p < .01), and monetary value of 7.1% (p < .01).8
Table 3
DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES ACROSS CHANNELS

Variable                                 Frequency of      Quantity of       Value of
                                         purchases         purchases         purchases
Failure experiencer x Post shock (DID)   -.024** (.008)    -.057** (.020)    -2.181** (.681)
Failure experiencer                      -.021** (.007)    -.030 (.018)      -.694* (.302)
Post shock                               .178*** (.006)    .236*** (.014)    26.947*** (.497)
Intercept                                .750*** (.005)    1.523*** (.012)   30.755*** (.219)
R squared                                .004              .001              .018
Effect size                              -3.20%            -3.74%            -7.09%
Mean Y                                   .82               1.61              43.31

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
8 We calculate the percentage change by dividing the treatment coefficient by the intercept. For instance, the treatment coefficient for value of purchases (2.18) divided by intercept (30.76) amounts to a 7.1% change.
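The effect sizes reported with the DID estimates follow this footnote's rule (treatment coefficient divided by the intercept, i.e., the control group's pre-period mean); a quick Python check using the reported coefficients:

```python
def effect_size_pct(did_coef, intercept):
    """Treatment effect as a percentage of the control group's
    pre-period mean (the intercept), per footnote 8."""
    return 100 * did_coef / intercept

# Reported DID coefficients and intercepts for the three outcomes:
outcomes = {
    "frequency": (-0.024, 0.750),   # Table 3 reports -3.20%
    "quantity":  (-0.057, 1.523),   # Table 3 reports -3.74%
    "value":     (-2.181, 30.755),  # Table 3 reports -7.09%
}
for name, (coef, intercept) in outcomes.items():
    print(name, round(effect_size_pct(coef, intercept), 2))
```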
Next, we examine the channel spillover effects of app failures in greater depth. We split the
total value of purchases into offline and online purchases. Table 4 reports the results for these
alternative channel-based dependent variables. There is a negative and significant effect of app
failure on the frequency (α_3 = -.02, p < .01), quantity (α_3 = -.05, p < .01), and monetary value of purchases (α_3 = -2.09, p < .01) in the offline channel. Interestingly, we do not find a significant
(p > .10) effect of app failure on any of the purchase outcomes in the online channel. Because
there is no corresponding increase in the online channel and because the overall purchases drop,
we conclude that the decreases in overall purchases across channels are largely due to declines in
in-store purchases.
Table 4
DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES BY CHANNEL

                                         Offline                                               Online
Variable                                 Frequency of     Quantity of     Value of             Frequency of    Quantity of     Value of
                                         purchases        purchases       purchases            purchases       purchases       purchases
Failure experiencer x Post shock (DID)   -.022** (.008)   -.055** (.019)  -2.088** (.660)      -.002 (.002)    -.002 (.003)    -.093 (.154)
Failure experiencer                      -.018** (.006)   -.025 (.017)    -.527 (.293)         -.003* (.001)   -.005* (.002)   -.167** (.064)
Post shock                               .170*** (.005)   .221*** (.014)  25.275*** (.482)     .009*** (.001)  .015*** (.002)  1.672*** (.113)
Intercept                                .714*** (.005)   1.470*** (.012) 29.255*** (.213)     .036*** (.001)  .054*** (.002)  1.500*** (.048)
R squared                                .0038            .0001           .0169                .0016           .0002           .0003
Effect size                              -3.08%           -3.74%          -7.14%               -               -               -
Mean Y                                   .78              1.56            41.08                .04             .06             2.23

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
Mechanisms Behind the Effects of Failures on Shopping Outcomes
We now provide descriptive evidence for the potential mechanisms behind the results. The
overall negative effect of app failure on shopping outcomes across channels could be due to
decreases in intermediate outcomes such as shoppers’ engagement after failure. To explore this
possibility, we examine the effect of app failure on app engagement variables such as the number
of app sessions, the average dwell time per session, and the average number of app features used
in each session. The results of the corresponding DID model appear in Table 5. The treatment
effect of failure for each of the three variables is negative and significant (p < .001), suggesting
that app failure is associated with diminished app engagement.
Table 5
DID MODEL RESULTS FOR APP ENGAGEMENT VARIABLES

Variable                                 No. of app        Average dwell       Average no. of
                                         sessions          time per session    app features used
Failure experiencer x Post shock (DID)   -.689*** (.005)   -7.444*** (.072)    -4.678*** (.024)
Failure experiencer                      .651*** (.005)    7.041*** (.065)     4.508*** (.021)
Post shock                               -.624*** (.003)   -3.525*** (.047)    -2.558*** (.016)
Intercept                                .727*** (.003)    4.654*** (.040)     3.067*** (.013)
R squared                                .4211             .1833               .4843
Mean Y                                   .57               4.61                2.91

Notes: Robust standard errors clustered by shoppers are in parentheses; the app engagement variables are measured 5 hours pre- and post-failure; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
The differential effect of app failure across the channels could be explained by the co-
occurrence of two countervailing forces: channel substitution and brand preference dilution. The
channel substitution effect occurs when app failure experiencers move to the mobile web
browser, the desktop website, or the physical store to complete their intended purchase. If
shoppers switch channels to complete their intended purchase, we should not observe negative
effects of the failure in the channels of their subsequent purchases. We may even see positive effects if the switch to the other channel leads to greater purchases in that channel than would have occurred in the online channel. Brand preference dilution happens when app failure
experiencers get annoyed or dissatisfied with the retailer and lower their future purchases overall,
including in the store. It is possible that the channel substitution and brand preference dilution effects
operate when shoppers experience the app failure at different stages of the purchase funnel.
Shoppers who are close to purchase at the time of app failure may quickly switch channels and
complete their purchase through the mobile or desktop website forms of the online channel.
However, shoppers who are far from purchase when the app fails may prefer the retailer brand
less and buy less than what they had planned to in the future, perhaps because they switched to
competing retailers instead.
To explore the role of stage in the purchase funnel in explaining the differential effects of app
failure, we first examine the effects of app failure across shoppers based on whether they are
close to or far from purchase at the time of failure. For this analysis, we utilize information in the
data about the type of page on which the shopper was when the failure occurred. Table 6 reports
the DID model results when the app failure occurred on purchase related and non-purchase
related pages. Purchase-related pages in an app involve pages that are closer to purchase, such as
those relating to adding a product to the shopping cart, clicking checkout, or making payments. In contrast, non-purchase-related pages involve pages whose actions are farther from purchase, such as browsing products or obtaining store-related information. The effect of app failure is negative and significant (p < .001) on all the outcome variables for shoppers who experience failure on a non-purchase-related page, but not for shoppers who experience failure on a purchase-related page. Shoppers who already have a strong purchase intent and are on a purchase-related
page right before the failure are not as negatively affected as those without a strong purchase
intent or on a non-purchase related page.
Table 6
DID MODEL RESULTS FOR FAILURES OCCURRING ON PURCHASE AND NON-PURCHASE RELATED PAGES

                                         Failure on purchase related page                     Failure on non-purchase related page
Variable                                 Frequency of    Quantity of      Value of            Frequency of     Quantity of      Value of
                                         purchases       purchases        purchases           purchases        purchases        purchases
Failure experiencer x Post shock (DID)   .000 (.013)     -.016 (.038)     .907 (1.195)        -.053*** (.009)  -.108*** (.022)  -4.627*** (.763)
Failure experiencer                      -.019 (.011)    -.016 (.034)     -.246 (.52)         -.041*** (.007)  -.075*** (.019)  -1.208*** (.341)
Post shock                               .178*** (.006)  .236*** (.014)   26.947*** (.497)    .178*** (.006)   .236*** (.014)   26.947*** (.497)
Intercept                                .750*** (.005)  1.523*** (.012)  30.755*** (.219)    .750*** (.005)   1.523*** (.012)  30.755*** (.219)
R squared                                .004            .001             .019                .004             .001             .018
Mean Y                                   .836            1.637            44.270              .813             1.591            42.850

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences. N = 160,662 for failure on purchase related page. N = 217,418 for failure on non-purchase related page.
To further explore the role of the purchase funnel, we compare the change in the value of
purchases between the post and the pre app failure time periods for two groups of shoppers, close
to and far from purchase based on a median split of re-login attempts during the failure window.
The median number of attempts is three. The negative effect of failure for shoppers who make more re-login attempts is smaller (change in value of purchases, post minus pre: 28.03 for high-attempt shoppers vs. 26.95 for the control group, p > .01) than for shoppers who make fewer re-login attempts (20.33 for low-attempt shoppers vs. 26.95 for the control group, p < .001).
The group of shoppers who are close to purchase at the time of app failure are likely to
repeatedly attempt to re-login during the failure duration to complete their intended purchase.
Such shoppers may eventually make the purchase in another channel, resulting in channel
substitution. However, the group of shoppers who are far from purchase at the time of failure make fewer attempts to log back in during the failure time window. A greater negative effect of app
failure for such shoppers may be due to brand preference dilution.
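The re-login median split described above is straightforward to implement; a generic sketch in Python (the shopper IDs and attempt counts below are illustrative, not from the dataset):

```python
import statistics

def median_split(attempts_by_shopper):
    """Split failure experiencers into 'close to purchase' (re-login
    attempts above the median) and 'far from purchase' groups."""
    med = statistics.median(attempts_by_shopper.values())
    high = {s for s, a in attempts_by_shopper.items() if a > med}
    low = {s for s, a in attempts_by_shopper.items() if a <= med}
    return med, high, low

# Illustrative attempt counts; in the data the median is three.
attempts = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5}
med, high, low = median_split(attempts)
print(med, sorted(high), sorted(low))  # 3 ['s4', 's5'] ['s1', 's2', 's3']
```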
Failure-experiencers who were close to a purchase or had purchase intent would have had to
determine whether to complete the transaction, and if so, whether to do it online or offline. For
shoppers who typically buy online, the cost of going to the retailer’s website to complete a
purchase interrupted by the app failure is smaller than that of going to the store to complete the
purchase. Therefore, these shoppers will likely complete the transaction online and not exhibit
any significant decrease in shopping outcomes in the online channel post failure. Thus, the channel substitution effect likely explains the insignificant effects of app failure in the online channel. By
contrast, shoppers who typically buy in the retailer’s brick-and-mortar stores and who experience
the app failure, will likely have a diminished perception of the retailer with fewer incentives to
buy from the stores in the future. Thus, the brand preference dilution effect may prevail for these
shoppers after app failure. This effect is due to a negative spillover from the app channel to the
offline channel for shoppers experiencing the failure even if they are primarily offline shoppers.
Indeed, a negative message or experience can have an adverse spillover effect on attributes or
contexts outside the realm of the message or experience (Ahluwalia et al. 2001).
To further explore channel substitution toward the online channel, we examine the time
elapsed between the occurrence of the failure and subsequent purchase in the online channel.
Failure experiencers’ inter-purchase time online (Mean_treated = 162.8 hours) is much shorter than non-experiencers’ (Mean_control = 180.7 hours) (p = .003). This result further suggests that after an
app failure, shoppers look to complete their intended purchases in the online channel.
Next, to understand channel substitution toward the offline channel, we examine the effect of
app failure for shoppers who were geographically close to a physical store at the time of failure.
Shoppers who are closer to a store when they experience the app failure could more easily
complete their purchase in the store than shoppers farther from a store. Table 7 reports the DID
model for the subsample of shoppers located within two miles of the retailer’s store at the time of
failure. The results show that shoppers closer to the store are not negatively affected by the failure. Rather surprisingly, both the basket size and the monetary value of purchases for shoppers close to a store are significantly higher after the app failure (p < .05). This result
suggests that shoppers who experience a failure close to or at a physical store end up buying
additional items in the store. Thus, app failure has an unintended positive effect on such
shoppers. An implication is that channel substitution to a store leads to more purchases, but that
channel substitution is less likely for shoppers who are farther from the store. However, the
proportion of shoppers close to the store at the time of failure is very small (2.4%), so the overall
effects of app failure on offline purchases and all purchases are still negative.
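Constructing this subsample requires the distance between each shopper's location at failure time and the nearest store; a standard haversine sketch in Python (the coordinates are hypothetical, since the data are only described in terms of distance to the store):

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def near_store(shopper_loc, store_locs, threshold_miles=2.0):
    """True if any store is within the distance threshold."""
    return any(haversine_miles(*shopper_loc, *s) <= threshold_miles
               for s in store_locs)

# Hypothetical coordinates: a shopper about one mile from one of two stores.
stores = [(30.628, -96.334), (30.700, -96.400)]
print(near_store((30.6425, -96.334), stores))  # True
```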
Table 7
DID MODEL RESULTS FOR VALUE OF PURCHASES AND BASKET SIZE BY CHANNEL FOR SHOPPERS CLOSE TO A STORE (< 2 MILES) AT THE TIME OF FAILURE

                                         Offline                               Online
Variable                                 Value of           Basket size        Value of          Basket size
                                         purchases                             purchases
Failure experiencer x Post shock (DID)   13.542* (5.307)    .134* (.058)       .885 (1.178)      .023 (.020)
Failure experiencer                      2.419 (2.171)      .061 (.048)        -1.096* (.479)    -.012 (.013)
Post shock                               37.833*** (3.150)  .109** (.036)      2.382** (.882)    .019 (.012)
Intercept                                32.458*** (1.336)  .846*** (.031)     2.251*** (.390)   .059*** (.009)
R squared                                .0395              .0064              .0027             .0012
Mean Y                                   55.00              .95                3.18              .07

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 6,572. DID = Difference-in-Differences. Two miles is the median distance from the store at the time of failure.
To further analyze the role of distance to the store at the time of failure, we present the
contrast analysis between shoppers who were less than two miles and those who were more
than two miles from the nearest store at the time of failure in Table 8. The basket sizes of these
groups of shoppers do not differ post failure. However, shoppers closer to the store spend more
than those farther from the store post failure, suggesting that the app failure is associated with
channel substitution in purchases for shoppers closer to the store.
Table 8
CONTRAST ANALYSIS BASED ON DISTANCE TO STORE AT THE TIME OF FAILURE FOR FAILURE EXPERIENCERS

Variable                             Offline value of     Offline basket
                                     purchases            size
Close to store x Post shock (DID)    14.130* (5.707)      .083 (.065)
Close to store                       2.011 (2.357)        .051 (.051)
Post shock                           40.511*** (3.675)    .159** (.046)
Intercept                            34.021*** (1.593)    .855*** (.036)
R squared                            .0432                .0064
Mean Y                               54.96                .98

Note: N = 5,650. Closeness to store is defined using the median distance of 2 miles. There are 1,298 failure-experiencers within 2 miles of the store at the time of failure and 1,527 failure-experiencers who are 2 miles or farther from the store among those who opted in for location sharing. *** p < .001, ** p < .01, * p < .05.
Finally, to better understand how channel substitution and brand preference dilution effects
may act on different failure-experiencing shopper groups purchasing in different channels, we
compare the effects of app failure on orders above the free shipping cost threshold value ($35) in
the online and offline channels. Failure experiencers who intended to order items valued above
the threshold can quickly substitute the app channel with the Web channel without additional
cost, so we expect the app failure to have little effect on their frequency of online purchases. The
results of a DID model for the frequency of online and offline purchases above the free shipping
threshold appear in Table 9. Indeed, app failure has no significant (p > .10) effect in the online
channel but a negative and significant (p < .05) effect in the offline channel on the frequency of purchases, suggesting that offline shoppers significantly lower their preference and purchases after
the app failure. Thus, channel substitution appears to explain the null effect of app failure online,
while brand preference dilution seems to account for the negative effect of failure offline.
Table 9
DID MODEL RESULTS FOR THE AVERAGE NUMBER OF ORDERS ABOVE FREE SHIPPING ORDER VALUE THRESHOLD

Variable                                 Average number of      Average number of
                                         online orders          offline orders
                                         above $35              above $35
Failure experiencer x Post shock (DID)   .021 (.021)            -.010* (.005)
Failure experiencer                      -.008 (.016)           .007 (.004)
Post shock                               .118*** (.015)         .103*** (.004)
Intercept                                .414*** (.011)         .444*** (.003)
R-squared                                .017                   .013
N                                        8,178                  109,836
Mean Y                                   .50                    .48

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
We realize that much of the evidence for the mechanisms is descriptive and suggestive in
nature. Nevertheless, overall, the evidence is consistent with the asymmetry in the effect of app
failure on shopping outcomes across the two channels.
Next, we examine the heterogeneity in the sensitivity of shoppers to app failures in two ways.
We use a theory-based moderator approach as well as a data-driven machine learning approach.
Moderators: Relationship Strength and Prior Digital Use
The literatures on relationship marketing and service recovery suggest that two factors may moderate the impact of app failures on outcomes: relationship strength and prior digital channel use.
Relationship Strength. The service marketing literature offers mixed evidence on the
moderating role of the strength of customer relationship with the firm in the effect of service
failure on shopping outcomes. Some studies suggest that a stronger relationship may aggravate the effect of failures on product evaluation, satisfaction, and purchases (Chandrashekaran et al. 2007; Gijsenberg et al. 2015; Goodman et al. 1995). Other studies show that a stronger relationship attenuates the negative effect of service failures (Hess et al. 2003; Knox and van
Oest 2014). Consistent with the direct marketing literature (Bolton 1998; Schmittlein et al.
1987), we operationalize customer relationship using RFM (recency, frequency, and monetary
value) dimensions. Because of the high correlation between the interaction of frequency with (failure experiencer x post shock) and that of value of purchases with (failure experiencer x post shock) (r = .90, p < .001), and because value of purchases is more important for the retailer, we drop frequency of past purchases.
Prior Digital Channel/Online Use/Experience. The moderating effect of a shopper’s prior
digital channel/online use or experience with the retailer on app failure’s impact on shopping
outcomes could be positive or negative. On the one hand, more digitally experienced app users
may be less susceptible to the negative impact of an app crash on subsequent engagement with
the app than less digitally experienced app users (Shi et al. 2017) because they are conditioned to
expect some level of technology failures, consistent with the product harm crises literature
(Cleeren et al. 2013; Liu and Shankar 2015) and the expectation-confirmation theory (Cleeren et
al. 2008; Oliver 1980; Tax et al. 1998). On the other hand, prior digital exposure and experience
with the firm may heighten shopper expectations and make them less tolerant of failures. We
operationalize this variable as the cumulative number of purchases that the shopper made from
the retailer’s website prior to experiencing a failure.
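Both moderators can be built from a raw transaction log; a sketch in Python (the log layout, field names, and dates are assumptions for illustration, not the retailer's schema):

```python
from datetime import date

def rfm_and_online_count(transactions, as_of):
    """Per-shopper recency (days), frequency, monetary value, and
    cumulative online purchase count, computed before a cutoff date."""
    profiles = {}
    for shopper, tx_date, amount, channel in transactions:
        if tx_date >= as_of:
            continue  # only purchases before the failure
        p = profiles.setdefault(shopper,
                                {"recency": None, "frequency": 0,
                                 "monetary": 0.0, "online": 0})
        p["frequency"] += 1
        p["monetary"] += amount
        days_ago = (as_of - tx_date).days
        if p["recency"] is None or days_ago < p["recency"]:
            p["recency"] = days_ago  # days since most recent purchase
        if channel == "online":
            p["online"] += 1
    return profiles

# Hypothetical three-row log: (shopper, date, amount, channel).
log = [("s1", date(2018, 3, 1), 25.0, "offline"),
       ("s1", date(2018, 3, 20), 40.0, "online"),
       ("s2", date(2018, 2, 10), 15.0, "offline")]
print(rfm_and_online_count(log, as_of=date(2018, 4, 1)))
```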
The results of the model with relationship strength and past digital channel use as moderators
appear in Table 10. Consistent with our expectation, the monetary value of past purchases has
positive and significant interaction coefficients with the DID model variable across all the
outcome variables (p < .001). Thus, app failures have a smaller effect on shoppers with stronger
relationship with the retailer, consistent with the results of Ahluwalia et al. (2001). Recency has
negative coefficients (p < .001), suggesting that the more recent shoppers are less tolerant of
failure. A failure shock also has a greater effect on the frequency, quantity, and value of purchases (p < .001) of shoppers with more digital channel or online purchase experience with the retailer.
Table 10
DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES ACROSS CHANNELS: MODERATING EFFECTS OF RELATIONSHIP WITH RETAILER AND PAST ONLINE PURCHASE FREQUENCY

Variable                                 Frequency of      Quantity of       Value of
                                         purchases         purchases         purchases
Failure experiencer x Post shock (DID)   -.193*** (.012)   -.389*** (.03)    -12.935*** (.879)
DID x Past value of purchases            .000*** (.000)    .000*** (.000)    .017*** (.001)
DID x Recency of purchases               -.001*** (.000)   -.003*** (.000)   -.022*** (.006)
DID x Past online purchase frequency     -.019*** (.003)   -.029*** (.007)   -1.344*** (.220)
Past value of purchases                  .000*** (.000)    .001*** (.000)    .024*** (.000)
Recency                                  .005*** (.000)    .009*** (.000)    .206*** (.003)
Past online purchase frequency           .002 (.001)       .007 (.004)       -.638*** (.105)
Failure experiencer                      .007 (.007)       .037* (.017)      .589 (.497)
Post shock                               .178*** (.007)    .236*** (.017)    26.947*** (.505)
Intercept                                .640*** (.006)    1.141*** (.015)   25.034*** (.446)
R squared                                .159              .122              .093
Mean Y                                   .839              1.643             44.070

Notes: DID = Difference-in-Differences. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378.

Heterogeneity in Shoppers’ Sensitivity to App Failures (Treatment Effect)
In addition to the moderator variables from the service marketing literature examined earlier, we also explore heterogeneity in treatment effects relating to additional managerially useful observed variables (e.g., gender, loyalty level) not fully examined by prior research.
Unfortunately, including these variables as additional moderators in the DID analysis explodes
the number of main and interaction effects.
Recent methods of causal inference using machine learning such as “causal forest” allow us to
recover individual-level conditional average treatment effects (CATE) (Athey et al. 2017; Wager
and Athey 2018). The causal forest is an ensemble of causal trees that averages the predictions of
treatment effects produced by each tree for thousands of trees.9 It has been applied in marketing
to model customer churn and information disclosure (Ascarza 2018; Guo et al. 2018).
The estimates from the causal forest using 1,000 trees appear in Web Appendix Table A1. About
96% of the shoppers have a negative value of CATE with an average of -1.739. The distribution
of CATE across shoppers appears in Web Appendix Figure A1. The shopper quintiles based on CATE levels reflect this distribution in Web Appendix Figure A2, which shows that Segment 1, comprising the most sensitive shoppers, exhibits higher variance than the rest.
Next, we regress the CATE estimate on the covariate space to identify the covariates that best
explain treatment heterogeneity. The results appear in Web Appendix Table A2. They show that
all the covariates, including gender and loyalty, are significant (p < .001). Shoppers with higher
value of past purchases and more frequent online purchases are less sensitive to an app failure
than others. Shoppers who bought more recently in the past are less tolerant of an app failure.
Some of these results complement those from the moderator analysis.
The causal forest-derived CATE regression differs from the moderator DID regression in
important ways. First, the moderator regression uses the entire sample for estimation, while the
causal forest, the basis for the CATE regression, uses a subset of the data (the training sample)
for estimation. Second, the causal forest underlying the CATE regression splits the training data
further to estimate an honest tree, estimating from an even smaller subset of the moderator
regression sample. Third, relative to the linear moderator regression, the CATE regression can
handle a much larger number of covariates. Because of these differences, the results of the
CATE regression model may not exactly mirror those of the moderator regression model.
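To make the honest-estimation idea behind the causal forest concrete, the following toy Python sketch fits a single "stump" causal tree on simulated data: one half of the sample chooses the split that maximizes the difference in leaf-level treatment effects, and the held-out half estimates the leaf CATEs. A real causal forest averages thousands of such trees over random subsamples; this is an illustration, not the estimator used in the paper.

```python
import random
import statistics as st

def leaf_effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for x, d, y in rows if d == 1]
    control = [y for x, d, y in rows if d == 0]
    return st.mean(treated) - st.mean(control)

def honest_causal_stump(data, thresholds):
    """Single honest causal split: half the data picks the threshold,
    the held-out half estimates the leaf CATEs (Athey-Imbens honesty)."""
    random.shuffle(data)
    half = len(data) // 2
    train, est = data[:half], data[half:]

    def spread(t):  # how different are the two leaves' effects?
        left = [r for r in train if r[0] <= t]
        right = [r for r in train if r[0] > t]
        if min(len(left), len(right)) < 20:
            return float("-inf")
        return abs(leaf_effect(left) - leaf_effect(right))

    best = max(thresholds, key=spread)
    left_cate = leaf_effect([r for r in est if r[0] <= best])
    right_cate = leaf_effect([r for r in est if r[0] > best])
    return best, left_cate, right_cate

# Simulated shoppers: treatment hurts only when the covariate x <= 0.5.
random.seed(7)
data = []
for _ in range(4000):
    x, d = random.random(), random.randint(0, 1)
    effect = -3.0 if x <= 0.5 else 0.0
    y = 10.0 + d * effect + random.gauss(0, 1)
    data.append((x, d, y))

best, left_cate, right_cate = honest_causal_stump(
    data, thresholds=[0.3, 0.4, 0.5, 0.6, 0.7])
print(best, round(left_cate, 1), round(right_cate, 1))
```

With this data-generating process, the chosen threshold should be near 0.5, with a strongly negative CATE in the left leaf and a CATE near zero in the right leaf.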
9 In Web Appendix A, we provide an overview of causal trees and describe the algorithm for estimating a single causal tree followed by bagging a large number of causal trees into a forest.
ROBUSTNESS CHECKS AND RULING OUT ALTERNATIVE EXPLANATIONS
We perform several robustness checks and tests to rule out alternative explanations for the effect
of app failure on purchases.
Alternative model specifications. Although the failure in our data is exogenous, as an additional check, we also estimate models with shopper covariates, beyond our proposed DID model, to estimate the treatment effect of interest. Additionally, we estimate Poisson count data models for
the frequency and quantity variables. The results from these models replicate the findings from
Tables 3 and 4 and appear in the Web Appendix Tables B1-B2 and C1-C2, respectively. The
coefficients of the treatment effect from Tables B1 and C1 represent changes in outcomes due to
app failures, conditioned on covariates. These results are substantively similar to those in Tables
3 and 4. The insensitivity of the results to control variables suggests that the effect of
unobservables relative to these observed covariates would have to be very large to significantly
change our results (Altonji et al. 2005). Similarly, the results are robust to a Poisson
specification, reported in Tables B2 and C2.
Outliers. We re-estimate the models by removing outliers (extremely heavy spenders who are
greater than three standard deviations away from the mean in monetary value of purchases in the
pre-period) from our data. Web Appendix Tables B3 and C3 report these results. We find the
results to be consistent with and even stronger than those reported earlier.
Existing shoppers. Another possible explanation for the effect of app failures could be that only new or dormant shoppers are sensitive to failures, perhaps due to low switching costs. Therefore, we
remove those with no purchases in the last 12 months to see if their behavior is similar to that of
the existing shoppers. Indeed, Web Appendix Tables B4 and C4 report substantively similar
results after excluding the new or dormant shoppers.
Alternative measures of digital channel use moderators. In lieu of past online purchase frequency as a measure of prior digital channel use, we use measures based on a median split of the number and share of online purchases, and of prior app usage in the time between app launch and the server failure in the app. The results for the alternative online purchase measures are almost the same
as our proposed model results, except for prior app usage. Shoppers who use the app more
frequently appear to be less sensitive to failures as shown in Web Appendix Tables B5 and C5.
Regression discontinuity analysis. To ensure that there are no unobservable differences
between failure experiencers and non-experiencers based on the time of login, we carry out a
‘regression discontinuity’ (RD) style analysis in the one hour before the start time of the service
failure. For the RD analysis, we consider only app users in the neighborhood of this time, using
as control group those users who logged in one hour before and after the failure period and as
treated the users who logged in during the failure period. The results are substantively similar to
our main model results and are reported in Web Appendix Tables B6 and C6.
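The RD-style sample construction amounts to filtering app users by login timestamp relative to the failure window; a schematic Python sketch (the timestamps and window bounds are illustrative, not the actual failure times):

```python
from datetime import datetime, timedelta

def rd_groups(logins, fail_start, fail_end):
    """Assign users to treated/control by login time: treated users
    logged in during the failure window; controls logged in within
    one hour before or after it. Others are dropped."""
    treated, control = set(), set()
    hour = timedelta(hours=1)
    for user, ts in logins:
        if fail_start <= ts <= fail_end:
            treated.add(user)
        elif fail_start - hour <= ts < fail_start or fail_end < ts <= fail_end + hour:
            control.add(user)
    return treated, control

# Illustrative failure window and login times.
start = datetime(2018, 4, 7, 12, 0)
end = datetime(2018, 4, 7, 13, 30)
logins = [("u1", datetime(2018, 4, 7, 12, 30)),   # during failure: treated
          ("u2", datetime(2018, 4, 7, 11, 20)),   # 40 min before: control
          ("u3", datetime(2018, 4, 7, 9, 0))]     # outside window: dropped
print(rd_groups(logins, start, end))  # ({'u1'}, {'u2'})
```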
Longer-term effect of failures. Our main analysis shows 14-day effects of app failures. To
explore if these effects continue over longer periods of time, we examined the outcomes four
weeks pre- and post- the failure event. There is a steep fall in the period immediately after the failure. Purchases climb back over the next three weeks, but they settle at levels lower than the pre-period average. Thus, a diminished negative impact of the failure persists over time. These patterns appear in Figure 6 and Web Appendix Table D1. The table
shows the coefficients of the interactions of weekly dummies with TREAT for a DID regression.
Because an app failure occurs every 7-8 weeks, we estimate the effects four weeks pre and post
so as to avoid our pre- or post- periods overlapping with other failures.
Figure 6 APP FAILURE EFFECTS ON VALUE OF PURCHASES OVER FOUR WEEKS
Note: The effects for all but one of the pre-failure weeks are insignificant. The horizontal line represents the average treatment effect.
Stacked model for channel effects. The results for online and offline purchases in Table 4 do
not show the relative sizes of the effects across the two channels. To examine these relative
effects, we estimate a stacked model of online and offline outcomes that includes a channel
dummy. The results for this model appear in Web Appendix Table D2. We interpret the effects
as a proportion of the purchases within the channel and conclude that the effects in the offline
channel are more negative than those in the online channel (p < .01). We also estimated a DID
regression model with value of purchases in the offline channel as a proportion of total purchases
and found negative and significant effects of failure (p < .01).
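A minimal sketch of the stacked specification: the outcomes of both channels are stacked into one sample with a channel dummy, and the DID x OFFLINE interaction tests whether the offline effect is more negative than the online one. The data below are simulated with hypothetical effect sizes, not the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stacked data: each shopper-period contributes one offline and
# one online row. True DID effects (illustrative): -2.0 offline, -0.1 online.
n = 40_000
treat = rng.integers(0, 2, n).astype(float)
post = rng.integers(0, 2, n).astype(float)

y_off = 40 + 2 * post + treat * post * -2.0 + rng.normal(0, 4, n)
y_on = 3 + 2 * post + treat * post * -0.1 + rng.normal(0, 4, n)

y = np.concatenate([y_off, y_on])
off = np.concatenate([np.ones(n), np.zeros(n)])   # channel dummy: 1 = offline
t2, p2 = np.tile(treat, 2), np.tile(post, 2)
did = t2 * p2

# Stacked DID: intercept, TREAT, POST, OFFLINE, DID, and DID x OFFLINE.
X = np.column_stack([np.ones(2 * n), t2, p2, off, did, did * off])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_online = beta[4]          # DID effect in the online channel
did_offline_extra = beta[5]   # how much more negative the offline effect is
```

A significantly negative `did_offline_extra` corresponds to the paper's conclusion that offline effects are more negative than online effects.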
DISCUSSION, MANAGERIAL IMPLICATIONS, AND LIMITATIONS
Summary
In this paper, we addressed novel research questions: What is the effect of a service failure in a
retailer’s mobile app on the frequency, quantity, and monetary value of purchases in online and
offline channels? What possible mechanisms may explain these effects? How do shoppers’
relationship strength and prior digital channel use moderate these effects? How heterogeneous is
shoppers’ sensitivity to failures? By answering these questions, our research fills an important
gap at the crossroads of three disparate streams of research in different stages of development:
the mature stream of service failures, the growing stream of omnichannel marketing, and the
nascent stream of mobile marketing. We leveraged a random systemwide failure in the app to
measure the causal effect of app failure. To our knowledge, this is the first study to causally
estimate the effects of digital service failure using real world data. Using unique data spanning
online and offline retail channels, we examined the spillover effects of such failures across
channels and examined heterogeneity in these effects based on channels and shoppers.
Our results reveal that app failures have a significant negative effect on shoppers’ frequency,
quantity, and monetary value of purchases across channels. These effects are heterogeneous
across channels and shoppers. Interestingly, the overall decreases in purchases across channels
are driven by reductions in store purchases and not in digital channels. Furthermore, we find that shoppers with higher monetary value of past purchases are less sensitive to app failures.
Overall, our nuanced analyses of the mechanisms by which an app failure affects purchases
offer new and insightful explanations in a cross-channel context. Our findings are consistent with
the view that some customers may be tolerant of technological failures (Meuter et al. 2000).
Finally, our study offers novel insights into the cross-channel implications of app failures.
Economic Significance
The economic effects of failures are sizeable enough for any retailer to reconsider its service failure prevention and recovery strategies. Based on our estimates, the economic impact of an app failure is a revenue loss of about $.48 million.10 The retailer experiences about 5-7 failures each year, resulting in an annual loss of $2.4-$3.4 million. This loss may not amount to a sizeable portion of the retailer's annual revenues. However, given the low retail margins and the retailer's vulnerable financial condition, it is a substantial amount for this retailer.
10 We compute this figure by using the weekly effect coefficients in Table D1, i.e., $(.82 + 1.70 + .58 + .53)*N for the first four weeks and $(.64)*5*N, assuming that the fifth week's effect remains for another five weeks until the next failure, for N = 70,568 failure experiencers, totaling $.48 million.
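The arithmetic in footnote 10 can be verified directly; the sketch below uses only figures reported in the text:

```python
# Back-of-envelope reproduction of the revenue-loss computation in footnote 10.
weekly_effects = [0.82, 1.70, 0.58, 0.53]    # $ per experiencer, post weeks 1-4 (Table D1)
week5_effect = 0.64                          # assumed to persist for five more weeks
n_experiencers = 70_568

loss_per_shopper = sum(weekly_effects) + week5_effect * 5
total_loss = loss_per_shopper * n_experiencers        # per-failure loss, ~$.48 million

annual_loss = (5 * total_loss, 7 * total_loss)        # 5-7 failures per year
```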
The economic effect is meaningful for several reasons. First, because retailers operate on thin margins (2-3% in many categories) and are cost-conscious, such an economic loss is impactful. Second,
the effect size of 7.1% from our results is consistent with and even higher than those from other
similar causal studies. For example, exposure to banner advertising has been shown to lift
purchase intention by .473% worth 42 cents/click to the firm (Goldfarb and Tucker 2011).
Goldfarb and Tucker (2011) argue, “although the coefficient may seem small, it suggests an
economically important impact of online advertising.” Third, in the mobile context, the effect of
being in a crowd (of five people relative to two per square meter when receiving a mobile
promotion) results in an economically meaningful 2.2% more clicks (Andrews et al. 2015).
Fourth, Akca and Rao (2020) argue that a revenue drop of $5.32 million is economically
significant for a large company such as Orbitz. Fifth, as sales through the mobile app and online
sales are growing rapidly, this effect is only getting larger. Sixth, our estimates are for one two-
hour app failure in a year. Finally, the effects continue over a longer five-week period.
Managerial Implications
Service failures and low-quality service likely lead customers to terminate their relationship with a firm (Sriram et al. 2015). The
insights from our research better inform executives in managing their mobile app and channels
and offer practitioner implications for service failure preventive and recovery strategies.
Preventive Strategies. Managers can use the estimate that an app failure results in a 7.1%
decrease in monetary value of purchases to budget resources for their efforts to prevent or reduce
app failures. The result that the adverse effect of failure is lower for shoppers closer to purchase
and purchasing less recently offers interesting pointers for retailers to prevent damage to their
brands and revenues. In general, managers should encourage shoppers to use the app more, get
closer to purchase, and purchase more through the app. Managers could offer limited-time
incentives to shoppers who have not clicked the checkout or purchase tabs in the app.
By identifying failure-sensitive shoppers based on relationship strength, prior digital use, and
individual-level CATE estimates, managers can take proactive actions to prevent these shoppers
from reducing their shopping intensity with the firm. Figure 7 represents the loss of revenues
(spending) from each percentile of shoppers at different levels of failure sensitivity.
Figure 7 RETAILER’S REVENUE LOSS BY PERCENTILE OF SHOPPERS EXPERIENCING APP FAILURE
Note: CATE = Conditional Average Treatment Effect.
About 70% of the losses in revenues due to failure arise from just 47% of the shoppers.
Managers can manage these shoppers’ expectations through email and app notification
messaging channels. Warning shoppers of the typical number of disruptions in the app can preempt
negative attributions and attitudes, and limit potential brand dilution and drop in revenues due to
app failure.
Recovery Strategies. The finding that app failures result in reduced purchases across channels
suggests that managers should develop interventions and recovery strategies to mitigate the
negative effects of app failures not just in the mobile channel, but also in other channels, in
particular, the offline channel. Thus, seamlessly integrating data from a mobile app with data
from its stores and websites can help an omnichannel retailer build continuity in shoppers’
experiences.
Immediately after a shopper experiences an app failure, the manager of the app should
provide gentle nudges and even incentives for the shopper to complete an abandoned transaction
on the app. Typically, a manager may need to provide these nudges and incentives through other
communication channels such as email, phone call, or face-to-face chat. These nudges are similar
in spirit and execution to those from firms such as Fitbit and Amazon, which remind customers
through email to reconnect when they disconnect their watch and smart speaker, respectively. If
the store is a dominant channel for the retailer, the retailer should use its store associates to
reassure or incentivize shoppers. In some cases, managers can even offer incentives in other
channels to complete a transaction disrupted by an app failure.
Because diminished purchases after failure result from reduced engagement, managers should aim to enhance engagement after a systemwide failure. Once service is restored, managers could induce shoppers to use the app more through gamification features in the app or by providing enhanced loyalty points for logging back into the app.
The finding that app failure can enhance spending for shoppers who experience the failure close to the store offers useful cross-selling opportunities for the retailer. After a systemwide failure is resolved, retailers can proactively promote products, selected from each failure-experiencing shopper's purchase history, in the store nearest to that shopper.
Managers should mitigate the negative effects of app failures for the most sensitive shoppers
first. They should proactively identify failure-sensitive shoppers and design preemptive
strategies to mitigate any adverse effects. We find that shoppers with weaker relationships with the provider are more sensitive to failures. Thus, firms should address such shoppers for recovery
after a careful cost-benefit analysis. This is important because apps serve as a gateway for future
purchases for these shoppers.
Finally, our analysis of heterogeneity in shoppers’ sensitivity to app failures suggests that
managers should satisfy first the shoppers with the highest values of CATE. Interventions
targeted at the 47% of the shoppers who contribute to 70% of losses could lead to higher returns.
Limitations
Our study has limitations that future research can address. First, we have data on a limited
number of failures, so we could not fully explore all the failures with varying durations. Second,
our results are most informative for similar retailers that have a large brick-and-mortar presence
but growing online and in-app purchases. If data are available, future research could study app
failures for primarily online retailers with an expanding offline presence (e.g., Bonobos, Warby
Parker). Third, we do not have data on competing apps that shoppers may use. Additional
research could study shoppers’ switching behavior if data on competing apps are available.
Fourth, our data contain a relatively low number of purchases in the mobile channel. For better
generalizability of the extent of spillover across channels, our analysis could be extended to
contexts in which a substantial portion of purchases are made within the app. Fifth, we do not
have data on purchases made through the app vs. mobile browser. Studying differences between
these two mobile sub-channels is a fruitful future research avenue. Finally, mobile apps may be
an effective way to recover from the adverse effects of service failures (Tucker and Yu 2018).
Our approach also provides a way to identify app-failure sensitive shoppers, but we do not have
data on shoppers’ responses to service recovery to recommend the best mitigation strategy. The
strategies we do recommend could be tested in ethically permissible field studies.
REFERENCES
Ahluwalia, Rohini, H. Rao Unnava, and Robert E. Burnkrant (2001), “The Moderating Role of Commitment on the Spillover Effect of Marketing Communications,” Journal of Marketing Research, 38 (4), 458–70.
Akca, Selin and Anita Rao (2020), “Value of Aggregators,” Marketing Science, 39 (5), 893–922.
Altonji, Joseph G., Todd E. Elder, and Christopher R. Taber (2005), “Selection on Observed and
Unobserved Variables: Assessing the Effectiveness of Catholic Schools,” Journal of Political Economy, 113 (1), 151–84.
Andreassen, Tor Wallin (1999), “What Drives Customer Loyalty with Complaint Resolution?” Journal of Service Research, 1 (4), 324-32.
Andrews, Michelle, Xueming Luo, Zheng Fang, and Anindya Ghose (2015), “Mobile Ad Effectiveness: Hyper-Contextual Targeting with Crowdedness,” Marketing Science, 35 (2), 218–33.
Angrist, Joshua D. and Jörn-Steffen Pischke (2009), Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton: Princeton University Press.
Ansari, Asim, Carl F. Mela, and Scott A. Neslin (2008), “Customer Channel Migration,” Journal of Marketing Research, 45 (1), 60-76.
Athey, Susan and Guido Imbens (2016), “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences, 113 (27), 7353-7360.
Athey, Susan, Guido Imbens, Thai Pham, and Stefan Wager (2017), “Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges,” American Economic Review, 107 (5), 278–81.
Avery, Jill, Thomas J. Steenburgh, John Deighton, and Mary Caravella (2012), “Adding Bricks to Clicks: Predicting the Patterns of Cross-Channel Elasticities Over Time,” Journal of Marketing, 76 (3), 96–111.
Barron’s (2018), “Walmart: Can It Meet Its Digital Sales Growth Targets?,” (accessed November 5, 2020), [available at https://www.barrons.com/articles/walmart-can-it-meet-its-digital-sales-growth-targets-1519681783].
Bell, David R., Santiago Gallino, and Antonio Moreno (2018), “Offline Showrooms in Omni-channel Retail: Demand and Operational Benefits,” Management Science, 64 (4), 1629-51.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004), “How Much Should We Trust Differences-In-Differences Estimates?” The Quarterly Journal of Economics, 119 (1), 249–75.
Bitner, Mary Jo, Bernard H. Booms, and Mary Stanfield Tetreault (1990), “The Service Encounter: Diagnosing Favorable and Unfavorable Incidents,” Journal of Marketing, 54 (1), 71–84.
Blancco (2016), “The State of Mobile Device Performance and Health: Q2,” (accessed November 5, 2020), [available at https://www2.blancco.com/en/research-study/state-of-mobile-device-performance-and-health-trend-report-q2-2016].
Bolton, Ruth N. (1998), “A Dynamic Model of the Duration of the Customer’s Relationship with a Continuous Service Provider: The Role of Satisfaction,” Marketing Science, 17 (1), 45–65.
Brynjolfsson, Erik, Yu Jeffery Hu, and Mohammad S. Rahman (2013), “Competing in the Age of Omnichannel Retailing,” MIT Sloan Management Review, (accessed November 5, 2020), [available at https://sloanreview.mit.edu/article/competing-in-the-age-of-omnichannel-retailing/].
Chandrashekaran, Murali, Kristin Rotte, Stephen S. Tax, and Rajdeep Grewal (2007), “Satisfaction Strength and Customer Loyalty,” Journal of Marketing Research, 44 (1), 153–63.
Chintagunta, Pradeep K., Junhong Chu, and Javier Cebollada (2011), “Quantifying Transaction Costs in Online/Off-line Grocery Channel Choice,” Marketing Science, 31 (1), 96–114.
Cleeren, Kathleen, Marnik G. Dekimpe, and Kristiaan Helsen (2008), “Weathering Product-harm Crises,” Journal of the Academy of Marketing Science, 36 (2), 262–70.
Cleeren, Kathleen, Harald J. van Heerde, and Marnik G. Dekimpe (2013), “Rising from the Ashes: How Brands and Categories can Overcome Product-Harm Crises,” Journal of Marketing, 77 (2), 58-77.
Computerworld (2014), “iOS 8 app crash rate falls 25% since release,” Computerworld, (accessed November 5, 2020), [available at https://www.computerworld.com/article/2841794/ios-8-app-crash-rate-falls-25-since-release.html].
Dimensional Research (2015), “Mobile User Survey: Failing to Meet User Expectations,” TechBeacon, (accessed November 5, 2020), [available at https://techbeacon.com/resources/survey-mobile-app-users-report-failing-meet-user-expectations].
Dotzel, Thomas, Venkatesh Shankar, and Leonard L. Berry (2013), “Service Innovativeness and Firm Value,” Journal of Marketing Research, 50 (2), 259-76.
Fong, Nathan M., Zheng Fang, and Xueming Luo (2015), “Geo-Conquesting: Competitive Locational Targeting of Mobile Promotions,” Journal of Marketing Research, 52 (5), 726–35.
Forbes, Lukas P. (2008), “When Something Goes Wrong and No One is Around: Non‐internet Self‐service Technology Failure and Recovery,” Journal of Services Marketing, 22 (4), 316–27.
Forbes, Lukas P., Scott W. Kelley, and K. Douglas Hoffman (2005), “Typologies of E-commerce Retail Failures and Recovery Strategies,” Journal of Services Marketing, 19 (5), 280–92.
Forman, Chris, Anindya Ghose, and Avi Goldfarb (2009), “Competition between Local and Electronic Markets: How the Benefit of Buying Online Depends on Where You Live,” Management Science, 55 (1), 47–57.
Ghose, Anindya, Hyeokkoo Eric Kwon, Dongwon Lee, and Wonseok Oh (2018), “Seizing the Commuting Moment: Contextual Targeting Based on Mobile Transportation Apps,” Information Systems Research, 30 (1), 154-74.
Gijsenberg, Maarten J., Harald J. Van Heerde, and Peter C. Verhoef (2015), “Losses Loom Longer than Gains: Modeling the Impact of Service Crises on Perceived Service Quality over Time,” Journal of Marketing Research, 52 (5), 642-56.
Goldfarb, Avi and Catherine Tucker (2011), “Online Display Advertising: Targeting and Obtrusiveness,” Marketing Science, 30 (3), 389–404.
Google (2020), “Find Out How You Stack Up to New Industry Benchmarks for Mobile Page Speed,” Think with Google, (accessed November 5, 2020), [available at https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/mobile-page-speed-new-industry-benchmarks/].
Google M/A/R/C Study (2013), “Mobile in-store Research: How In-store Shoppers are Using Mobile Devices,” Google M/A/R/C.
Guo, Tong, S. Sriram, and Puneet Manchanda (2017), “The Effect of Information Disclosure on Industry Payments to Physicians,” SSRN Scholarly Paper, Rochester, NY: Social Science Research Network.
Halbheer, Daniel, Dennis L. Gärtner, Eitan Gerstner, and Oded Koenigsberg (2018), “Optimizing Service Failure and Damage Control,” International Journal of Research in Marketing, 35 (1), 100–15.
Hansen, Nele, Ann-Kristin Kupfer, and Thorsten Hennig-Thurau (2018), “Brand Crises in the Digital Age: The Short- and Long-term Effects of Social Media Firestorms on Consumers and Brands,” International Journal of Research in Marketing, 35 (4), 557–74.
Hess, Ronald L., Shankar Ganesan, and Noreen M. Klein (2003), “Service Failure and Recovery: The Impact of Relationship Factors on Customer Satisfaction,” Journal of the Academy of Marketing Science, 31 (2), 127–45.
Hoffman, K. Douglas and John E. G. Bateson (2001), Essentials of Services Marketing: Concepts, Strategies and Cases, Fort Worth: South-Western College Pub.
Kim, Su Jung, Rebecca Jen-Hui Wang, and Edward C. Malthouse (2015), “The Effects of Adopting and Using a Brand’s Mobile Application on Customers’ Subsequent Purchase Behavior,” Journal of Interactive Marketing, 31, 28–41.
Knox, George and Rutger van Oest (2014), “Customer Complaints and Recovery Effectiveness: A Customer Base Approach,” Journal of Marketing, 78 (5), 42-57.
Liu, Yan and Venkatesh Shankar (2015), “The Dynamic Impact of Product-Harm Crises on Brand Preference and Advertising Effectiveness: An Empirical Analysis of the Automobile Industry,” Management Science, 61 (10), 2514–35.
Ma, Liye, Baohong Sun, and Sunder Kekre (2015), “The Squeaky Wheel Gets the Grease—An Empirical Analysis of Customer Voice and Firm Intervention on Twitter,” Marketing Science, 34 (5), 627–45.
McCollough, Michael A., Leonard L. Berry, and Manjit S. Yadav (2000), “An Empirical Investigation of Customer Satisfaction after Service Failure and Recovery,” Journal of Service Research, 3 (2), 121-37.
Meuter, Matthew L., Amy L. Ostrom, Robert I. Roundtree, and Mary Jo Bitner (2000), “Self-Service Technologies: Understanding Customer Satisfaction with Technology-Based Service Encounters,” Journal of Marketing, 64 (3), 50-64.
Narang, Unnati and Venkatesh Shankar (2019), “Mobile App Introduction and Online and Offline Purchases and Product Returns,” Marketing Science, 38 (5), 756–72.
National Retail Federation (2018), “Top 100 Retailers 2018,” NRF, (accessed November 5, 2020), [available at https://nrf.com/resources/top-retailers/top-100-retailers/top-100-retailers-2018].
Neumann, Nico, Catherine E Tucker, and Timothy Whitfield (2019), “Frontiers: How Effective Is Third-Party Consumer Profiling? Evidence from Field Studies,” Marketing Science, 38 (6), 918-26.
Oliver, Richard L. (1980), “A Cognitive Model of the Antecedents and Consequences of Satisfaction Decisions,” Journal of Marketing Research, 17 (4), 460–69.
Pauwels, Koen and Scott A. Neslin (2015), “Building With Bricks and Mortar: The Revenue Impact of Opening Physical Stores in a Multichannel Environment,” Journal of Retailing, 91 (2), 182–97.
Schmittlein, David C., Donald G. Morrison, and Richard Colombo (1987), “Counting Your Customers: Who Are They and What Will They Do Next?” Management Science, 33 (1), 1–24.
Shi, Savannah Wei, Kirthi Kalyanam, and Michel Wedel (2017), “What Does Agile and Lean Mean for Customers? An Analysis of Mobile App Crashes,” Working Paper, Santa Clara University.
Smith, Amy K. and Ruth N. Bolton (1998), “An Experimental Investigation of Customer Reactions to Service Failure and Recovery Encounters: Paradox or Peril?,” Journal of Service Research, 1 (1), 65–81.
Sriram, S., Pradeep K. Chintagunta, and Puneet Manchanda (2015), “Service Quality Variability and Termination Behavior,” Management Science, 61 (11), 2739–59.
Tax, Stephen S., Stephen W. Brown, and Murali Chandrashekaran (1998), “Customer Evaluations of Service Complaint Experiences: Implications for Relationship Marketing,” Journal of Marketing, 62 (2), 60–76.
Tucker, Catherine E, and Shuyi Yu (2019), “Does IT lead to More Equal treatment? An Empirical Study of the Effect of Smartphone use on Customer Complaint Resolution,” Working Paper, Massachusetts Institute of Technology.
Wager, Stefan and Susan Athey (2018), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 113 (523), 1228–42.
Wang, Kitty and Avi Goldfarb (2017), “Can Offline Stores Drive Online Sales?” Journal of Marketing Research, 54 (5), 706-19.
Xu, Kaiquan, Jason Chan, Anindya Ghose, and Sang Pil Han (2016), “Battle of the Channels: The Impact of Tablets on Digital Commerce,” Management Science, 63 (5), 1469–92.
WEB APPENDIX A CAUSAL FOREST
Causal Trees: Overview
A causal tree is similar to a regression tree. The typical objective of a regression tree is to build
accurate predictions of the outcome variable by recursively splitting the data into subgroups that
differ the most on the outcome variable based on covariates. A regression tree has
decision/internal/split nodes characterized by binary conditions on covariates and leaf or terminal
nodes at the bottom of the tree. The regression tree algorithm continuously partitions the data,
evaluating and re-evaluating at each node to determine (a) whether further splits would improve
prediction, and (b) the covariate and the value of the covariate on which to split. The goodness-
of-fit criterion used to evaluate the splitting decision at each node is the mean squared error
(MSE) computed as the deviation of the observed outcome from the predicted outcome. The tree
algorithm continues making further splits as long as the MSE decreases by more than a specified
threshold.
The causal tree model adapts the regression tree algorithm in several ways to make it
amenable to causal inference. First, it explicitly shifts the goodness-of-fit criterion to treatment
effects rather than the MSE of the outcome measure. Second, it employs “honest” estimates, that
is, the data on which the tree is built (splitting data) are separate from the data on which it is
tested for prediction of heterogeneity (estimating data). Thus, the tree is honest if, for a unit i in the training sample, it uses the response Y_i either to estimate the within-leaf treatment effect or to decide where to place the splits, but not both (Athey and Imbens 2016; Athey et al. 2017). To
avoid overfitting, we use cross-validation approaches in the tree-building stage.
Importantly, the goodness-of-fit criterion for causal trees is the difference between the
estimated and the actual treatment effect at each node. While this criterion ensures that all the
degrees of freedom are used well, it is challenging because we never observe the true effect.
Causal Tree: Goodness-of-fit Criterion
Following Wager and Athey (2018), if we have n independent and identically distributed training examples labeled i = 1, ..., n, each of which consists of a feature vector X_i ∈ [0, 1]^d, a response Y_i ∈ ℝ, and a treatment indicator W_i ∈ {0, 1}, the CATE at x is:
(2) τ(x) = E[Y_i(1) − Y_i(0) | X_i = x]
We assume unconfoundedness, i.e., conditional on X_i, the treatment W_i is independent of the potential outcomes. Because the true treatment effect is not observed, we cannot directly compute the goodness-of-fit criterion for creating splits in a tree. This goodness-of-fit criterion is as follows.
(3) Q_infeasible = E[(τ_i(X_i) − τ̂(X_i))²]
Because τ_i(X_i) is not observed, we follow Athey and Imbens’s (2016) approach to create a transformed outcome Y_i* that represents the true treatment effect. Assume that the treatment indicator W_i is a random variable. Suppose there is a 50% probability for a unit i to be in the treated or the control group; an unbiased estimate of the true treatment effect can then be obtained for that unit using only its outcome Y in the following way. Let
(4) Y_i* = 2Y_i if W_i = 1 and Y_i* = −2Y_i if W_i = 0
It follows that:
(5) E[Y_i*] = 2 · (½E[Y_i(1)] − ½E[Y_i(0)]) = E[τ_i]
Therefore, we can compute the goodness-of-fit criterion for deciding node splits in a causal
tree using the expectation of the transformed outcome (Athey and Imbens 2016). Once we
generate causal trees, we can compute the treatment effect within each leaf because it has a finite
number of observations and standard asymptotics apply within a leaf. The differences in the
treated and control units’ outcomes within each leaf produces the treatment effect in that leaf.
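A minimal numpy simulation, with invented numbers, illustrates why the transformed outcome recovers the average treatment effect even though no individual effect is ever observed:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulation of the transformed-outcome construction (Athey and Imbens 2016)
# under 50% treatment probability. All numbers here are illustrative.
n = 500_000
W = rng.integers(0, 2, n)                    # treatment indicator
tau = rng.normal(-1.5, 1.0, n)               # heterogeneous unit-level effects
Y0 = rng.normal(40, 5, n)                    # potential outcome under control
Y = Y0 + W * tau                             # observed outcome

# Equation (4): Y* = 2Y if treated, -2Y if control.
Y_star = np.where(W == 1, 2 * Y, -2 * Y)

# Equation (5): the mean of Y* is an unbiased estimate of the average
# treatment effect E[tau], even though no tau_i is observed.
print(Y_star.mean(), tau.mean())
```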
Causal Forest Ensemble
In the final step, we create an ensemble of trees using ideas from model averaging and bagging.
Specifically, we take predictions from thousands of trees and average over them (Guo et al.
2018). This step retains the unbiased, honest nature of tree-based estimates but reduces the
variance. The forest averages over the estimates from B trees in the following manner.
(6) τ̂(x) = (1/B) Σ_{b=1}^B τ̂_b(x)
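As a toy sketch of the honest splitting and bagging ideas above, the following code builds an ensemble of depth-1 'honest' causal trees (stumps) on simulated data with a single covariate, then averages the trees' predictions as in equation (6). This illustrates the mechanics only; the data-generating process and effect sizes are hypothetical, and the paper's analysis uses full causal trees on the retailer's data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: one covariate drives effect heterogeneity.
n = 20_000
x = rng.uniform(0, 1, n)
W = rng.integers(0, 2, n)
tau_true = np.where(x < 0.5, -2.0, -0.5)     # hypothetical heterogeneous effect
Y = 40 + 5 * x + W * tau_true + rng.normal(0, 1, n)

def diff_in_means(ids, W, Y):
    # Within-leaf treatment effect: treated mean minus control mean.
    return Y[ids][W[ids] == 1].mean() - Y[ids][W[ids] == 0].mean()

def honest_causal_stump(x, W, Y, rng):
    """Pick the split on one half of the sample; estimate the within-leaf
    treatment effects on the other half (honesty)."""
    idx = rng.permutation(len(x))
    split_half, est_half = idx[: len(x) // 2], idx[len(x) // 2 :]
    best_t, best_gain = 0.5, -np.inf
    for t in np.linspace(0.1, 0.9, 17):
        left = split_half[x[split_half] < t]
        right = split_half[x[split_half] >= t]
        if len(left) < 100 or len(right) < 100:
            continue
        # Split criterion: squared difference in estimated leaf effects.
        gain = (diff_in_means(left, W, Y) - diff_in_means(right, W, Y)) ** 2
        if gain > best_gain:
            best_gain, best_t = gain, t
    # Honest step: leaf effects come from the held-out estimation half.
    left = est_half[x[est_half] < best_t]
    right = est_half[x[est_half] >= best_t]
    return best_t, diff_in_means(left, W, Y), diff_in_means(right, W, Y)

# Bagging: each tree sees a bootstrap resample; predictions are averaged
# across the B trees as in equation (6).
B = 25
stumps = []
for _ in range(B):
    boot = rng.integers(0, n, n)
    stumps.append(honest_causal_stump(x[boot], W[boot], Y[boot], rng))

def forest_cate(x_new):
    preds = [np.where(x_new < t, eff_l, eff_r) for t, eff_l, eff_r in stumps]
    return np.mean(preds, axis=0)

cate = forest_cate(np.array([0.2, 0.8]))     # CATE estimates at two covariate values
```

The ensemble should recover a more negative effect at x = 0.2 than at x = 0.8, mirroring the way the causal forest surfaces shopper-level heterogeneity.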
Because monetary value of purchases is the key outcome variable of interest to the retailer, we
estimate individual level treatment effect on value of purchases for each failure experiencer
separately using the observed covariate data. These covariates include gender and loyalty
program in addition to the three theoretically-driven moderators, namely, value of past
purchases, recency of past purchases, and online buying/digital experience.11 These individual
attributes are important for identifying individual-level effects and for developing targeting
approaches (e.g., Neumann et al. 2019). We use a random sample of two-thirds of our data as
training data and the remaining one-third as test data for predicting CATE. We use half of the
training data to maintain honest estimates and for cross-validation to avoid overfitting. The
results appear in Tables A1 and A2.
Table A1
CAUSAL FOREST RESULTS: SUMMARY OF INDIVIDUAL SHOPPER TREATMENT EFFECTS FOR VALUE OF PURCHASES

               Ntest     Mean      SD
τ̂              45,563    -1.660    1.136
τ̂ | τ̂ < 0      43,748    -1.739    1.089
τ̂ | τ̂ > 0       1,815      .239     .198

Note: τ̂ represents the estimated Conditional Average Treatment Effect (CATE) for each individual in the test data.
11 Age and zip code information were not available for all the shoppers in our data period because the retailer followed strict privacy guidelines.
Table A2
RESULTS OF CAUSAL FOREST POST-HOC CATE REGRESSION FOR VALUE OF PURCHASES

Variable                          Coefficient (Standard Error)
Intercept                         -.958*** (.012)
Past value of purchases            .000*** (.000)
Recency of purchases              -.005*** (.000)
Past online purchase frequency     .037*** (.002)
Gender (female)                   -.190*** (.008)
Loyalty program                   -.340*** (.011)
R squared                          .493

Note: *** p < .001. N = 45,563.
Figure A1 CAUSAL FOREST RESULTS: INDIVIDUAL CATE
Figure A2 CAUSAL FOREST RESULTS: QUINTILES BY CATE
Note: Segment 1 represents shoppers most adversely affected by failure while Segment 5 represents those who are least adversely affected.
WEB APPENDIX B ROBUSTNESS CHECK FOR TABLE 3 (MAIN TREATMENT EFFECT) RESULTS
In this section, we present the results for robustness checks for the main estimation in Table 3 relating to: (a) alternative models with covariates and using Poisson model (Tables B1-B2), (b) outliers (Table B3), (c) existing shoppers (Table B4), (d) alternative measures for prior use of digital channels (Table B5), and (e) regression-discontinuity style analysis (Table B6).
Table B1
ROBUSTNESS OF TABLE 3 RESULTS TO INCLUSION OF COVARIATES ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.025* (.011)            -.063* (.027)           -2.092** (.763)
Failure experiencer                      -.018* (.008)            -.024 (.019)            -.624 (.539)
Post shock                               .180*** (.008)           .238*** (.019)          27.182*** (.549)
Gender                                   -.050*** (.011)          -.112*** (.028)         -3.367*** (.809)
Loyalty program                          -.171*** (.006)          -.416*** (.015)         -8.733*** (.415)
Intercept                                .813*** (.006)           1.678*** (.014)         33.900*** (.395)
R squared                                .0114                    .0081                   .0221
Mean Y                                   .82                      1.61                    43.31

Notes: N = 273,378. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table B2
DID POISSON MODEL RESULTS ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases
Failure experiencer x Post shock (DID)   -.0209* (.0124)          -.0309*** (.0158)
Failure experiencer                      -.0282*** (.0089)        -.0197*** (.0117)
Post shock                               .2133*** (.0089)         .1440*** (.0111)
Intercept                                -.2872*** (.0063)        .4209*** (.0081)
Log pseudo-likelihood                    -378,710                 -711,963
Mean Y                                   .82                      1.61

Notes: Robust standard errors in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
Table B3
ROBUSTNESS OF TABLE 3 RESULTS TO OUTLIER SPENDERS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.023* (.010)            -.055* (.025)           -2.139** (.723)
Failure experiencer                      -.022** (.007)           -.031 (.018)            -.742 (.511)
Post shock                               .184*** (.007)           .256*** (.018)          27.439*** (.519)
Intercept                                .739*** (.005)           1.489*** (.013)         29.907*** (.367)
R squared                                .004                     .001                    .012
Mean Y                                   .81                      1.59                    42.69

Notes: N = 272,706. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table B4
ROBUSTNESS OF TABLE 3 RESULTS TO EXISTING SHOPPERS ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.025* (.01)             -.061* (.026)           -2.283** (.744)
Failure experiencer                      -.021** (.007)           -.029 (.018)            -.684 (.526)
Post shock                               .178*** (.007)           .233*** (.019)          27.191*** (.534)
Intercept                                .766*** (.005)           1.556*** (.013)         31.414*** (.378)
R squared                                .0039                    .0010                   .0181
Mean Y                                   .84                      1.64                    44.07

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001. DID = Difference-in-Differences.
Table B5
ROBUSTNESS OF TABLE 3 RESULTS TO ALTERNATIVE MEASURES OF DIGITAL CHANNEL USE BASED ON APP USE FREQUENCY BEFORE FAILURE

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.244*** (.024)          -.467*** (.064)         -22.284*** (1.837)
DID x Value of past purchases            .000*** (.000)           .000*** (.000)          .019*** (.001)
DID x Recency of purchases               -.003*** (.000)          -.007*** (.001)         -.137*** (.015)
DID x Past app use frequency             -.004 (.006)             -.034* (.017)           1.682*** (.476)
Value of past purchases                  .000*** (.000)           .001*** (.000)          .021*** (.000)
Recency of purchases                     .008*** (.000)           .016*** (.000)          .347*** (.007)
Past app use frequency                   .041*** (.001)           .080*** (.001)          1.535*** (.038)
Failure experiencer                      .088*** (.010)           .226*** (.027)          3.272*** (.788)
Post shock                               .182*** (.010)           .214*** (.027)          35.054*** (.766)
Intercept                                .616*** (.010)           1.059*** (.026)         22.264*** (.744)
R squared                                .2019                    .1508                   .1072
Mean Y                                   .84                      1.64                    44.07

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; each moderator interacts with the difference-in-differences (DID) term failure experiencer x post shock; *** p < .001. The observations include those of shoppers with at least one purchase in the past for computing recency.
Table B6
ROBUSTNESS OF TABLE 3 RESULTS TO REGRESSION DISCONTINUITY STYLE ANALYSIS

Variable                                   Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)     -.045** (.016)           -.09* (.04)             -3.169** (1.167)
Failure experiencer                        -.04** (.012)            -.071* (.029)           -1.261 (.825)
Post shock                                 .178*** (.015)           .231*** (.037)          26.385*** (1.075)
Intercept                                  .759*** (.011)           1.538*** (.026)         31.287*** (.760)
R squared                                  .0032                    .0008                   .0160
Mean Y                                     .80                      1.56                    42.07
Notes: N = 198,432. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
viii
WEB APPENDIX C
ROBUSTNESS CHECKS FOR TABLE 4 (BY CHANNEL) RESULTS
In this section, we present robustness checks for the by-channel estimation in Table 4 relating to (a) alternative models that include covariates or use a Poisson specification (Tables C1-C2), (b) outlier spenders (Table C3), (c) existing shoppers (Table C4), (d) alternative measures of prior digital channel use (Table C5), and (e) a regression discontinuity style analysis (Table C6).
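For reference, the baseline difference-in-differences specification underlying these tables can be written as follows; the notation here is ours (a sketch inferred from the table rows, not the paper's own equation):

```latex
Y_{it} = \beta_0 + \beta_1\,\mathrm{FailureExperiencer}_i
       + \beta_2\,\mathrm{PostShock}_t
       + \beta_3\,(\mathrm{FailureExperiencer}_i \times \mathrm{PostShock}_t)
       + \varepsilon_{it}
```

where $Y_{it}$ is the frequency, quantity, or value of purchases for shopper $i$ in period $t$, and $\beta_3$ is the DID coefficient reported in the first row of each table. The variants below add covariates or moderator interactions, or replace the linear model with a Poisson specification, while keeping this interaction structure.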
Table C1
ROBUSTNESS OF TABLE 4 RESULTS TO INCLUSION OF COVARIATES BY CHANNEL

                                           ----------------- Offline -----------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of          Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases         purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.023* (.010)    -.059* (.026)   -1.967** (.739)   -.002 (.002)    -.003 (.004)   -.125 (.165)
Failure experiencer                        -.015* (.007)    -.019 (.018)    -.431 (.523)      -.003* (.001)   -.006* (.003)  -.194 (.117)
Post shock                                 .171*** (.007)   .222*** (.019)  25.462*** (.532)  .009*** (.001)  .016*** (.003) 1.720*** (.119)
Gender                                     -.05*** (.011)   -.109*** (.028) -3.304*** (.784)  -.001 (.002)    -.002 (.004)   -.063 (.175)
Loyalty program                            -.165*** (.006)  -.404*** (.014) -8.306*** (.403)  -.006*** (.001) -.012*** (.002) -.427*** (.090)
Intercept                                  .775*** (.005)   1.620*** (.014) 32.260*** (.382)  .038*** (.001)  .058*** (.002) 1.640*** (.085)
R squared                                  .0112            .0079           .0208             .0006           .0005          .0018
Mean Y                                     .78              1.56            41.08             .04             .06            2.23
Notes: N = 273,378. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table C2
DID POISSON MODEL RESULTS BY CHANNEL

                                           ----------- Offline ------------     ------------ Online -------------
Variable                                   Frequency of      Quantity of        Frequency of      Quantity of
                                           purchases         purchases          purchases         purchases
Failure experiencer x Post shock (DID)     -.0209* (.0127)   -.0311*** (.016)   -.019 (.0482)     -.0184 (.0643)
Failure experiencer                        -.0253*** (.009)  -.0169*** (.0119)  -.0885** (.0358)  -.0995*** (.0469)
Post shock                                 .2133*** (.0091)  .1401*** (.0113)   .213*** (.0337)   .2453*** (.046)
Intercept                                  -.3363*** (.0065) .3851*** (.0083)   -3.3261*** (.025) -2.9257*** (.033)
Log pseudo-likelihood                      -368,831          -698,829           -46,197           -69,852
Mean Y                                     .78               1.56               .04               .06
Notes: Robust standard errors in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.
Table C3
ROBUSTNESS OF TABLE 4 RESULTS TO OUTLIER SPENDERS

                                           ----------------- Offline -----------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of          Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases         purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.021* (.010)    -.053* (.024)   -2.068** (.701)   -.001 (.002)    -.002 (.004)   -.071 (.156)
Failure experiencer                        -.018** (.007)   -.026 (.017)    -.571 (.496)      -.003* (.001)   -.006* (.003)  -.171 (.110)
Post shock                                 .175*** (.007)   .24*** (.017)   25.727*** (.504)  .009*** (.001)  .016*** (.003) 1.712*** (.112)
Intercept                                  .704*** (.005)   1.437*** (.012) 28.479*** (.356)  .035*** (.001)  .052*** (.002) 1.428*** (.079)
R squared                                  .0042            .0012           .0180             .0000           .0000          .0020
Mean Y                                     .78              1.53            40.51             .04             .06            2.18
Notes: N = 272,706. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table C4
ROBUSTNESS OF TABLE 4 RESULTS TO EXISTING SHOPPERS BY CHANNEL

                                           ----------------- Offline -----------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of          Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases         purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.024* (.010)    -.059* (.025)   -2.166** (.721)   -.002 (.002)    -.003 (.004)   -.117 (.161)
Failure experiencer                        -.018* (.007)    -.024 (.018)    -.515 (.510)      -.003* (.001)   -.005 (.003)   -.169 (.114)
Post shock                                 .170*** (.007)   .218*** (.018)  25.499*** (.518)  .008*** (.001)  .015*** (.003) 1.693*** (.115)
Intercept                                  .730*** (.005)   1.501*** (.013) 29.882*** (.366)  .037*** (.001)  .055*** (.002) 1.532*** (.082)
R squared                                  .0037            .0009           .0169             .0003           .0002          .0016
Mean Y                                     .80              1.58            41.81             .04             .06            2.26
Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
Table C5
ROBUSTNESS OF TABLE 4 RESULTS TO ALTERNATIVE MEASURE OF DIGITAL CHANNEL USE BASED ON APP USAGE FREQUENCY BEFORE FAILURE

                                           ----------------- Offline ------------------    ------------------ Online ------------------
Variable                                   Frequency of     Quantity of     Value of           Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases          purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.216*** (.024)  -.415*** (.063) -19.428*** (1.784) -.029*** (.005) -.052*** (.010) -2.856*** (.425)
DID x Value of past purchases              .000*** (.000)   .000*** (.000)  .017*** (.001)     .000*** (.000)  .000*** (.000) .001*** (.000)
DID x Recency of purchases                 -.003*** (.000)  -.006*** (.001) -.118*** (.015)    .000*** (.000)  .000*** (.000) -.019*** (.004)
DID x Past app use frequency               -.008 (.006)     -.04* (.016)    1.146* (.462)      .005*** (.001)  .007* (.003)   .536*** (.110)
Value of past purchases                    .000*** (.000)   .001*** (.000)  .021*** (.000)     .000*** (.000)  .000*** (.000) .000*** (.000)
Recency of purchases                       .008*** (.000)   .016*** (.000)  .335*** (.007)     .000*** (.000)  .001*** (.000) .012*** (.002)
Past app use frequency                     .037*** (.000)   .073*** (.001)  1.358*** (.037)    .003*** (.000)  .007*** (.000) .177*** (.009)
Failure experiencer                        .087*** (.010)   .222*** (.027)  3.240*** (.765)    .001 (.002)     .004 (.004)    .032 (.182)
Post shock                                 .176*** (.010)   .202*** (.026)  32.959*** (.743)   .006** (.002)   .011* (.004)   2.096*** (.177)
Intercept                                  .583*** (.010)   1.028*** (.025) 21.406*** (.723)   .034*** (.002)  .031*** (.004) .859*** (.172)
R squared                                  .1944            .1450           .1026              .0151           .0146          .0080
Mean Y                                     .80              1.58            41.81              .04             .06            2.26
Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences. The observations include those of shoppers with at least one purchase in the past for computing recency.
Table C6
ROBUSTNESS OF TABLE 4 RESULTS TO REGRESSION DISCONTINUITY STYLE ANALYSIS

                                           ----------------- Offline ------------------    ----------------- Online ------------------
Variable                                   Frequency of     Quantity of     Value of           Frequency of    Quantity of    Value of
                                           purchases        purchases       purchases          purchases       purchases      purchases
Failure experiencer x Post shock (DID)     -.044** (.016)   -.087* (.040)   -3.105** (1.131)   -.001 (.003)    -.003 (.006)   -.064 (.254)
Failure experiencer                        -.034** (.011)   -.062* (.028)   -1.017 (.800)      -.006** (.002)  -.009* (.004)  -.244 (.179)
Post shock                                 .172*** (.015)   .218*** (.036)  24.835*** (1.042)  .006* (.003)    .013* (.005)   1.550*** (.234)
Intercept                                  .720*** (.010)   1.482*** (.026) 29.704*** (.736)   .039*** (.002)  .056*** (.004) 1.583*** (.165)
R squared                                  .0031            .0007           .0150              .0002           .0002          .0014
Mean Y                                     .76              1.51            39.94              .04             .05            2.12
Notes: N = 198,432. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.
WEB APPENDIX D
OTHER ROBUSTNESS CHECKS

Table D1
EFFECTS OF APP FAILURE ON AVERAGE VALUE OF PURCHASES EACH WEEK

Variable            Estimate (Standard error)
Treat x Week -4     -.09 (.28)
Treat x Week -3     -.58* (.25)
Treat x Week -2     -.45 (.24)
Treat x Week -1     -.37 (.25)
Treat x Week 0      -.82* (.37)
Treat x Week 1      -1.70** (.56)
Treat x Week 2      -.58¹ (.31)
Treat x Week 3      -.53¹ (.31)
Treat x Week 4      -.64* (.29)
Intercept           13.19*** (.09)
Mean Y              15.92
Notes: Robust standard errors clustered by shoppers are in parentheses; week and individual fixed effects are included. *** p < .001, ** p < .01, * p < .05, ¹ p < .1. N = 1,366,890. Week 5 in the pre-failure period (Week -5) is the base week.
Table D2
RESULTS OF DID MODEL WITH STACKED ONLINE AND OFFLINE PURCHASES AND CHANNEL DUMMIES

Variable                                   Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)     -.001 (.002)             -.003 (.003)            -.093 (.154)
DID x Channel dummy                        -.021** (.008)           -.052** (.019)          -1.994** (.674)
Failure experiencer                        -.003* (.001)            -.005* (.002)           -.167** (.064)
Post shock                                 .009*** (.001)           .015*** (.002)          1.672*** (.113)
Channel dummy                              .678*** (.005)           1.416*** (.012)         27.755*** (.217)
Failure experiencer x Channel dummy        -.015* (.006)            -.020 (.017)            -.360 (.299)
Post shock x Channel dummy                 .161*** (.006)           .206*** (.014)          23.602*** (.493)
Intercept                                  .036*** (.001)           .054*** (.002)          1.500*** (.048)
R squared                                  .1398                    .0949                   .0911
Mean Y                                     .82                      1.61                    43.31
Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 546,756. DID = Difference-in-Differences. Channel dummy is 1 for offline purchases and 0 for online purchases.
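The stacked model in Table D2 corresponds to a triple-interaction specification with a channel dummy $D_c$ (1 = offline, 0 = online); the notation below is ours, a sketch inferred from the table rows rather than the paper's own equation:

```latex
Y_{ict} = \beta_0 + \beta_1\,\mathrm{FE}_i + \beta_2\,\mathrm{Post}_t
        + \beta_3\,(\mathrm{FE}_i \times \mathrm{Post}_t) + \beta_4\,D_c
        + \beta_5\,(\mathrm{FE}_i \times D_c) + \beta_6\,(\mathrm{Post}_t \times D_c)
        + \beta_7\,(\mathrm{FE}_i \times \mathrm{Post}_t \times D_c) + \varepsilon_{ict}
```

Under this mapping, $\beta_3$ (the DID row) gives the online effect of app failure and $\beta_3 + \beta_7$ (DID plus DID x Channel dummy) gives the offline effect, consistent with the insignificant DID estimates and the negative, significant DID x Channel dummy estimates in Table D2.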
Figure D1. PRE-PERIOD PURCHASE TRENDS FOR FAILURE EXPERIENCERS AND NON-EXPERIENCERS
(a) Past Frequency of Purchases
(b) Past Quantity of Purchases
(c) Past Proportion of Online Purchases
Note: The unit of the X axis is the number of days before the failure event.