    Most preference-elicitation methods that are used to design products and predict market shares

    (such as conjoint analysis) ask respondents to evaluate product descriptions, mostly online.

    However, many of these products are then sold offline. In this paper we ask how well preference-

    elicitation studies conducted online perform when predicting offline consumer evaluation. To

    that end, we conduct two within-subject conjoint studies, one online and one with physical

    products offline. We find that the weights of the product attributes (partworths) are different in

    the online and offline studies, and that these differences might be considerable.

    We propose a model that captures this change in weights and derive an estimator for offline

    parameters based on the individual respondent’s online parameter, and for population-level

    parameters. We demonstrate that such augmentation of online conjoint data with offline data

    leads to significant improvement in both individual prediction and estimation of population-level

    parameters. We also ask respondents to state their uncertainty about product attributes, and we

    find that while respondents anticipate some of the attributes whose weights change, they

    completely miss others. Thus this bias might not be accurately detected through an online study.

    In 2013, online market research accounted for more than 85% of the $10 billion spent on

    quantitative research in the US (ESOMAR 2014).¹ At the same time, overall online sales were

    less than 9% of the $3.2 trillion in total US retail sales (Sehgal 2014). Thus, while most

    consumer products are sold offline, marketing research is mostly done online. The implicit

    assumption is that findings gathered from online marketing research can be used to predict

    offline purchasing behavior. If there are systematic differences between preferences elicited

    online and offline purchase behavior, then these differences may be consequential when firms

    use the results from the research to plan a new product or predict market shares.

    There are a number of behavioral reasons why the evaluation of a physical product may

    differ from the evaluation of its online description, and thus consumers may assign different

    weights to features in the two formats. In general, online and offline channels vary in the types of

    information they convey effectively to consumers and in the consumer’s cost of evaluating them.

    In this paper, we systematically compare consumers’ online and offline product evaluations

    by conducting two within-subject conjoint studies: one online, in which participants evaluate

    product descriptions and pictures, and one offline, with physical products. We chose a messenger

    bag with fully configurable, discrete features as a product that is well suited for a conjoint study.

    We estimated the weights of the product attributes (“partworths”) in a linear compensatory

    model, then compared the partworths obtained from the online and offline formats. Our main

    results are the following:

    - Of the ten partworths parameters estimated, eight changed significantly from the online to
      offline studies.
    - We propose a method of correcting for this online/offline discrepancy, which is based on
      maximizing the conditional likelihood of the task of interest (offline), conditioned on data
      collected from a different task (online). We show that supplementing online conjoint data
      with offline data leads to significant improvement in both individual prediction and
      estimation of population-level parameters.
    - When asked about their uncertainty regarding product attributes, respondents anticipated
      some of the attributes whose weights changed, while completely missing others. Therefore,
      the bias cannot be corrected or even accurately detected through an online study.

    ¹ Qualitative research is still done almost exclusively offline: Of the $3 billion in qualitative research performed in
    the US, 99% is done offline, of which the vast majority is focus groups (ESOMAR 2014).

    Taking into account the difference between the firm’s online preference elicitation and

    offline purchasing behavior of its customers is important for several reasons: The first and most

    obvious one concerns marketing research such as product development or predicting market

    shares of products that will be sold offline. Research firms prefer online research since an offline

    conjoint study is costly (as it might require making physical prototypes) and time-consuming (as

    it requires bringing respondents to offline locations). We demonstrate that supplementing a large

    online conjoint study with data from a smaller group of respondents who complete both an

    online and offline study will give approximately the same level of accuracy as a large (and

    costly) offline study. The data from the smaller group will allow a correction of the online/offline

    discrepancies, which can then be applied to the large group.

    Aside from the potentially misleading predictions generated by online marketing research,

    the discrepancy between online and offline consumer choice behavior has implications for mixed

    as well as online retailers, such as Warby Parker and Zappos. Even when shopping online, many

    consumers engage in “research shopping”, that is, evaluating the product in a brick-and-mortar

    store before purchasing online (Neslin and Shankar 2009; Verhoef, Neslin, and Vroomen 2007).

    For these research shoppers, this discrepancy remains since they would likely use their offline

    evaluations (partworths). For mixed and online retailers whose consumers make a purchase

    decision online but ultimately evaluate and decide to keep the product based on physical

    evaluation upon receiving it, the discrepancy between the two evaluations can lead to increased

    product returns (Dzyabura and Jagabathula 2014). Understanding the discrepancy will allow the

    retailer to better control for returns of purchased products.

    The paper is organized as follows: in the next section we discuss the background of online

    and offline preference elicitation, and the following section describes the conjoint studies we

    conducted, followed by the model and results. In the subsequent section we propose a correction,

    the Inter-task Conditional Likelihood method, and show that it leads to better out-of-sample

    prediction of the offline evaluation than simply using the online data. We discuss the value of

    using stated uncertainty, and conclude with implications of our study.

    Literature Review

    Researchers have developed various methods for estimating consumer preferences based

    on conjoint studies that ask respondents to rate, rank, or choose among several product “profiles”

    or descriptions of the product’s attributes. In a review article, Netzer et al. (2008) present a

    framework for looking into recent contributions to this important marketing research tool: (1) the

    problem to address; (2) the data collection approach; (3) the estimation of a preference model

    and its conversion into action. In this context, our effort is directed at the latter two components

    of the framework: data collection and the estimation (or correction) of the preference model. In

    these two areas, existing work proposes better data collection and estimation techniques to

    improve the reliability of data collected, as well as estimates of parameters. These new

    techniques include adaptive designs to help avoid respondent fatigue by reducing the number of

    questions (e.g. Toubia, Hauser and Simester 2004; Dzyabura and Hauser 2011); incentive

    compatibility to motivate the participants and improve the validity of responses (e.g. Ding

    2007); Bayesian methods to better account for respondent heterogeneity (e.g. Allenby, Arora and

    Ginter 1995); inclusion of subjective attributes (Luo, Kannan, and Ratchford 2008); and

    incorporating non-compensatory decision rules (Gilbride and Allenby 2004; Yee et al. 2007;

    Hauser et al. 2010).

    Several papers introduced the idea of supplementing conjoint estimation with additional

    data that is external to the conjoint study to improve the quality of parameter estimates,

    especially with respect to Bayesian estimation (Yang, Toubia and de Jong 2015; Gilbride, Lenk,

    and Brazzel 2008; Luo, Kannan, and Ratchford 2008; Netzer et al. 2008; Feit, Beltramo, and

    Feinberg 2010; Bradlow 2005; Sandor and Wedel 2005, 2001). Marshall and Bradlow (2002)

    provide a Bayesian approach to combining conjoint data with another data source. In their

    approach, the latter data source is used to form a prior distribution. They demonstrate their

    approach by using respondents’ self-explicated utility weights to form the prior. Along the same

    lines, Dzyabura and Hauser (2011) use a product configurator in conjunction with previous

    survey data to form priors for an adaptive non-compensatory preference-elicitation method. A

    similar approach is taken by Gilbride, Lenk, and Brazzel (2008), who in a choice-based conjoint

    framework argue that for Bayesian estimation, external market shares could be used in order to

    compute the prior distribution for the parameters. The current research follows in the same vein,

    with the offline study providing an additional data source to supplement the online conjoint data.

    One issue that comes up when supplementing conjoint data with data from another source is how

    much weight should be given to each source. In our case, the offline study serves as the

    “external” data, and the weight it receives is proportional to the variance/covariance between

    the online (internal) and offline (external) parameters.

    We contribute to the field of quantitative preference measurement by proposing a method

    to improve the validity of online conjoint studies in predicting offline purchase behavior. We are

    the first to systematically investigate the role of the medium on conjoint estimates. The vast

    majority of conjoint studies are done on the computer with descriptions of hypothetical products’

    attributes (e.g. Allenby, Arora and Ginter 1995; Ding 2007; Evgeniou, Pontil and Toubia 2007;

    Jedidi and Zhang 2002; Lenk et al. 1996). While there have been some efforts at making online

    conjoint more realistic (Dahan and Srinivasan 2000; Berneburg and Horst 2007), conjoint

    literature has not explicitly evaluated whether the typical task format—in which product

    descriptions are shown to the consumer as attribute descriptions on the computer—is

    representative of the way a respondent would behave if evaluating the physical product with the

    same features.

    The full description of the conjoint studies is presented next.

    Experimental design

    In order to systematically compare online and offline product evaluations, we conducted

    two within-subject conjoint studies: one online, in which participants evaluate product

    descriptions, and the other offline, with physical products. A firm that is considering launching a

    new product or a new version of an existing one could use this framework at a prelaunch stage

    with prototypes.

    The choice of the “right” product is important since we wish to have a product that is

    configurable, with discrete attributes, and with just the right price so that the subjects would pay

    full attention to their choice. Timbuk2 messenger bags were chosen for the following reasons:

    (1) they vary on discrete features, some of which are “touch and feel” features for which we

    might expect to see discrepancy between online and offline evaluations; (2) they are fully

    configurable, which allowed us to purchase bags with the aim of creating a balanced orthogonal

    design for the physical conjoint; (3) they are in the right price range, such that they are expensive

    enough for participants to take the decision seriously, but cheap enough that undergraduate

    students might be interested in purchasing them; (4) they are infrequently purchased, such that

    we can expect that many participants would not be familiar with some of the attributes and not

    have well-formed preferences; and finally (5) they are physically small enough for us to be able

    to conduct the study in the behavioral lab.


    Timbuk2’s website offers a full customization option that includes a number of features.

    We selected a subset of attributes that we expected to be

    relevant to the target population and for which there is likely to be some uncertainty on the part

    of consumers and respondents. For example, we excluded the Right- or Left-Handed Strap

    option since respondents would not have any uncertainty with respect to being left- or right-

    handed. In addition, we combined the five color features into one Exterior Design feature that

    has four options. To make the study manageable we reduced the number of levels of some of the

    features. We therefore have the following six attributes for the study:

    - Exterior design (4 options): Black, Blue, Reflective, Colorful
    - Size (2 options): Small (10 x 19 x 14 in), Large (12 x 22 x 15 in)
    - Price (4 levels): $120, $140, $160, $180
    - Strap pad (2 options): Yes, No
    - Water bottle pocket (2 options): Yes, No
    - Interior compartments (3 options): Empty bucket with no dividers, Divider for files,
      Padded laptop compartment

    Since we chose the price variable to be continuous, we have a total of 13 discrete attribute

    levels for the rest of the attributes. Since we set the default for the dummy variables to zero

    (black color, small size, no strap pad, no water bottle pocket, and empty bucket), we are left with

    10 parameters to be estimated (8 discrete, one continuous, and a constant). Using the D-optimal

    study design criterion (Kuhfeld, Tobias and Garratt 1994; Huber and Zwerina 1996), we selected

    a 20-product design that has a D-efficiency of 0.97.
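    The parameter count above can be checked with a short sketch. The attribute levels are transcribed from the list of six attributes; the dictionary keys and the `encode` helper are illustrative names, not part of the study materials, and price is entered in dollars only for illustration, since the text does not state how price was scaled.

```python
# Attribute levels from the study; the first level listed is the baseline (all dummies = 0).
DISCRETE = {
    "exterior": ["Black", "Blue", "Reflective", "Colorful"],
    "size": ["Small", "Large"],
    "strap_pad": ["No", "Yes"],
    "bottle_pocket": ["No", "Yes"],
    "interior": ["Empty bucket", "Divider for files", "Padded laptop compartment"],
}

n_levels = sum(len(levels) for levels in DISCRETE.values())       # discrete attribute levels
n_dummies = sum(len(levels) - 1 for levels in DISCRETE.values())  # after dropping baselines
n_params = n_dummies + 1 + 1                                      # + price + intercept


def encode(bag):
    """Return the design-matrix row [1, dummies..., price] for one bag."""
    row = [1.0]  # intercept
    for attr, levels in DISCRETE.items():
        for level in levels[1:]:  # skip the baseline level
            row.append(1.0 if bag[attr] == level else 0.0)
    row.append(float(bag["price"]))  # price enters as one continuous regressor
    return row


# Hypothetical bag configuration, used only to show the encoding.
example = {"exterior": "Blue", "size": "Large", "strap_pad": "Yes",
           "bottle_pocket": "No", "interior": "Empty bucket", "price": 140}
print(n_levels, n_dummies, n_params, len(encode(example)))  # 13 8 10 10
```

    Dropping one baseline level per discrete attribute is what takes the 13 levels down to 8 dummy parameters; the price slope and the intercept bring the total to 10.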


    We recruited 122 participants from a university subject pool where respondents signed up

    for an individual time slot. Because one of the two studies involved looking at physical bags,

    only one person could participate at a time, so that participants could not influence each other’s preferences.


    To ensure incentive compatibility and promote honest responses, participants were told

    by the experimenter that they would be entered in a raffle for a chance to win a free messenger

    bag. Were they to win, their prize would be a bag that was configured to their preferences, which

    the researchers would infer from the responses they provided in the study. This chance of

    winning a bag provides incentive to participants to take the task seriously and respond truthfully

    with respect to their preferences (Ding 2007, Toubia et al. 2012). We followed the instructions

    used by Ding et al. (2011) and told participants that, were they to win, they would be given a

    messenger bag plus cash, which together would be valued at $180. The cash component was

    intended to eliminate any incentive for the participants to provide higher ratings for more

    expensive items, in order to win a more expensive prize. Respondents were paid $7 to complete a

    30-minute study, plus their chance to win an incentive-aligned prize discussed above. All 122

    participants completed the study, that is, the completion rate was 100%, not unreasonable for a lab study.


    Conjoint task:

    We used a ratings-based task in which respondents rated each bag on a 5-point scale

    (Definitely not buy; Probably not buy; May or may not buy; Probably buy; Definitely buy). We

    chose a ratings-based task rather than a choice-based task because the latter is much more

    complex logistically with physical products, in a study already complex to the participants. Even

    when conducting a conjoint study online, choice tasks take as much or more time than ratings

    tasks (Huber, Ariely and Fischer 2002; Orme, Alpert and Christensen 1997), and produce less

    information than individual product rating tasks (Moore 2004). Conducting choice tasks offline

    would be even more time consuming as the experimenter would have to present the respondent

    with a set of bags, ask the respondent to choose, then present another set of bags, and so on. For

    a comprehensive comparison of ratings-based and choice-based conjoint analysis models, see Moore (2004).


    Online task:

    The online task was conducted using Sawtooth Software. The first screens walked the

    participants through the feature descriptions one by one. After that, respondents were shown a

    practice rating question and were informed that this is for practice and their response to the

    question would be discarded. The following screens presented a single product configuration,

    along with the 5-point scale, and one additional question that was used for another study. An

    example screen shot is shown in Figure 1a. Participants could go back to previous screens if they

    wanted but could not skip a question. Lastly, participants were asked to rate each of the 13

    features with respect to what degree they felt they would need to examine a product with this

    feature to be able to evaluate it. This was measured on a sliding scale marked “Definitely do not

    need to see” to “Definitely need to see”, which correspond to 0 and 100, respectively.

    Figure 1a: Sample online conjoint screen shot

    Figure 1b: Offline task room setup

    Offline task:

    The offline task was conducted in a room separate from the computer lab in which the

    online task had been conducted to ensure that participants could not see the bags while

    completing the online task. This task was done individually, one respondent at a time in the

    room, so as to avoid a contagion effect. The bags were laid out on a conference table, each with a

    card next to it displaying a corresponding number (indexing the item), and the bags were

    arranged in the order 1 through 20 (see Figure 1b). The prices were displayed on stickers on

    Timbuk2 price tags attached to each bag. The experimenter walked the respondents through all

    the features, showing each one on a sample bag.


    In order to investigate whether participants’ preferences differ with the online and offline

    formats, we allow the partworths to vary by respondent, feature, and format. We use the

    following standard specification² for each individual’s rating of each product in each format:

    (1)   $r_{ikt} = \beta_{i0t} + \sum_{j=1}^{J} \beta_{ijt} x_{kj} + \varepsilon_{ikt}$,

    where $r_{ikt}$ is the rating provided by participant $i$ to bag $k$ in task format $t$ (online or offline);

    $\beta_{ijt}$ is the partworth assigned by participant $i$ to feature $j$ in task format $t$; $\beta_{i0t}$ is the intercept;

    and $\varepsilon_{ikt}$ is the error term. Product $k$ is captured by its $J$ attribute levels $x_{k1}, \ldots, x_{kJ}$, where all the attributes are coded as binary

    dummy variables except for the continuous price variable. To capture consumer heterogeneity,

    we fit a linear mixed effects (LME) model to the ratings data. That is, we assume that a

    respondent’s individual partworths are drawn from a multivariate normal distribution:

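    $\beta_i = \begin{bmatrix} \beta_{i,\text{on}} \\ \beta_{i,\text{off}} \end{bmatrix} \sim N(\mu, \Sigma)$,

    where $\beta_{i,\text{on}}$ and $\beta_{i,\text{off}}$ collect respondent $i$'s online and offline partworths, $\mu$ is the vector of

    population means, and $\Sigma$ is the covariance matrix.

    ² For example, Green and Srinivasan 1990, Huber 1997, Huber, Ariely and Fischer 2002, Kalish and Nelson 1991.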

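    To allow for heterogeneity among consumers, we have to estimate the elements of the main

    diagonal of the covariance matrix $\Sigma$, which correspond to $\sigma_{jt}^2$, capturing the population variance of partworths of each

    feature. Because a key construct in this paper is the correlation between a respondent’s online

    and offline partworth for the same feature, we also estimate $\mathrm{cov}(\beta_{ij,\text{on}}, \beta_{ij,\text{off}})$ for all $j$.

    Since the full matrix is of an order of magnitude of $J^2$ (400 in our case), and since we do not

    expect a correlation among different features, we fix at zero the elements of $\Sigma$ that correspond to

    $\mathrm{cov}(\beta_{ijt}, \beta_{ij't'})$ for $j \neq j'$. Thus we assume that the covariance matrix has

    the following structure:

    $\Sigma = \begin{bmatrix} D_{\text{on}} & C \\ C & D_{\text{off}} \end{bmatrix}$,

    where $D_t$ is the diagonal matrix of the variances $\sigma_{jt}^2$ for task format $t$, and $C$ is the diagonal

    matrix of the online/offline covariances $\mathrm{cov}(\beta_{ij,\text{on}}, \beta_{ij,\text{off}})$.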
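    The restricted covariance structure can be sketched in a few lines. The function name, the argument names, and all numeric values below are hypothetical (they are not estimates from the study); the sketch only assumes what the text states, namely that every entry linking two different features is fixed at zero.

```python
import math


def build_sigma(var_on, var_off, corr):
    """Assemble the (2n x 2n) covariance matrix for n parameters per format.

    Rows/columns 0..n-1 are the online partworths, n..2n-1 the offline ones.
    Only variances and same-feature online/offline covariances are non-zero.
    """
    n = len(var_on)
    sigma = [[0.0] * (2 * n) for _ in range(2 * n)]
    for j in range(n):
        cov = corr[j] * math.sqrt(var_on[j] * var_off[j])  # cov = corr * sd_on * sd_off
        sigma[j][j] = var_on[j]
        sigma[n + j][n + j] = var_off[j]
        sigma[j][n + j] = cov
        sigma[n + j][j] = cov
    return sigma


# Hypothetical variances and correlations for 10 parameters per format.
n = 10
sigma = build_sigma([0.4] * n, [0.5] * n, [0.6] * n)

n_entries = len(sigma) * len(sigma[0])                            # all cells of the matrix
n_free = 3 * n                                                    # variances on + off, covariances
n_nonzero = sum(1 for row in sigma for value in row if value != 0.0)
print(n_entries, n_free, n_nonzero)  # 400 30 40
```

    Under this restriction only 30 distinct quantities have to be estimated instead of the full symmetric 20 x 20 matrix, which is the motivation the text gives for fixing the cross-feature terms at zero.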



    We estimate the LME in equation (1) using maximum likelihood, and use these estimates

    for the remainder of the paper.³ The estimates of all features’ fixed effects, $\mu$ (that is, the

    population average feature partworths), are reported in Table 1. The estimates of the population

    partworth variance, $\sigma_{jt}^2$, and online-offline correlations, $\mathrm{corr}(\beta_{ij,\text{on}}, \beta_{ij,\text{off}})$, are reported in

    Table 2.

    ³ Note that while choice-based conjoint traditionally requires more complex methods, such as MCMC, to estimate the
    choice models, ratings tasks can be estimated using classical methods.

    Table 1: Mean population partworths ($\mu$)

    Attribute               Level                      Online      Offline     Difference
                                                       partworth   partworth
    Exterior design         Reflective                 -0.31**     -0.60**     -0.28*
                            Colorful                   -1.06**     -0.71**      0.36**
                            Blue                       -0.22**     -0.11       -0.12
    Size                    Large                       0.27**     -0.31**     -0.58**
    Price                   $120, $140, $160, $180     -0.22**     -0.15**      0.06**
    Strap pad               Yes                         0.51**      0.25**     -0.26**
    Water bottle pocket     Yes                         0.45**      0.17**     -0.28**
    Interior compartments   Divider for files           0.41**      0.52**      0.11
                            Crater laptop sleeve        0.62**      0.88**      0.26**
                            Empty bucket/no dividers
    Intercept                                           3.72**      3.39**     -0.33


    Because the partworth of one level of each attribute is normalized to zero, these values

    always represent a comparison to the default level. For example, the negative values of the

    exterior designs signify that, at the population level, Black is the preferred design.

    To appreciate the magnitude of these differences, we calculated the willingness to pay for

    the attributes using the methodology of Ofek and Srinivasan (2002). The resultant median

    willingness to pay for Strap pad is $43 online and $31 offline; for Water bottle pocket the WTP

    is $40 online and exactly half ($20) offline. These represent considerable differences if the firm

    is to base its pricing on these findings.

    Large population standard deviations signify a great deal of heterogeneity among the

    respondents in their preference for the attribute. For example, we can see that there is large

    variation in respondents’ preference for Colorful, while there is a relative consensus on Strap

    Pad. Also note that the preferences for Reflective and Colorful are more heterogeneous offline

    than online. The value of the correlation is a measure of how systematic the bias is. If the

    correlation is high, it suggests that if there is an online/offline discrepancy, it is systematic across

    respondents. In the extreme case, every respondent’s online partworth estimate would differ from

    its offline counterpart by a constant. If the correlation is low, then a respondent’s online

    partworth is not a good predictor of her offline partworth.
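    The role of the correlation can be illustrated with the textbook conditional mean of a bivariate normal, $E[\beta_{\text{off}} \mid \beta_{\text{on}}] = \mu_{\text{off}} + \rho (\sigma_{\text{off}} / \sigma_{\text{on}})(\beta_{\text{on}} - \mu_{\text{on}})$. The sketch below illustrates that property only and is not the correction proposed in this paper; the population means are the Strap pad values from Table 1, while the standard deviations, the correlations, and the respondent's online partworth are hypothetical.

```python
def conditional_offline_mean(beta_on, mu_on, mu_off, sd_on, sd_off, rho):
    """E[beta_off | beta_on] for a bivariate normal pair of partworths."""
    return mu_off + rho * (sd_off / sd_on) * (beta_on - mu_on)


# Strap pad population means from Table 1; everything else is hypothetical.
mu_on, mu_off = 0.51, 0.25
beta_on = 0.91  # a respondent whose online partworth is 0.40 above the online mean

high = conditional_offline_mean(beta_on, mu_on, mu_off, sd_on=0.5, sd_off=0.5, rho=0.9)
low = conditional_offline_mean(beta_on, mu_on, mu_off, sd_on=0.5, sd_off=0.5, rho=0.1)
print(round(high, 2), round(low, 2))  # 0.61 0.29
```

    With a high correlation the expected offline partworth moves almost one-for-one with the respondent's online partworth; with a low correlation it stays close to the offline population mean, so the online response carries little individual-level information.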

    Our first main result is that the population-level estimates of most features differ by task

    format, and some of the differences are large, suggesting a systematic bias that is being introduced by using online

    preference elicitation. This is a major issue if the aim is to make predictions in the offline

    environment based on online market research. Both aggregate-level predictions such as market

    shares and individual predictions such as segmentation or targeting that are based on online

    preference elicitation would be incorrect.

    Several changes in attributes’ partworths are worth noting:

    - The single attribute that did not change significantly from the online to offline

    scenario is the color Blue. This is likely because the Blue color can be accurately

    evaluated based on the image provided in the online task, and Color is a very salient

    attribute in both conditions.

    - The decrease of the partworths of Water bottle pocket and Strap pad is substantial.

    This may be attributed to those attributes being made more salient in the online

    condition, when they are stated verbally; offline, on the other hand, they may be

    overlooked by participants altogether.

    - Only one attribute, Size, changed sign. Thus online respondents preferred the larger

    bag but changed their preference to the smaller one once they physically examined

    the bags.

    - The fact that the intercept’s value does not change implies that there is no feature that

    changes upon physical examination that is common to all the bags, such as the

    material used.

    Since our first main result indicates a substantial online/offline discrepancy in the estimates of the

    parameters of interest, we next propose a method to correct for this bias to improve predictions

    about consumer offline purchase behavior.

    Improving predictions of offline purchase behavior

    When conducting marketing research with the purpose of product design or market share

    prediction, firms typically conduct online conjoint studies with large representative samples of

    participants, but with the intent of predicting purchase behavior that will take place offline. Our

    results in the previous section provide strong evidence of a discrepancy in partworths

    measurement between online and offline evaluations. Consequently, if the product is sold

    primarily in brick-and-mortar stores, an online only conjoint study is not sufficient and an offline

    conjoint task is required to obtain more accurate predictions. However, conducting large offline

    conjoint studies is costly because it involves evaluation of physical products as opposed to online

    descriptions. There are a few exceptions, such as Luo, Kannan and Ratchford (2008) and She and

    Macdonald (2013).

    We propose to address the above challenge by (a) supplementing a large online conjoint

    study with a sample of respondents who complete both the online and offline tasks and (b)

    designing a correction that we term the Inter-task Conditional Likelihood correction (ICL),

    which uses the supplemented data to improve the prediction of the respondents’ offline purchase behavior.

    We trade off the accuracy of our predictions against the cost of data collection by asking a

    small number of the respondents from a large conjoint task to complete both the online and

    offline conjoint tasks. The number of respondents chosen determines this trade-off.

    The structure of the resulting data set is illustrated in Figure 2, where the shaded area

    corresponds to the observed data and the un-shaded area corresponds to the missing data that are

    of interest.

    Figure 2: Data split into estimation and prediction

    [Figure omitted. Panel labels: Training Sample (50%); Hold-out Sample (50%); Online data used for estimation; Offline prediction task]

    We use the data from the respondents who completed both the online and offline conjoint

    tasks to infer the correlations between the online and offline partworths. We then use these

    correlations to infer the missing offline ratings for the other respondents. In order to carry out the

    inference, we design the ICL correction, based on maximizing the conditional likelihood of the

    task of interest (offline), conditioned on data collected from a different task (online). It is

    designed to exploit the correlations that exist between the online and offline partworths, and

    therefore, its success depends on the extent of the correlation. We illustrate the ICL method on

    the data from the conjoint task described above. Our results demonstrate that the ICL method

    works well for (1) the prediction of individual-level offline ratings and (2) the accurate

    estimation of the population preference distribution. A key result from our experiments is that

    collecting offline and online data for a small number of respondents can result in substantial

    improvements in the prediction accuracies.

    (1) Individual-level offline ratings prediction

    We illustrate our method in the context of the Timbuk2 conjoint study described above.

    Because we have collected within-subject data on both the online and offline ratings, we are able

    to hold out the offline ratings for a subset of respondents (validation set) and to predict them by

    applying the ICL to their online data. To that end, we hold out the offline ratings for a subset of

    our sample, $I_{\text{val}}$, and predict them using the model trained on just these respondents’

    online data. We use the remainder of the sample, $I_{\text{train}}$, to train our model of online and

    offline partworths, detailed below, which results in an individual-level estimate of the offline

    partworth, $\hat{\beta}_{ij,\text{off}}$, for respondent $i$ and attribute $j$. We demonstrate that the proposed ICL

    correction significantly improves out-of-sample prediction.

    We consider two prediction tasks: (a) offline ratings for the 20 bags and (b) individual-

    level offline partworths. Depending on the application, one or the other prediction task may be

    more relevant. For instance, if the goal were to predict market shares, then the ability to

    accurately predict offline ratings is more valuable, whereas if the goal were to segment the

    market based on consumer preferences, then accurately predicting partworths is of greater value.

    We quantify the prediction gains from our techniques in terms of the popular root mean square

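    error (RMSE) metric. We report the RMSE metrics for the two prediction tasks: $\mathrm{RMSE}_{\text{ratings}}$ for

    the offline ratings prediction task and $\mathrm{RMSE}_{\text{partworths}}$ for the partworths prediction task. The

    metrics are defined as follows:

    $\mathrm{RMSE}_{\text{ratings}} = \frac{1}{|I_{\text{val}}|} \sum_{i \in I_{\text{val}}} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( r_{ik,\text{off}} - \hat{r}_{ik,\text{off}} \right)^2}$

    $\mathrm{RMSE}_{\text{partworths}} = \frac{1}{|I_{\text{val}}|} \sum_{i \in I_{\text{val}}} \sqrt{\frac{1}{J} \sum_{j=1}^{J} \left( \beta_{ij,\text{off}} - \hat{\beta}_{ij,\text{off}} \right)^2}$

    where $I_{\text{val}}$ is the set of respondents used for validation, $K$ is the number of bags, $J$ is the number

    of partworths, and $\hat{r}_{ik,\text{off}}$ is the predicted

    offline rating of respondent $i$ for bag $k$ computed according to the individual-level estimate of

    the offline partworth, $\hat{\beta}_{ij,\text{off}}$, for respondent $i$ and attribute $j$.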
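    The two metrics can be sketched as follows, assuming the conventional form in which a root mean square error is computed per respondent and then averaged over the validation set. The function names and the toy numbers are hypothetical and serve only to show the computation.

```python
import math


def predict_ratings(intercept, partworths, design):
    """Predicted rating of each bag: intercept + sum_j partworth_j * x_kj."""
    return [intercept + sum(b * x for b, x in zip(partworths, row)) for row in design]


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mean_rmse(actual_by_resp, predicted_by_resp):
    """Average the per-respondent RMSE over the validation respondents."""
    scores = [rmse(actual_by_resp[i], predicted_by_resp[i]) for i in actual_by_resp]
    return sum(scores) / len(scores)


# Toy example: two dummy attributes, three bags, two validation respondents.
design = [[0, 0], [1, 0], [1, 1]]
actual = {"r1": [3, 4, 2], "r2": [4, 4, 5]}
predicted = {
    "r1": predict_ratings(3.0, [1.0, -1.0], design),  # [3.0, 4.0, 3.0]
    "r2": predict_ratings(4.0, [0.0, 1.0], design),   # [4.0, 4.0, 5.0]
}
print(round(mean_rmse(actual, predicted), 3))  # 0.289
```

    The same `rmse` helper applies to the partworths metric by passing a respondent's offline partworths and their predicted counterparts instead of ratings.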

    Table 3 compares the performance of the ICL method against three different benchmarks,

    described below. The reported metrics were averaged over 50 random partitions of the data set

    into 50% training and 50% validation sets to remove any data partitioning effects. We first

    describe the three different benchmark methods and then describe the ICL method in detail.
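    The repeated-partition procedure can be sketched as below. The helper names, the fixed seed, and the placeholder `evaluate` callback are assumptions made for illustration; in an actual analysis `evaluate` would fit the model on the training respondents and return an RMSE for the validation respondents.

```python
import random


def random_partitions(respondent_ids, n_partitions=50, train_share=0.5, seed=0):
    """Yield (train, validation) lists for repeated random splits of the sample."""
    rng = random.Random(seed)  # seed is an assumption, used only for reproducibility
    ids = list(respondent_ids)
    n_train = int(len(ids) * train_share)
    for _ in range(n_partitions):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


def average_metric(respondent_ids, evaluate, **kwargs):
    """Average evaluate(train, validation) over all random partitions."""
    scores = [evaluate(tr, va) for tr, va in random_partitions(respondent_ids, **kwargs)]
    return sum(scores) / len(scores)


# 122 respondents as in the study; the placeholder callback returns the
# validation-set size instead of fitting a model.
result = average_metric(range(122), lambda train, val: len(val))
print(result)  # 61.0
```

    Averaging over many partitions keeps a single lucky or unlucky split from driving the reported metric.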

    Table 3: Out-of-sample predictive performance (lower is better)

                          ICL     Online data   Partial offline data   Full offline data
    RMSE (ratings)        1.04    1.56          1.18                   0.51
    RMSE (partworths)     0.07    0.21          0.24                   0

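    Full offline data: The first benchmark method provides a lower bound (or the best

    performance) on the RMSE metric we can expect. For each individual in the validation set, we

    set $\hat{\beta}_{ij,\text{off}} = \beta^{\text{OLS}}_{ij,\text{off}}$ and $\hat{r}_{ik,\text{off}} = \beta^{\text{OLS}}_{i0,\text{off}} + \sum_{j=1}^{J} \beta^{\text{OLS}}_{ij,\text{off}} x_{kj}$, where $\beta^{\text{OLS}}_{i,\text{off}}$

    is obtained by running the ordinary least squares (OLS) method on the offline ratings data. It can be

    seen that $\mathrm{RMSE}_{\text{partworths}} = 0$ and that $\mathrm{RMSE}_{\text{ratings}}$ is the lowest RMSE we can get with a linear

    model. This method uses data not part of our training sample, but provides us with the best

    performance we can hope to achieve. Table 3 reports $\mathrm{RMSE}_{\text{ratings}} = 0.51$ for this method.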

    Online only data: This method captures the current practice: regardless of the channel in which

    the product is to be sold (offline or online), the research is conducted online. For each

    respondent, we predict her offline bag ratings using partworths trained on her online conjoint

    task, i.e., we use $\hat{\beta}_{ij,\text{off}} = \beta_{ij,\text{on}}$. This results in predicted offline ratings $\hat{r}_{ik,\text{off}}$,

    which are computed according to the online estimated partworths $\beta_{ij,\text{on}}$:

    $\hat{r}_{ik,\text{off}} = \beta_{i0,\text{on}} + \sum_{j=1}^{J} \beta_{ij,\text{on}} x_{kj}$

    Because this method corresponds to the way conjoint studies are typically conducted, it is a

    reasonable baseline model. We note from Table 3 that the RMSE metrics of this method are

    and – significantly higher than the lower bound

    from the full offline data. Further, it is clear from the results in Table 3 that we can get

    significant improvements – 33% and 67% for ratings and partworths predictions respectively –

    over the current practice from incorporating offline data through the ICL method. These

    improvements quantify the benefits of our method over the current practice.

    Partial offline data: For this method, we used only the offline ratings in the training data.
    Specifically, we trained a linear mixed effects model as described in the section above, but only
    on the offline ratings collected from the participants in the training data. We then used the
    estimated population-level parameters as our estimates for the individual-level offline
    partworths, i.e., $\hat{\beta}^{\text{off}}_{ij} = \hat{\mu}^{\text{off}}_{j}$. These partworths
    were then used to predict the offline ratings for the participants in the test data set as follows:

    $$\hat{y}^{\text{off}}_{ik} = \sum_j \hat{\mu}^{\text{off}}_{j} x_{kj}$$

    Note that because we are not using the online observations of the participants, our
    partworth and ratings predictions for each product are the same across all the participants. We
    note from Table 3 that the RMSE metrics are $\text{RMSE}_{\text{ratings}} = 1.18$ and
    $\text{RMSE}_{\text{partworths}} = 0.24$.

    We observe that the RMSE for the ratings prediction is significantly (about 24%) lower when

    compared to the current practice (the online only data benchmark). This finding is indeed very

    surprising because it suggests that predicting offline ratings using population-level parameters

    and no individual-level information can, on average, be more accurate than actually asking the

    participants for their online ratings! This finding reveals the level of discrepancy that exists

    between the online and offline partworths in the messenger bag setting. Further, it suggests that
    the firm may in fact obtain a more accurate understanding of customers' offline purchase
    behavior from a small offline conjoint study than from a large online conjoint study alone.

    We also note that despite outperforming the current practice, the RMSE metrics of this method
    are still higher than those of the ICL correction method, which lowers them by a further 12%
    and 71% for ratings and partworths predictions, respectively. This suggests that a combination
    of online and offline conjoint data can outperform either offline or online data alone.

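    ICL correction: The ICL method computes the expected offline ratings conditioned on all the
    observed data. We exploit the properties of multivariate normal distributions in order to compute
    the conditional expectations in closed form. Specifically, recall that we assume that each
    respondent samples the online and offline partworths, $\beta^{\text{on}}_{ij}$ and
    $\beta^{\text{off}}_{ij}$, jointly from a bivariate normal distribution:

    $$\begin{pmatrix} \beta^{\text{on}}_{ij} \\ \beta^{\text{off}}_{ij} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu^{\text{on}}_{j} \\ \mu^{\text{off}}_{j} \end{pmatrix}, \begin{pmatrix} (\sigma^{\text{on}}_{j})^2 & \text{cov}(\beta^{\text{on}}_{j}, \beta^{\text{off}}_{j}) \\ \text{cov}(\beta^{\text{on}}_{j}, \beta^{\text{off}}_{j}) & (\sigma^{\text{off}}_{j})^2 \end{pmatrix} \right)$$

    where $\text{cov}(\beta^{\text{on}}_{j}, \beta^{\text{off}}_{j})$ is the covariance between the
    online and offline partworths of attribute j.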

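    We use observed data to determine the maximum likelihood estimates of the population-level
    parameters $\mu^{\text{on}}_{j}$, $\mu^{\text{off}}_{j}$, $\sigma^{\text{on}}_{j}$,
    $\sigma^{\text{off}}_{j}$, and $\text{cov}(\beta^{\text{on}}_{j}, \beta^{\text{off}}_{j})$ for each
    attribute j. Note that the data from the group of respondents who completed both the online and
    offline tasks allows us to infer the covariance parameters. Given the population-level parameters,
    we can show that the conditional distribution of $\beta^{\text{off}}_{ij}$ given
    $\beta^{\text{on}}_{ij}$ is a normal distribution, with mean $\mu^{\text{off}|\text{on}}_{ij}$ and
    variance $(\sigma^{\text{off}|\text{on}}_{j})^2$ that are given by

    $$\mu^{\text{off}|\text{on}}_{ij} = \mu^{\text{off}}_{j} + \rho_j \frac{\sigma^{\text{off}}_{j}}{\sigma^{\text{on}}_{j}} \left( \beta^{\text{on}}_{ij} - \mu^{\text{on}}_{j} \right)$$

    $$\sigma^{\text{off}|\text{on}}_{j} = \sigma^{\text{off}}_{j} \sqrt{1 - \rho_j^2}$$

    where $\rho_j = \text{cov}(\beta^{\text{on}}_{j}, \beta^{\text{off}}_{j}) / (\sigma^{\text{on}}_{j} \sigma^{\text{off}}_{j})$
    is the correlation between the online and offline partworths of attribute j.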

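    Note that $\mu^{\text{off}|\text{on}}_{ij}$ is also the maximum likelihood estimator of
    $\beta^{\text{off}}_{ij}$, because of normality. Therefore, under this model, the maximum
    likelihood estimates of a respondent's offline partworths are given by:

    $$\hat{\beta}^{\text{off}}_{ij} = \hat{\mu}^{\text{off}}_{j} + \hat{\rho}_j \frac{\hat{\sigma}^{\text{off}}_{j}}{\hat{\sigma}^{\text{on}}_{j}} \left( \beta^{\text{on}}_{ij} - \hat{\mu}^{\text{on}}_{j} \right)$$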

    Because we have a closed-form expression for the correction, computing it is straightforward
    once the population-level parameters have been estimated. Using the ICL correction on our data,
    we obtain $\text{RMSE}_{\text{ratings}} = 1.04$ and $\text{RMSE}_{\text{partworths}} = 0.07$.
    These are substantial improvements over the uncorrected methods.
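    The closed-form correction can be sketched for a single attribute as follows. This is a
    simplified illustration, not the authors' estimation procedure: the function and key names are
    hypothetical, and the individual partworths are treated as directly observed, whereas the paper
    estimates the population-level parameters by maximum likelihood from the ratings data.

```python
import numpy as np


def fit_population(beta_on, beta_off):
    """Population parameters for one attribute, from respondents who did both tasks.

    Uses moment formulas with ddof=0, which coincide with the maximum likelihood
    estimates of a bivariate normal when the partworths are observed directly.
    """
    beta_on = np.asarray(beta_on, dtype=float)
    beta_off = np.asarray(beta_off, dtype=float)
    mu_on, mu_off = beta_on.mean(), beta_off.mean()
    sd_on, sd_off = beta_on.std(), beta_off.std()
    cov = np.mean((beta_on - mu_on) * (beta_off - mu_off))
    rho = cov / (sd_on * sd_off)  # assumes both standard deviations are nonzero
    return {"mu_on": mu_on, "mu_off": mu_off, "sd_on": sd_on, "sd_off": sd_off, "rho": rho}


def icl_correct(beta_on_new, p):
    """Conditional mean of the offline partworth given an online partworth."""
    beta_on_new = np.asarray(beta_on_new, dtype=float)
    return p["mu_off"] + p["rho"] * (p["sd_off"] / p["sd_on"]) * (beta_on_new - p["mu_on"])


def conditional_sd(p):
    """Standard deviation of the offline partworth given the online partworth."""
    return p["sd_off"] * np.sqrt(max(0.0, 1.0 - p["rho"] ** 2))
```

    When the estimated correlation is close to zero the corrected value collapses to the offline
    population mean, and when it is close to one the respondent's online partworth is shifted and
    rescaled almost deterministically.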

    (2) Estimation of population-level parameters

    We now investigate the ability of our proposed method to obtain accurate estimates of

    population-level parameters. As above, we compare the performance of the ICL method to the

    three different benchmarks. Population-level partworth parameters allow us to gain an intuitive

    understanding of the population’s preference for a certain attribute (e.g., Allenby et al. 2014).

    We assume that individual partworth vectors are drawn from a multivariate normal distribution,

    as described above. Under this assumption, we determine the maximum likelihood estimates of

    the population mean and variance of the partworth distribution for each attribute for each of the

    three different benchmark methods and the ICL correction method. We quantify the difference in

    their performances using the Kullback-Leibler (KL) divergence metric (Kullback and Leibler

    1951). Specifically, using the full offline data, we compute the ground-truth estimates of the

    mean and variance parameters. We then compute the KL divergence between the distributions

    estimated using each of the methods and the ground-truth distributions obtained from the full

    offline data. The KL-divergence metrics are reported in Table 4. As is clear from the table, the
    estimates resulting from the ICL are close to the ground-truth full offline data estimates
    consistently across all the attributes, lending more support to our proposition of supplementing
    online data with offline data. In particular, the estimates obtained by ICL significantly
    outperform those obtained from online only data on all attributes except for the color Blue.
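    For reference, the KL divergence between two univariate normal distributions has a standard
    closed form, sketched below. The function name is hypothetical, and the text does not state
    which direction of the divergence was used or how zero-variance estimates were handled, so this
    should be read as the textbook formula rather than the exact computation behind Table 4.

```python
import math


def kl_normal(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for P = N(mu_p, var_p) and Q = N(mu_q, var_q).

    Both variances must be strictly positive; the divergence is undefined
    (infinite) when either distribution is degenerate.
    """
    if var_p <= 0 or var_q <= 0:
        raise ValueError("variances must be strictly positive")
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
```

    The divergence is zero only when the two distributions coincide, and it is not symmetric in
    its arguments, so the choice of reference distribution matters when comparing methods.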

    Since the offline conjoint task is challenging and costly to conduct, we explore whether it

    can be avoided altogether. Specifically, consumers might have intuitions about which attribute

    preferences have a greater possibility of changing from the online to the offline environment. We

    tackle this issue next.

    Table 4: Population-level partworth estimates and KL divergence relative to the offline estimates

                                                   Offline estimates  ICL corrected estimates     Online only estimates
    Attribute               Level                  Mean    Variance   Mean    Variance   KL       Mean    Variance   KL
    Exterior design         Reflective             -0.6    0.7        -0.61   0.72       0.010    -0.31   0.29       0.146
                            Colorful               -0.71   0.91       -0.7    1.03       0.007    -1.06   0.88       0.073
                            Blue                   -0.11   0.25       -0.11   0.16       0.043    -0.22   0.16       0.042
    Size                    Large                  -0.31   0.28       -0.3    0.31       0.022    0.27    0.18       1.021
    Price                   $120-$180              -0.15   0          -0.16   0          0.081    -0.22   0          0.421
    Strap pad               Yes                    0.25    0.15       0.27    0.14       0.011    0.51    0.11       0.212
    Water bottle pocket     Yes                    0.17    0.05       0.15    0.05       0.068    0.45    0.07       0.619
    Interior compartments   Divider for files      0.52    0.04       0.48    0.05       0.065    0.41    0.06       0.149
                            Crater laptop sleeve   0.88    0.2        0.83    0.31       0.018    0.62    0.25       0.143
                            Empty bucket
    Intercept                                      3.55               3.44                        3.72

  • Stated Uncertainty

    It is possible that consumers are aware that they cannot judge the value of certain attributes

    with accuracy. If the online/offline discrepancy occurs to people many times in different

    categories, it is possible that consumers have learned to anticipate changing personal preferences.

    In that case, we can improve our decision making by asking consumers to self-state the need to

    examine each attribute physically in order to accurately judge it. Note that this consumer belief
    uncertainty is different from the magnitude of the variance around the parameter estimate, which
    represents the researcher's uncertainty.

    In the online task, after rating all the bags, participants were asked to state their certainty

    about how well they could judge each feature from the online description. The exact wording of

    the question was: “Some of the bag features may be clear to you simply from the description

    provided online. Other features you may want to physically examine before making your final

    decision. Please rate the following features on how useful it would be for you to examine a

    product with this feature in a store.” Each feature was then listed, with a sliding scale ranging

    from “Definitely don’t need to see in store” to “Definitely need to see in store”, which

    corresponded to uncertainty ratings of 0 and 100, respectively. Price was excluded. While these

    scales are not an objective measure of variance and the quantities reported should not be

    interpreted in isolation, their relative values are meaningful. Comparing stated uncertainty with

    the (absolute) difference between online and offline partworths, we find that stated uncertainty is

    a rather poor predictor of the changes that occur. Population averages for the stated uncertainties

    are given in Table 5, along with the corresponding features’ online-offline partworth differences.

    Note that uncertainty was measured for all attribute levels, while partworths were normalized to

    zero for one of the attribute levels.

    It is clear that while participants can anticipate some of the attributes that will change, such

    as Size and the Laptop sleeve, they miss others, such as Water bottle pocket, Strap pad, and

    Colorful. Indeed, the correlation of the stated uncertainty and the absolute value of the difference

    (for the features for which we have the difference) is rather low (36.8%). Moreover, this

    correlation is driven almost exclusively by size: If we compute the correlation of all attributes

    excluding size, the resultant correlation disappears ( ).

    Table 5: Stated uncertainty and online/offline discrepancies

    Attribute               Level                      Stated uncertainty   Difference in partworths
    Exterior design         Reflective                 49.6                 -0.30
                            Colorful                   49.5                 0.36
                            Blue                       42.4                 0.08
                            Black                      43.5
    Size                    Large                      75.7                 -0.54
                            Small                      75.4
    Strap pad               Yes                        58.5                 -0.29
                            No                         30.3
    Water bottle pocket     Yes                        45.6                 -0.29
                            No                         24.4
    Interior compartments   Divider for files          63.0                 0.14
                            Crater laptop sleeve       70.2                 0.25
                            Empty bucket/no dividers   30.9

    Conclusions and Implications

    In this work we challenged the implicit assumption commonly made in market research

    that findings collected from online research can be used to accurately predict offline behavior.

    Consumers’ product evaluations from an online conjoint study with verbal product descriptions

    as well as pictures were compared to an offline study with physical products. We found that the

    vast majority of partworth parameters changed significantly from online to offline studies. To

    correct for this disparity, we offered a method based on maximizing the likelihood of the offline

    task, conditional on data collected from the online task. We showed that this estimator leads to

    better out-of-sample prediction than using uncorrected online data.

    In this paper we used primary data in order to carefully control for all factors and zero in on

    the online/offline distinction. But the higher-level problem of predicting a consumer’s offline

    preferences, given the same consumer’s online preferences, and other consumers’ online and

    offline preferences has implications beyond online preference elicitation. Consider an online

    retailer, such as Warby Parker, Zappos, or Bonobos. When consumers purchase from these

    retailers, they decide what to order based on their online evaluation of the available items.

    However, once they receive their order, they determine what they want to keep based on physical

    evaluation. These and other online retailers typically have a very generous returns policy, so that

    customers may try on several items before purchasing one. Warby Parker even offers a free

    “Home Try-On” program in which customers may order several eyeglass frames to try at home,

    return all, and then order the prescription lens to go with the chosen frames. Thus, the firm has

    some data on both online and offline preferences for customers who have a history with Warby

    Parker. When a potential new customer (who has not yet evaluated the firm’s products

    physically) orders some items, the firm knows only the online behavior. In a sense the retailer

    has more information than the single consumer. In addition to this website visitor's
    online-preference data, the retailer can use all the information gathered from current
    customers, including their online and offline preferences, to estimate the offline
    preference of the new customer. Note that the data scheme given in Figure 3 is similar in nature

    to the one used in the data split in Figure 2, since the data used for estimation and prediction are

    broken along the exact same lines.


    Figure 3: Schematic data available to a typical online retailer
    [Figure: columns labeled "Existing customers" and "New customers"; labels "Online data used for estimation" and "Offline prediction task".]

    In the area of recommendation systems, a common problem structure is that substantial
    amounts of data are available on some customers while only very sparse data are available on
    others, and the former are used to improve predictions about the latter. In our case, the missing piece

    is the offline product evaluation. As we have demonstrated, relying only on a customer’s online

    preferences to make predictions about his or her offline preferences may be unreliable. However,

    having access to both types of preferences for the existing set of customers enables the retailer to

    make a better prediction. An online retailer, through programs such as Warby Parker’s Home

    Try-On Program, may implement an offline recommendation system by including some

    suggested items for a user who is ordering online.

