

Nonparametric Pricing Analytics with CustomerCovariates

Ningyuan Chen∗1 and Guillermo Gallego†2

1Rotman School of Management, University of Toronto
2Department of Industrial Engineering & Decision Analytics

The Hong Kong University of Science and Technology

Abstract

Personalized pricing analytics is becoming an essential tool in retailing. Upon observing the personalized information of each arriving customer, the firm needs to set a price accordingly, based on covariates such as income, education background, and past purchasing history, to extract more revenue. For new entrants to the business, the lack of historical data may severely limit the power and profitability of personalized pricing. We propose a nonparametric pricing policy that simultaneously learns the preferences of customers based on the covariates and maximizes the expected revenue over a finite horizon. The policy does not depend on any prior assumptions about how the personalized information affects consumers' preferences (such as linear models). It adaptively splits the covariate space into smaller bins (hyper-rectangles) and clusters customers based on their covariates and preferences, offering similar prices to customers who belong to the same cluster, trading off granularity and accuracy. We show that the algorithm achieves a regret of order O(log(T)^2 T^((2+d)/(4+d))), where T is the length of the horizon and d is the dimension of the covariate. It improves the current regret in the literature

[email protected][email protected]

arXiv:1805.01136v3 [cs.LG] 16 Feb 2020


(Slivkins, 2014), under mild technical conditions in the pricing context (smoothness and local concavity). We also prove that no policy can achieve a regret less than O(T^((2+d)/(4+d))) for a particular instance and thus demonstrate the near optimality of the proposed policy.

Keywords: multi-armed bandit, dynamic pricing, online learning, regret analysis, contextual information

1. Introduction

Personalized pricing refers to the practice that a firm charges customers different prices for the same product, depending on customers' information such as education backgrounds and zip codes. It is increasingly popular in online retailing, as sellers can acquire/infer the personalized information from customers' account profiles or browsing histories (cookies). The demand (purchasing probability) of each customer depends not only on the price, but also on the personalized information. The firm observes the information of each arriving customer and sets a personalized price accordingly. We are interested in finding pricing policies of the firm that maximize the long-run revenue.

Personalized pricing presents several challenges to the firm. First, for new entrants to the online business, the demand function and how it depends on the personalized information are generally unknown. Thus, the optimal pricing cannot be obtained by directly solving an optimization problem. The firm may experiment with different prices to learn the personalized demand function, and then set optimal prices according to the estimation. There is usually a finite horizon that forces a trade-off between gathering more information (learning/exploration) and making sound decisions (earning/exploitation). This problem, sometimes referred to as the learning/earning dilemma, has attracted the attention of many researchers.

A second challenge is the presence of personalized information, or covariates. On one hand, the covariate of each customer provides extra information for the firm to predict the personalized demand more accurately. On the other hand, the demand is peculiar to each instance and changes over time, which adds significant complexity to the learning problem described above. In particular, learning a market-wise demand function is not sufficient, and the firm has to learn the personalized demand by experimenting with


prices for customers of similar profiles.

A third challenge is the selection of a predictive model to estimate the demand.

Consider a toy example: the demand function only depends on the address of each customer and nothing else. The firm may postulate a linear model

demand = a − b × price + c · (latitude, longitude),

and use historical sales data to learn the parameters a, b and c, and maximize revenue according to the estimation. However, whether the model is specified correctly plays an important role in the performance of the pricing policy. In this particular example, linearity in the location (latitude and longitude) implies that all customers along a straight line orthogonal to c have the same demand. This hardly reflects reality, as customers from the same neighborhood tend to have similar shopping patterns and neighborhoods are usually clustered geographically. By postulating a parametric (linear) model, the firm faces the risk of misspecification and not learning what is supposed to be learned. A nonparametric model is more appropriate in this setting.
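To make the misspecification concrete, the following sketch (the locations, constants, and demand function are hypothetical illustrations, not from the paper) generates demand driven by two geographic clusters and fits the linear model above at a fixed price; the residual error is what linearity cannot capture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical "neighborhoods" with distinct purchase propensities.
centers = np.array([[0.2, 0.8], [0.7, 0.3]])
base_demand = np.array([0.9, 0.3])

def true_demand(loc, price):
    """Demand driven by the nearest neighborhood, not linearly by location."""
    k = np.argmin(np.linalg.norm(centers - loc, axis=1))
    return float(np.clip(base_demand[k] - 0.4 * price, 0.0, 1.0))

# Fit the misspecified linear model demand = a - b*price + c.(lat, lon)
# at a fixed price, and measure the residual error.
locs = rng.random((500, 2))
y = np.array([true_demand(x, 0.5) for x in locs])
A = np.column_stack([np.ones(len(locs)), locs])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
rmse = float(np.sqrt(np.mean((y - A @ coef) ** 2)))
print(rmse)  # substantial residual: clustered demand is not linear in location
```

The linear fit can only separate the two clusters along one direction, so a sizeable residual remains no matter how much data is collected.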

In this paper we study the pricing policy of a firm which tries to maximize the unknown expected revenue f(x, p) ≜ p·d(x, p), where p is the price and d(x, p) ∈ [0, 1] is the personalized demand function for a customer of covariate x. We assume both quantities are normalized so that x ∈ [0, 1)^d and p ∈ [0, 1]. In period t, the firm observes an arriving customer with a random covariate X_t, and sets a personalized price p_t. The earned revenue is p_t multiplied by a Bernoulli random variable with success rate d(X_t, p_t), representing the event of a purchase. The expected revenue is thus f(X_t, p_t).

We propose a nonparametric learning policy for the firm. That is, the policy does not depend on specific forms of f(x, p) and only assumes general structures such as continuity and smoothness. The policy achieves near-optimal performance compared to a clairvoyant who knows f(x, p) and sets p∗(x) = argmax_p f(x, p) for a customer of covariate x. More precisely, the expected difference in total revenues between the proposed policy and the clairvoyant policy, which is referred to as the regret in the literature, grows at O(log(T)^2 T^((2+d)/(4+d))) as T → ∞. The rate is sublinear in T, implying that when the length of horizon tends to infinity, the average regret incurred per period becomes negligible. Moreover, we prove that no pricing policies can achieve a lower


regret than O(T^((2+d)/(4+d))) for a reasonable class of unknown objective functions f. Therefore, we successfully work out the learning/earning dilemma with covariates.

The main contribution of the paper is the design of a near-optimal nonparametric learning policy for the personalized pricing problem. Nonparametric learning policies are introduced in more general settings by Rigollet and Zeevi (2010); Perchet and Rigollet (2013); Slivkins (2014). The formulation in this paper is originally introduced in Slivkins (2014), and our policy builds upon the idea of adaptive binning proposed in Perchet and Rigollet (2013)¹. Motivated by the application of personalized pricing, we assume that the expected revenue f(x, p) is smooth and locally concave in the charged price p. This deviates from the Lipschitz continuous condition in Slivkins (2014) and can be viewed as a special "margin condition" in Rigollet and Zeevi (2010); Perchet and Rigollet (2013). By utilizing this condition, we are able to show that our policy is near-optimal and achieves improved regret over merely continuous objective functions (T^((2+d)/(4+d)) versus T^((2+d)/(3+d)) in Slivkins (2014)).

1.1. Literature Review

This paper is motivated by the recent literature that analyzes a firm's pricing problem when the demand function is unknown (e.g. Besbes and Zeevi, 2009; Araman and Caldentey, 2009; Farias and Van Roy, 2010; Broder and Rusmevichientong, 2012; den Boer and Zwart, 2014; Keskin and Zeevi, 2014; Cheung et al., 2017). den Boer (2015) provides a comprehensive survey of this area. Since the firm does not know the optimal price, it has to experiment with different (suboptimal) prices and update its belief about the underlying demand function. Therefore, the firm has to balance the exploration/exploitation trade-off, which is usually referred to as the learning-and-earning problem in this line of literature. Our paper considers the pricing problem with personalized information and does not consider the finite-inventory setting as in some of the papers mentioned above.

More recently, several papers investigate the pricing problem with unknown demand in the presence of covariates (Nambiar et al., 2016; Qiang and Bayati, 2016; Javanmard and Nazerzadeh, 2016; Cohen et al., 2016; Ban and Keskin, 2017). The existing literature has adopted a parametric approach: for example, the actual demand can be expressed in

¹Perchet and Rigollet (2013) study discrete decision variables (multi-armed bandit).


a linear form αᵀx + βᵀx·p + ϵ (Qiang and Bayati, 2016; Ban and Keskin, 2017), where x is the feature vector of a customer, α and β are vectorized coefficients, and ϵ is the random noise. Because of the parametric form, a key ingredient in the design of the algorithms in this line of literature is to plug in an estimator for the unknown parameters (α and β) in addition to some form of forced exploration. In contrast, we focus on a setting where the demand function cannot be parametrized. Thus, the firm cannot count on accurately estimating the function globally by estimating a few parameters. Instead, a localized optimal decision has to be made based on past covariates generated in the neighborhood. This highlights the different philosophies when designing algorithms for parametric/nonparametric learning problems with covariates. As a result, the best achievable regret deteriorates from O(√T) or O(log T) (parametric) to O(T^((2+d)/(4+d))) (nonparametric).

The dependence of the optimal rate of regret on the problem dimension d has been observed before. For example, Cohen et al. (2016) find a multi-dimensional binary search algorithm for feature-based dynamic pricing, which has regret O(d² log(T/d)); Javanmard and Nazerzadeh (2016) propose a policy for a similar problem that achieves regret O(s log d log T), where s represents the sparsity of the d features; in Ban and Keskin (2017), the near-optimal policy achieves regret O(s√T). In their parametric frameworks, the dependence of the regret on d is rather mild: it does not appear in the exponent of T. Javanmard and Nazerzadeh (2016); Keskin and Zeevi (2014) also provide methods to deal with the sparse structure. In contrast, in our nonparametric formulation, the optimal rate of regret O(T^((2+d)/(4+d))) increases dramatically in d, making the problem significantly harder to learn in high dimensions. This is similar to the nonparametric formulation in the network revenue management problem (Besbes and Zeevi, 2012), in which the dimension of the decision space is d and the optimal rate of regret is O(T^((2+d)/(3+d)))². From the literature, it seems that the dimension significantly complicates the learning problem in a nonparametric formulation.
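The two nonparametric exponents can be compared numerically for a few dimensions (plain arithmetic, nothing model-specific):

```python
# Regret exponents of T: (2+d)/(4+d) under our Assumption 3 versus
# (2+d)/(3+d) for merely Lipschitz objectives. Both approach 1 as d grows,
# so regret becomes nearly linear in T in high dimensions.
for d in (1, 2, 5, 10, 50):
    ours, lipschitz = (2 + d) / (4 + d), (2 + d) / (3 + d)
    print(d, round(ours, 3), round(lipschitz, 3))
```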

This paper is also related to the vast literature studying multi-armed bandit problems. See Cesa-Bianchi and Lugosi (2006); Bubeck and Cesa-Bianchi (2012) for a comprehensive survey. The classic multi-armed bandit problem involves finitely many arms, and the algorithms (Kuleshov and Precup, 2014; Agrawal and Goyal, 2012) cannot be applied

²It is shown in Chen and Gallego (2018) that if the number of inventory constraints is much smaller than d, then learning the dual variables may effectively reduce the problem dimension.


directly to our setting. Recently, there is a stream of literature studying the so-called continuum-armed bandit problems (Agrawal, 1995; Kleinberg, 2005; Auer et al., 2007; Kleinberg et al., 2008; Bubeck et al., 2011), in which there are an infinite number of arms (decisions). Although there is no contextual information in those papers, Kleinberg et al. (2008); Bubeck et al. (2011) have developed algorithms based on an idea similar to decision trees, because the potential arms form a high-dimensional space.

For multi-armed bandit problems with contextual information, parametric and regression-based algorithms have been proposed in, for example, Goldenshluger and Zeevi (2013); Bastani and Bayati (2015). Our paper is related to the literature studying contextual multi-armed bandit problems in a nonparametric framework (Yang et al., 2002; Langford and Zhang, 2008; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013; Slivkins, 2014; Elmachtoub et al., 2017). The analysis builds upon the idea of adaptive binning in Perchet and Rigollet (2013). Our algorithm is designed for continuous decisions. In fact, applying the algorithm in Perchet and Rigollet (2013), designed for discrete decisions, to our problem with simple discretization may result in worse-than-optimal regret. In terms of the formulation and the rate of regret, this paper is closely related to Slivkins (2014). Slivkins (2014) investigates a more general model, in which the decision p can be a vector, and assumes that f(x, p) is Lipschitz continuous. The optimal rate of regret in this setting has been shown to be T^((1+dx+dp)/(2+dx+dp)) in previous works, where dx and dp are the dimensions of x and p, respectively. Slivkins (2014) introduces an adaptive zooming algorithm, and uses the covering dimension of the space (x, p) in the analysis. The extension accommodates more general spaces of (x, p) than the Euclidean space. For Euclidean spaces, the algorithm recovers the optimal rate of regret T^((1+dx+dp)/(2+dx+dp)). In our setting, letting dx = d and dp = 1 leads to the rate T^((2+d)/(3+d)).³ Motivated by personalized pricing, we impose additional assumptions on f(x, p) (smoothness and local concavity, see Assumption 3), and improve the optimal rate to T^((2+d)/(4+d)) as a result. The additional assumption may act as a special margin condition (Tsybakov et al., 2004; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013) and affects the optimal rate. It is unclear whether the algorithm in Slivkins (2014) could be adapted to accommodate the additional assumptions and achieve the improved rate. The design of our algorithm is based on adaptively partitioning the covariate space into

³Applying our assumptions and following Equation (8) of Slivkins (2014) give dx + dp = d + 1/2, and the regret improves to T^((1.5+d)/(2.5+d)).


rectangular bins, rather than overlapping balls as in Slivkins (2014).

2. Problem Formulation

Suppose the personalized demand function (purchasing probability) is d(x, p) ∈ [0, 1], where x ∈ [0, 1)^d is the observed feature vector, or covariate, summarizing the customer's personalized information, and p is the price set by the firm. Since the purchasing event is a Bernoulli random variable with success rate d(x, p) when the firm sets price p for a customer of covariate x (henceforth abbreviated to customer x), the expected revenue is thus f(x, p) ≜ p·d(x, p). If the firm knew d(x, p), or equivalently, f(x, p), then it would set p∗(x) ≜ argmax_{p∈[0,1]} f(x, p). Denote the optimal expected revenue from customer x by f∗(x) ≜ max_{p∈[0,1]} f(x, p). We will be primarily dealing with the expected revenue f(x, p) instead of the personalized demand d(x, p).

Initially, neither d(x, p) nor f(x, p) is known to the firm. In period t ∈ {1, 2, …, T}, a customer arrives with covariate X_t. Upon observing X_t, the firm sets a price p_t. The revenue earned in period t is denoted by Z_t, where Z_t/p_t is a Bernoulli random variable with mean d(X_t, p_t), independent of everything else. The objective of the firm is to design a pricing policy to maximize the total revenue over the horizon, ∑_{t=1}^T E[Z_t] = ∑_{t=1}^T E[f(X_t, p_t)]. Note that p_t itself is likely to be random even though the firm is not adopting a randomized policy. This is because the pricing decision made in period t may depend on the observed customers, set prices, and earned revenues in the previous periods. That is, p_t = π_t(X_1, p_1, Z_1, …, X_{t−1}, p_{t−1}, Z_{t−1}, X_t). Formally, we refer to π as the pricing policy that determines how p_t depends on the past information. We also denote F_t ≜ σ(X_1, p_1, Z_1, …, X_t, p_t, Z_t).
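The feedback in one period can be sketched as follows; the demand function here is an arbitrary stand-in for the unknown d(x, p), not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)

def demand(x, p):
    """Stand-in for the unknown personalized demand d(x, p) in [0, 1]."""
    return float(np.clip(0.8 - 0.5 * p + 0.3 * x.mean(), 0.0, 1.0))

d = 2
x_t = rng.random(d)                      # covariate X_t, i.i.d. on [0, 1)^d
p_t = 0.6                                # price chosen by the policy
buy = rng.random() < demand(x_t, p_t)    # Bernoulli purchase event
z_t = p_t * buy                          # revenue Z_t, with E[Z_t] = f(X_t, p_t)
print(z_t)
```

Each period the policy sees only X_t and the realized revenue Z_t ∈ {0, p_t}, never the function d itself.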

2.1. Regret

To measure the performance of a pricing policy, it is standard in the literature to benchmark it against the so-called clairvoyant policy and study the regret. Suppose f(x, p) is known to a clairvoyant firm. The optimal pricing policy is rather straightforward for a clairvoyant: having observed customer X_t, set p∗(X_t) in period t and earn a random revenue with mean f∗(X_t).

For the firm, the expected revenue in period t cannot exceed that of the clairvoyant:


f(X_t, p_t) ≤ f∗(X_t). Thus, we define the regret of a pricing policy π to be the revenue gap

R_π(T) = ∑_{t=1}^T E[f∗(X_t) − f(X_t, p_t)].

In period t, the expectation is taken with respect to the distribution of X_t as well as p_t, which itself depends on F_{t−1} and X_t. Our goal is to design a policy π that achieves small R_π(T) when T → ∞.
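As a concrete (hypothetical) illustration of this gap, the sketch below estimates by Monte Carlo the per-period regret of a naive fixed-price policy against the clairvoyant, for an assumed revenue function f(x, p) = p·max(x − 0.5p, 0) with a scalar covariate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative choice (not the paper's model): f(x, p) = p * max(x - 0.5 p, 0),
# whose clairvoyant price is p*(x) = x on [0, 1).
def f(x, p):
    return p * np.clip(x - 0.5 * p, 0.0, None)

T = 10_000
X = rng.random(T)
p_star = X                            # clairvoyant personalized prices
fixed = 0.5                           # naive non-personalized policy
regret = float(np.sum(f(X, p_star) - f(X, fixed)))
print(regret / T)                     # positive per-period gap
```

The per-period gap does not vanish for the fixed price, whereas a learning policy with sublinear regret drives it to zero as T grows.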

However, because R_π(T) also depends on the unknown function f, we require the designed policy to perform well for a family C of functions, i.e., we seek optimal policies in terms of the minimax regret

inf_π sup_{f∈C} R_π(T).

Although it is usually impossible to find the exact policy that achieves the minimax regret, we focus on proposing a policy whose regret is at least comparable to (of the same order as) the minimax regret asymptotically when T → ∞.

In this paper, we study functions f that cannot be parametrized. This implies that the family C is much larger than parametric families: it includes all the functions that satisfy some mild assumptions presented in the next section. In other words, the worst-case scenario can potentially be much worse than for a parametric family. As a result, the achievable minimax regret is also higher.

2.2. Assumptions

In this section, we formally provide the set of assumptions that f ∈ C and the stochastic process have to satisfy, along with their justifications.

Assumption 1. The covariates X_t are i.i.d. for t = 1, …, T. Given X_t and p_t, the revenue Z_t is independent of everything else.

Both i.i.d. covariates and independent noise structure are standard in the literature.

Assumption 2. The functions f(·, p) and f(x, ·) are Lipschitz continuous given p and x, i.e., there exists M1 > 0 such that |f(x1, p) − f(x2, p)| ≤ M1‖x1 − x2‖2 and |f(x, p1) − f(x, p2)| ≤ M1|p1 − p2| for all xi and pi (i = 1, 2) in the domain.


This assumption is equivalent to |f(x1, p1) − f(x2, p2)| ≤ M1(‖x1 − x2‖2 + |p1 − p2|). Lipschitz continuity is a common assumption in the learning literature. In personalized pricing, it implies that the expected revenues are close if the firm charges similar prices for two customers with similar covariates. If this assumption fails, then the historical sales data of a certain type of customer is not informative for a new customer with an almost identical background, and learning is virtually impossible.

To introduce the next assumption, consider any hyper-rectangle B ⊂ [0, 1)^d, including a singleton B = {x}. Define f_B(p) ≜ E[f(X, p) | X ∈ B] for p ∈ [0, 1]. Clearly f_B(p) is the expected revenue when charging p for a customer that is sampled from the subset B.

Assumption 3. We assume that for any B,

1. The function f_B(p) has a unique maximizer p∗(B) ∈ [0, 1]. Moreover, there exist uniform constants M2, M3 > 0 such that for all p ∈ [0, 1], M2(p∗(B) − p)² ≤ f_B(p∗(B)) − f_B(p) ≤ M3(p∗(B) − p)².

2. The maximizer p∗(B) is inside the interval [inf{p∗(x) : x ∈ B}, sup{p∗(x) : x ∈ B}].

3. Let d_B be the diameter of B. Then there exists a uniform constant M4 > 0 such that sup{p∗(x) : x ∈ B} − inf{p∗(x) : x ∈ B} ≤ M4·d_B.

This assumption is quite different from those in the setting without covariates (Besbes and Zeevi, 2009; Wang et al., 2014; Lei et al., 2017) or the parametric setting (Ban and Keskin, 2017; Qiang and Bayati, 2016). To explain the intuition of f_B(p), suppose the firm only observes I{X∈B} but not the exact value of X, and thus cannot apply personalized pricing for customers in B. In this case, the learning objective is the expected revenue f_B(p), and the clairvoyant policy that has the knowledge of f(x, p) is to set p = p∗(B). This class of learning problems are important subroutines of the algorithm we propose, and Assumption 3 guarantees that they can be effectively learned.

For part one of Assumption 3, we have

Proposition 1. If f_B(p) is continuous for p ∈ [0, 1], and twice differentiable in an open interval containing the unique global maximizer p∗(B) with f_B''(p∗(B)) < 0, then part one of Assumption 3 holds.


Therefore, part one states that f_B(p) is smooth and locally concave around the maximum. If B is a singleton, then it can be viewed as a weaker version of the concavity assumption in Wang et al. (2014); Lei et al. (2017), i.e., 0 > a > f''(p) > b for all p in their no-covariate setting. As a result, if f_B(p) is the revenue function of linear or exponential demand, then part one is satisfied automatically.
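For instance, for linear demand d(p) = a − bp (the constants below are illustrative), the quadratic sandwich in part one holds exactly with M2 = M3 = b, since f_B(p∗) − f_B(p) = b(p∗ − p)². A quick numerical check:

```python
import numpy as np

a, b = 1.0, 0.8                       # illustrative linear-demand constants
f = lambda p: p * (a - b * p)         # revenue under d(p) = a - b p
p_star = a / (2 * b)                  # unique maximizer, here 0.625 in [0, 1]

ps = np.linspace(0.0, 1.0, 101)
gap = f(p_star) - f(ps)
quad = b * (p_star - ps) ** 2
print(float(np.max(np.abs(gap - quad))))   # ~0: gap is exactly b (p* - p)^2
```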

Part two of Assumption 3 prevents the following scenario: if the optimal price p∗(B) for the aggregate demand of customers x ∈ B is far from the optimal personalized price p∗(x), then collecting more information for f_B(p) does not help to improve the pricing decision for any individual customer x ∈ B. Such an obstacle may lead to a failure to learn and is thus ruled out by the assumption. Similar types of assumptions have been imposed in other applications of revenue management. For example, Proposition 1.16 in Gallego and Topaloglu (2018) provides conditions under which the optimal price of the aggregated market lies in the convex hull formed by the optimal prices of each market segment when discriminatory pricing is allowed.

Part three imposes a continuity condition for the optimal price. It is equivalent to, for example, some form of continuous differentiability of f(x, p), because p∗(x) solves the implicit function from the first-order condition f_p(x, p) = 0.

Remark 1. Assumption 2 is a variant of similar assumptions adopted in the literature. Assumption 3, although appearing nonstandard, is also satisfied by the parametric families studied by previous works. We give a few examples that satisfy Assumption 3.

• Dynamic pricing with linear covariate (Qiang and Bayati, 2016): if f(x, p) = p(θᵀx − αp), then f_B(p) = p(θᵀE[X | X ∈ B] − αp) and p∗(B) = θᵀE[X | X ∈ B]/(2α).

• Separable function: consider f(x, p) = ∑_{i=1}^k g_i(x)h_i(p). Then f_B(p) = ∑_{i=1}^k E[g_i(X) | X ∈ B] h_i(p). If h_i(p) are concave functions and g_i(x) are positive, then we may be able to solve the unique maximizer p∗(B) = E[g(X) | X ∈ B] for some continuous function g.

• Localized functions: the covariate only plays a role in a subset B0 ⊂ [0, 1)^d. See Section 5 for a concrete example.
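The first example can be checked numerically. In the sketch below, θ, α, and the bin are illustrative choices, and X is taken uniform on the bin so that E[X | X ∈ B] is the bin's midpoint:

```python
import numpy as np

rng = np.random.default_rng(3)

theta, alpha = np.array([0.6, 0.9]), 1.0
lo, hi = np.array([0.25, 0.5]), np.array([0.5, 0.75])    # a bin B

# Monte Carlo estimate of f_B(p) = E[p (theta^T X - alpha p) | X in B].
X = rng.uniform(lo, hi, size=(200_000, 2))
m_hat = float(np.mean(X @ theta))
ps = np.linspace(0.0, 1.0, 1001)
p_hat = float(ps[np.argmax(ps * (m_hat - alpha * ps))])

# Closed form from the example: p*(B) = theta^T E[X | X in B] / (2 alpha),
# with E[X | X in B] the midpoint of the bin under the uniform distribution.
p_closed = float(theta @ ((lo + hi) / 2) / (2 * alpha))
print(p_hat, p_closed)
```

The grid maximizer of the estimated f_B agrees with the closed form up to Monte Carlo and discretization error.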

As we shall see in Sections 4 and 5, the optimal rate of regret under Assumption 3 is T^((2+d)/(4+d)), in contrast to T^((2+d)/(3+d)) without it (taking dY = 1 in Equation (3) of Slivkins 2014). Technically, we suspect that Assumption 3 plays a similar role to the margin


condition in the contextual bandit literature (Tsybakov et al., 2004; Goldenshluger et al., 2009; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013). However, because of the continuous decision variable studied in this paper, the margin condition cannot be translated in a straightforward way. It remains a future direction to present the assumption in a general form (with the degree of smoothness specified by, e.g., Hölder and Sobolev classes) and study how it affects the optimal rate of regret.

We summarize the information available to the firm. At the beginning of the horizon, the length of the horizon T, the dimension d, and the constants {M_i}_{i=1}^4 are revealed⁴. In period t, the price can also depend on F_{t−1} and X_t.

3. The ABE Algorithm

We next present a set of preliminary concepts related to the bins of the covariate space, and then introduce the proposed pricing policy: the Adaptive Binning and Exploration (ABE) algorithm.

3.1. Preliminary Concepts

Definition 1. A bin is a hyper-rectangle in the covariate space. More precisely, a bin is of the form

B = {x : a_i ≤ x_i < b_i, i = 1, …, d}

for 0 ≤ a_i < b_i ≤ 1, i = 1, …, d.

We can split a bin B by bisecting it in all d dimensions to obtain 2^d child bins of B, all of equal size. For a bin B with boundaries a_i and b_i for i = 1, …, d, its children are indexed by i ∈ {0, 1}^d and have the form

B_i = {x : a_j ≤ x_j < (a_j + b_j)/2 if i_j = 0, (a_j + b_j)/2 ≤ x_j < b_j if i_j = 1, j = 1, …, d}.

Denote the set of child bins of B by C(B). Conversely, for any B′ ∈ C(B), we refer to B as the parent bin of B′, denoted by P(B′) = B.

⁴In fact, only M2 is needed in the algorithm.
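The splitting rule can be sketched as follows (a straightforward implementation of the bisection in Definition 1, not code from the paper):

```python
from itertools import product

def split_bin(lo, hi):
    """Split the hyper-rectangle [lo, hi) by bisecting every dimension,
    yielding the 2^d equal-size child bins indexed by i in {0,1}^d."""
    d = len(lo)
    mid = [(lo[j] + hi[j]) / 2 for j in range(d)]
    children = {}
    for i in product((0, 1), repeat=d):
        c_lo = [lo[j] if i[j] == 0 else mid[j] for j in range(d)]
        c_hi = [mid[j] if i[j] == 0 else hi[j] for j in range(d)]
        children[i] = (c_lo, c_hi)
    return children

# Splitting the root bin [0, 1)^2 gives the four quadrants.
kids = split_bin([0.0, 0.0], [1.0, 1.0])
print(len(kids), kids[(0, 0)])
```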


Our algorithm starts with a root bin B∅ ≜ [0, 1)^d, which contains all possible customers, and successively splits the bin as more data is collected. Therefore, any bin B produced during the process is an offspring of B∅, i.e., P^(k)(B) = B∅ for some k > 0, where P^(k) is the kth composition of the parent function. Equivalently, B∅ is an ancestor of B. For such a bin, we define its level to be k, denoted by l(B) = k. Conventionally, let l(B∅) = 0.

In the algorithm, we keep a dynamic partition P_t of the covariate space consisting of the offspring of B∅ in each period t. The partition is mutually exclusive and collectively exhaustive, so B_i ∩ B_j = ∅ for B_i, B_j ∈ P_t, and ∪_{B_i∈P_t} B_i = B∅. Initially P_0 = {B∅}. In the algorithm, we gradually refine the partition; that is, each bin in P_{t+1} has an ancestor (or itself) in P_t.

3.2. Intuition

The intuition behind the ABE algorithm is to use a partition P_t of the covariate space to aggregate customers in each period. It tries to find the optimal price for customers in each bin B ∈ P_t, i.e., p∗(B) defined in Section 2.2. As P_t is refined dynamically, i.e., as B ∈ P_t becomes smaller, such aggregation is almost identical to personalized pricing.

To do that, we keep a set of discrete prices (referred to as the decision set hereafter) for each bin in the partition. The decision set consists of equally spaced grid points of a price interval associated with the bin. When a customer arrives with covariate X_t inside a bin B, a price is chosen successively from the decision set and charged to the customer. The realized revenue for this price is recorded. When a large number of customers have been observed in B, the average revenue for each price p in the decision set is close to f_B(p), which is defined as E[f(X, p) | X ∈ B] in Section 2.2. Therefore, the empirically-optimal price in the decision set is close to p∗(B) with high confidence.
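The per-bin bookkeeping described here can be sketched as follows; the bin-level revenue curve f_B below is an illustrative stand-in, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(4)

# Cycle through an N-point price grid for one bin, track the running average
# revenue of each price, and read off the empirically-optimal price.
N = 8
grid = np.linspace(0.2, 0.8, N)             # decision set for one bin
sums = np.zeros(N)
counts = np.zeros(N)

def f_B(p):                                 # stand-in for E[f(X, p) | X in B]
    return p * max(0.9 - p, 0.0)

for t in range(40_000):
    j = t % N                               # prices chosen successively
    p = grid[j]
    z = p * (rng.random() < f_B(p) / p)     # Bernoulli revenue with mean f_B(p)
    sums[j] += z
    counts[j] += 1

p_emp = float(grid[np.argmax(sums / counts)])
print(p_emp)    # near argmax_p f_B(p) = 0.45
```

With enough observations per grid point, the empirical maximizer concentrates around the best grid price, up to the discretization error of the grid.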

There are two potential pitfalls of this approach. First, the number of prices has an impact on the performance of the policy. If there are too many prices in a decision set, then for a given number of customers observed in the bin, each price is experimented with relatively few times. As a result, the confidence interval for the associated average revenue is wide. On the other hand, if there are too few prices, then inevitably the decision set has low resolution. That is, the optimal price in the set could still be far from the true maximizer p∗(B) because of the discretization error. We have to select


a proper size of the decision set to balance this trade-off.

Second, even if the optimal price p∗(B) for the aggregate revenue in the bin is

correctly identified, it may not be a strong indicator for p∗(x) for a particular customer x ∈ B. Indeed, f_B(p) averages out all customers X ∈ B, and the optimal price for an individual customer x could be very different. This obstacle, however, can be overcome as the size of B decreases, as implied by Assumption 3. In particular, parts two and three of the assumption guarantee that when B is small, p∗(x) is concentrated within a neighborhood of p∗(B) as long as x ∈ B. The cost of using a smaller bin, however, is the lower frequency of observing a customer inside it.

To remedy the second pitfall, the algorithm adaptively refines the partition and decreases the size of the bins in P_t as t increases. When a bin B ∈ P_t is large, the aggregate optimal price p∗(B) is not a strong indicator for p∗(x), x ∈ B. As a result, we only need a rough estimate, and we split the bin when a relatively small number of customers have been observed in B. When a bin B ∈ P_t is small, the optimal price p∗(B) provides a strong indicator for p∗(x), x ∈ B. Therefore, we gather a large amount of sales data from customers X ∈ B to explore the decision set and estimate p∗(B) accurately before it splits.

A crucial step in the algorithm is to determine what information to inherit when a bin is split into child bins. The ABE algorithm records the empirically-optimal price in the decision set of the parent bin. In the child bins, we use this information and center their decision sets at it. As explained above, when the parent bin (and thus each child bin) is large, its optimal price does not predict those of the child bins well. Therefore, the algorithm sets up conservative decision sets for the child bins, i.e., decision sets with wide intervals. On the other hand, when the parent bin is small, its optimal price provides an accurate indicator for those of the child bins. Thus, the algorithm constructs decision sets with narrow ranges for the child bins around the inherited empirically-optimal price.

3.3. Description of the Algorithm

In this section, we elaborate on the detailed steps of the ABE algorithm, shown in Algorithm 1.

The parameters for the algorithm include


Algorithm 1 Adaptive Binning and Exploration (ABE)
1: Input: T, d
2: Constants: M1, M2, M3, M4
3: Parameters: K; Δ_k, n_k, N_k for k = 0, ..., K
4: Initialize: partition P ← {B∅}, p^{B∅}_l ← 0, p^{B∅}_u ← 1, δ_{B∅} ← 1/(N0 − 1), Ȳ_{B∅,j}, N_{B∅,j} ← 0 for j = 0, ..., N0 − 1
5: for t = 1 to T do
6:   Observe Xt
7:   B ← {B ∈ P : Xt ∈ B}  ▷ The bin in the partition that Xt belongs to
8:   k ← l(B), N(B) ← N(B) + 1  ▷ Determine the level and update the number of customers observed in B
9:   if k < K then  ▷ If not reaching the maximal level K
10:     if N(B) < n_k then  ▷ If not enough data observed in B
11:       j ← N(B) − 1 (mod N_k)  ▷ Apply the jth price in the decision set
12:       p_t ← p^B_l + j δ_B; apply p_t and observe Z_t
13:       Ȳ_{B,j} ← (N_{B,j} Ȳ_{B,j} + Z_t)/(N_{B,j} + 1), N_{B,j} ← N_{B,j} + 1
14:     else  ▷ If sufficient data observed in B
15:       j∗ ∈ argmax_{j ∈ {0,1,...,N_k−1}} {Ȳ_{B,j}}, p∗ ← p^B_l + j∗ δ_B  ▷ Find the empirically-optimal price; if there are multiple, choose any one of them
16:       P ← (P \ B) ∪ C(B)  ▷ Update the partition by removing B and adding its children
17:       for B′ ∈ C(B) do  ▷ Initialization for each child bin
18:         N(B′) ← 0
19:         p^{B′}_l ← max{0, p∗ − Δ_{k+1}/2}; p^{B′}_u ← min{1, p∗ + Δ_{k+1}/2}  ▷ The range of the decision set
20:         δ_{B′} ← (p^{B′}_u − p^{B′}_l)/(N_{k+1} − 1)  ▷ The grid size of the decision set
21:         N_{B′,j}, Ȳ_{B′,j} ← 0 for j = 0, ..., N_{k+1} − 1  ▷ Initialize the average revenue and number of customers for each price
22:       end for
23:     end if
24:   else  ▷ If reaching the maximal level
25:     p_t ← (p^B_l + p^B_u)/2
26:   end if
27: end for


1. K, the maximal level of the bins. When a bin is at level K, the algorithm no longer splits it and simply applies the median price of its decision set whenever a customer is observed in it.

2. ∆k , the length of the interval that contains the decision set of level-k bins.

3. n_k, the maximal number of customers observed in a level-k bin in the partition. When n_k customers are observed, the bin splits.

4. N_k, the number of prices to explore in the decision set of level-k bins. The decision set of bin B consists of equally spaced grid points of an interval [p^B_l, p^B_u], to be adaptively specified by the algorithm.

We initialize the partition to include only the root bin B∅ in Step 4. Its decision set spans the whole interval [0, 1] with N0 equally spaced grid points. That is, the jth price is j δ_{B∅} = j/(N0 − 1) for j = 0, ..., N0 − 1. The initial average revenue and the number of customers that are charged the jth price are set to Ȳ_{B∅,j} = N_{B∅,j} = 0.

Suppose the partition is Pt at t and a customer Xt is observed (Step 6). The algorithm determines the bin B ∈ Pt to which the customer belongs. The counter N(B) records the number of customers already observed in B up to t while B is in the partition (Step 8). If the level of B is l(B) = k < K (i.e., B is not at the maximal level) and the number of customers observed in B is not sufficient (Steps 9 and 10), then the algorithm has assigned a decision set to the bin in previous steps, namely {p^B_l + j δ_B} for j = 0, ..., N_k − 1. There are N_k prices in the set and they are equally spaced in the interval [p^B_l, p^B_u]. They are explored successively as new customers are observed in B (explore p^B_l for the first customer observed in B, p^B_l + δ_B for the second customer, ..., p^B_l + (N_k − 1)δ_B for the N_k-th customer, p^B_l again for the (N_k + 1)-th customer, etc.). Therefore, the algorithm charges price p_t = p^B_l + j δ_B, where j = N(B) − 1 (mod N_k), for the N(B)-th customer observed in B (Step 11). Then, Step 13 updates the average revenue and the number of customers for the jth price.

If the level of B is l(B) = k < K and we have observed a sufficient number of customers in B (Steps 9 and 14), then the algorithm splits B and replaces it by its 2^d child bins in the partition (Step 16). For each child bin, Steps 18 to 21 initialize the counter, the interval that encloses the decision set, the grid size of the decision set, and the average revenue/number of customers for each price in the decision set,


respectively. In particular, to construct the decision set of a child bin, the algorithm first computes the empirically-optimal price in the decision set of the parent bin B; that is, j∗ ∈ argmax_{j ∈ {0,1,...,N_k−1}} {Ȳ_{B,j}} in Step 15. Then, the algorithm creates an interval centered at this empirically-optimal price with width Δ_{k+1}, properly truncated by the boundaries of [0, 1]. The decision set is then an equally spaced grid over this interval (Steps 19 and 20).

If the level of B is already K, then the algorithm simply charges the median price (Step 25) repeatedly without further exploration. For such a bin, its size is sufficiently small and the algorithm has narrowed the range of the decision set K times. The charged price is close enough to all p∗(x), x ∈ B, with high probability.
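As a concrete illustration, the loop of Algorithm 1 can be sketched in Python. This is a simplified sketch, not the paper's exact specification: the noise-free demand simulator, the split threshold n_split, the number of prices, and the halving decision-set widths are placeholder choices rather than the tuned parameters of Section 3.4.

```python
class Bin:
    """A hyper-rectangle in the covariate space with its own price decision set."""
    def __init__(self, level, lower, upper, p_lo, p_hi, n_prices):
        self.level = level                    # depth k in the binning tree
        self.lower, self.upper = lower, upper # corners of the hyper-rectangle
        self.p_lo, self.p_hi = p_lo, p_hi     # range [p^B_l, p^B_u] of the decision set
        self.delta = (p_hi - p_lo) / (n_prices - 1)  # grid size δ_B
        self.n_prices = n_prices
        self.count = 0                        # N(B): customers observed in this bin
        self.rev = [0.0] * n_prices           # running average revenue Ȳ_{B,j}
        self.n = [0] * n_prices               # N_{B,j}: customers charged the jth price

    def contains(self, x):
        return all(l <= xi < u for l, xi, u in zip(self.lower, x, self.upper))

    def price(self, j):
        return self.p_lo + j * self.delta

def abe(X_stream, d, demand, K=3, n_prices=5, n_split=40, widths=None):
    """Simplified Adaptive Binning and Exploration over a stream of covariates."""
    widths = widths or [0.5 ** k for k in range(K + 1)]  # stand-in for Δ_k
    partition = [Bin(0, [0.0] * d, [1.0] * d, 0.0, 1.0, n_prices)]
    revenue = 0.0
    for x in X_stream:
        B = next(b for b in partition if b.contains(x))
        B.count += 1
        if B.level >= K:                      # maximal level: charge the median price
            p = (B.p_lo + B.p_hi) / 2
            revenue += p * demand(x, p)
            continue
        j = (B.count - 1) % B.n_prices        # cycle through the price grid
        p = B.price(j)
        z = p * demand(x, p)                  # observed revenue Z_t
        B.rev[j] = (B.n[j] * B.rev[j] + z) / (B.n[j] + 1)
        B.n[j] += 1
        revenue += z
        if B.count == n_split:                # enough data: split into 2^d children
            j_star = max(range(B.n_prices), key=lambda i: B.rev[i])
            w = widths[B.level + 1]
            lo = max(0.0, B.price(j_star) - w / 2)
            hi = min(1.0, B.price(j_star) + w / 2)
            mids = [(l + u) / 2 for l, u in zip(B.lower, B.upper)]
            partition.remove(B)
            for mask in range(2 ** d):        # enumerate the 2^d child hyper-rectangles
                lower = [mids[i] if mask >> i & 1 else B.lower[i] for i in range(d)]
                upper = [B.upper[i] if mask >> i & 1 else mids[i] for i in range(d)]
                partition.append(Bin(B.level + 1, lower, upper, lo, hi, n_prices))
    return revenue, partition
```

Running this on a noise-free demand quickly concentrates the price grids of the small bins around the revenue-maximizing price, mirroring the narrowing described above.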

3.4. Choice of Parameters

We set K = ⌊log(T)/((d + 4) log(2))⌋, Δ_k = 2^{−k} log(T), N_k = ⌈log(T)⌉, and

n_k = max{ 0, ⌈ (2^{4k+15} / (M2^2 log^3(T))) (log(T) + log(log(T)) − (d + 2)k log(2)) ⌉ }.

To give a sense of their magnitudes, the edge length of the bins at the maximal level is approximately T^{−1/(d+4)}. The range of the decision set (Δ_k) is proportional to the edge length of the bin (2^{−k}). The number of prices in a decision set is approximately log(T). Therefore, the grid size is δ_B ≈ 2^{−k} for a level-k bin. The number of customers to observe in a level-k bin B before it splits is roughly n_k ≈ 2^{4k}/log(T)^2. When k is small, n_k can be zero according to this expression. In this case, the algorithm immediately splits the bin without collecting any sales data in it.
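These choices can be computed directly. The snippet below (with an illustrative value for the model constant M2, which in the paper depends on the demand function) shows their magnitudes for a moderate horizon.

```python
import math

def abe_parameters(T, d, M2=1.0):
    """Parameter choices of Section 3.4; the value of M2 here is illustrative."""
    log_T = math.log(T)
    K = int(log_T / ((d + 4) * math.log(2)))           # maximal level
    N = math.ceil(log_T)                               # prices per decision set
    Delta = [2 ** (-k) * log_T for k in range(K + 1)]  # decision-set widths Δ_k
    n = [max(0, math.ceil(2 ** (4 * k + 15) / (M2 ** 2 * log_T ** 3)
                          * (log_T + math.log(log_T) - (d + 2) * k * math.log(2))))
         for k in range(K + 1)]                        # split thresholds n_k
    return K, N, Delta, n

K, N, Delta, n = abe_parameters(T=10 ** 6, d=2)
# For T = 10^6 and d = 2 this gives K = 3 levels, about log(T) ≈ 14 prices per
# decision set, and split thresholds n_k growing roughly like 2^{4k} / log(T)^2.
```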

3.5. A Schematic Illustration

We illustrate the key steps of the algorithm by an example with d = 2. Figure 1 illustrates a possible outcome of the algorithm in periods t1 < t2 < t3 (top, middle, and bottom panels, respectively). Up until period t1, there is a single bin, and the covariates of the observed customers Xt for t ≤ t1 are illustrated in the top left panel. In this case, the decision set associated with the root bin is p ∈ {0.1, 0.2, ..., 0.9}, illustrated in the top right panel. The average revenue Ȳ_{B,j} of each price is recorded, and p∗ = 0.6 is the


empirically-optimal price. At t1 + 1, a sufficient number of customers have been observed and Step 14 is triggered in the algorithm. Therefore, the bin is split into four child bins.

From period t1 + 1 to t2, new customers are observed in each child bin (middle left panel). Note that the customers observed before t1 in the parent bin are no longer used and are colored in gray. For each child bin (the bottom-left bin is abbreviated to BL, etc.), the average revenues for the prices in the decision sets are shown in the middle right panel. The decision sets are centered at the empirically-optimal price of their parent bin, which is p∗ = 0.6 from the top right panel. They have narrower ranges and finer grids than that of the parent bin. At t2 + 1, a sufficient number of customers have been observed in BL, and it is split into four child bins.

From period t2 + 1 to t3, the partition consists of seven bins, as shown in the bottom left panel. The BR, TL, and TR bins keep observing customers and updating the average revenues, because they have not collected sufficient data. Their status at t3 is shown in the bottom panels. In the four newly created child bins of BL (the bottom-left bin of BL is abbreviated to BL-BL, etc.), the prices in the decision sets are used successively, and their average revenues are illustrated in the bottom right panel.

4. Regret Analysis: Upper Bound

To measure the performance of the ABE algorithm, we provide an upper bound for its regret.

Theorem 1. For any function f satisfying Assumptions 2 and 3, the regret incurred by the ABE algorithm is bounded by

R^π_ABE(T) ≤ C T^{(2+d)/(4+d)} log(T)^2

for a constant C > 0 that is independent of T.

We provide a sketch of the proof here and present the details in the appendix. In period t, if Xt ∈ B for a bin in the partition B ∈ Pt, then the expected regret incurred by the ABE algorithm is E[(f∗(Xt) − f(Xt, pt)) I{Xt ∈ B, B ∈ Pt}]. Since the total regret simply sums up the above quantity over t = 1, ..., T and all possible Bs, it suffices to focus on the regret for given t and B. Two possible scenarios can arise: (1) the optimal price


[Figure 1 consists of three rows of panels. Each left panel shows the covariate space [0, 1]^2 (axes x1 and x2) with the current partition and observed customers; each right panel plots the average revenues Ȳ_{B,j} against the prices p in the decision sets: first for the root bin, then for the child bins BL, BR, TL, TR, and finally for BL-BL, BL-BR, BL-TL, BL-TR.]

Figure 1: A schematic illustration of the ABE algorithm.


of the aggregate demand in B, i.e., p∗(B), is inside the range of the decision set, i.e., p∗(B) ∈ [p^B_l, p^B_u] (Step 19); (2) the optimal price p∗(B) is outside the range of the decision set.

Scenario one represents the regime where the algorithm is working "normally": up until t, the algorithm has successfully narrowed the optimal price p∗(B) (which provides a useful indicator for all p∗(x), x ∈ B, when B is small) down to [p^B_l, p^B_u]. By Assumption 3 part one, the regret in this scenario can be decomposed into two terms:

f∗(Xt) − f(Xt, pt) ≤ M3 (|pt − p∗(B)| + |p∗(B) − p∗(Xt)|)^2.

The first term can be bounded by the length of the interval, p^B_u − p^B_l. The second term can be bounded by the size of B given Xt ∈ B, by Assumption 3 parts two and three. By the choice of parameters in Section 3.4, the length of the interval decreases as the bin size decreases. Therefore, both terms can be well controlled when the size of B is sufficiently small, or equivalently, when the level l(B) is sufficiently large. This is why a properly chosen n_k can guarantee that the algorithm spends little time on large bins and collects a large amount of data in small bins. When the bin level reaches K, the above two terms are small enough and no more exploration is needed.

Scenario two represents the regime where the algorithm works "abnormally". In scenario two, the difference f∗(Xt) − f(Xt, pt) can no longer be controlled as in scenario one because pt and p∗(Xt) can be far apart. To make things worse, p∗(B) ∉ [p^B_l, p^B_u] usually implies p∗(B′) ∉ [p^{B′}_l, p^{B′}_u], where B′ is a child of B. This is because (1) p∗(B) is close to p∗(B′) for small B, and (2) [p^{B′}_l, p^{B′}_u] is created around the empirically-optimal price for B, and thus overlaps with [p^B_l, p^B_u]. Therefore, for any period s following t, the worst-case regret in that period is O(1) if Xs falls in B or its offspring.

To bound the regret in scenario two, we have to bound its probability, which requires a delicate analysis of the events. If scenario two occurs for B, then during the process of sequentially splitting B∅ to obtain B, we can find an ancestor bin of B (which can be B itself) for which scenario two happens for the first time along the "branch" from B∅ all the way down to B. More precisely, denoting the ancestor bin by Ba and its parent by P(Ba), we have: (1) p∗(P(Ba)) is inside [p^{P(Ba)}_l, p^{P(Ba)}_u] (scenario one); (2) after P(Ba) is split, p∗(Ba) is outside [p^{Ba}_l, p^{Ba}_u] (scenario two). Denote the empirically-optimal price in the decision set of P(Ba) by p∗. For such an event to occur, the center of the


decision set of Ba, which is p∗, has to be at least Δ_{l(Ba)}/2 away from p∗(Ba).⁵ Because of Assumption 3 and the choice of Δ_k, the distance between p∗(P(Ba)) and p∗(Ba) is relatively small compared to Δ_{l(Ba)}. Therefore, the empirically-optimal price p∗ must be far away from p∗(P(Ba)). The probability of such an event can be bounded using classic concentration inequalities for sub-Gaussian random variables: the prices that are closer to p∗(P(Ba)), and thus have higher means, turn out to generate lower average revenue than p∗; this event is extremely unlikely to happen when we have collected a large amount of sales data for each price in the decision set.

The total regret aggregates those of scenarios one and two over all possible combinations of B and Ba. It matches the quantity O(log(T)^2 T^{(2+d)/(4+d)}) presented in the theorem.

5. Regret Analysis: Lower Bound

In this section, we show that the minimax regret is no lower than c T^{(2+d)/(4+d)} for some constant c. Combined with the last section, we conclude that no non-anticipating policy does better than the ABE algorithm in terms of the order of magnitude of the regret in T (neglecting logarithmic terms).

We first construct a family of functions that satisfy Assumptions 2 and 3. The functions in the family are selected to be "difficult" to distinguish. By doing so, we will prove that any policy has to spend a substantial amount of time exploring prices that generate low revenues but help to differentiate the functions. Otherwise, the inability to correctly identify the underlying function is costly in the long run. Therefore, unable to contain both sources of regret at the same time, no policy can achieve lower regret than the quantity stated in Theorem 2.

Before introducing the family of functions, we define ∂B to be the boundary of a convex set in [0, 1)^d. Let D(B1, B2) be the Euclidean distance between two sets B1 and B2; that is, D(B1, B2) ≜ inf{‖x1 − x2‖2 : x1 ∈ B1, x2 ∈ B2}. We allow B1 or B2 to be a singleton. To define the family C, we partition the covariate space [0, 1)^d into M^d equally sized

⁵Recall that p^{Ba}_u − p^{Ba}_l = Δ_{l(Ba)}.


bins. That is, each bin has the following form: for (k1, ..., kd) ∈ {1, ..., M}^d,

{ x : (ki − 1)/M ≤ xi < ki/M, ∀ i = 1, ..., d }.

We number those bins 1, ..., M^d in an arbitrary order, i.e., B1, ..., B_{M^d}. Each function f(x, p) ∈ C is indexed by a tuple w ∈ {0, 1}^{M^d}, whose jth index determines the behavior of f_w(x, p) in Bj. More precisely, for x ∈ Bj, the personalized demand function is

d_w(x, p) = 2/3 − p/2                               if w_j = 0,
d_w(x, p) = 2/3 − p/2 + (1/3 − p/2) D(x, ∂Bj)       if w_j = 1,

and thus

f_w(x, p) = p (2/3 − p/2)                            if w_j = 0,
f_w(x, p) = p (2/3 − p/2 + (1/3 − p/2) D(x, ∂Bj))    if w_j = 1.

The optimal personalized price for customer x ∈ Bj is p∗(x) = 2/3 if w_j = 0, and p∗(x) = (2 + D(x, ∂Bj)) / (3(1 + D(x, ∂Bj))) if w_j = 1.

The construction of C follows a similar idea to Rigollet and Zeevi (2010). For a given f_w ∈ C, we can always find another function f_{w′} ∈ C that differs from f_w only in a single bin Bj, by setting w′ equal to w except for the jth index. The firm can only rely on the covariates generated in Bj to distinguish between f_w and f_{w′}. For small bins (i.e., large M), this is particularly costly because only a tiny fraction of customers are observed in a particular bin, and the difference |f_w − f_{w′}| = p(1/3 − p/2) D(x, ∂Bj) becomes tenuous. It also requires p to be far from 2/3 to detect the difference, and 2/3 happens to be the optimal price when w_j = 0. This makes the exploration/exploitation trade-off hard to balance. Now a policy has to carry out this task for M^d bins, i.e., distinguishing the underlying function f_w from the M^d tuples that differ from w in only one index. The cost is inevitable and adds to the lower bound of the regret. Moreover, we assume the customers X are uniformly distributed in [0, 1)^d.
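To make the construction concrete, the family C can be evaluated numerically and the optimal-price formula checked by a grid search. The bin-indexing and boundary-distance helpers below are our own, written to match the definitions above.

```python
def bin_index(x, M):
    """Index (k_1, ..., k_d) of the bin B_j containing x in the M^d partition of [0,1)^d."""
    return tuple(min(int(xi * M) + 1, M) for xi in x)

def dist_to_boundary(x, M):
    """Euclidean distance D(x, ∂B_j) from x to the boundary of its own axis-aligned bin;
    for a box, this distance is attained at the nearest face."""
    return min(min(xi * M - (k - 1), k - xi * M) / M
               for xi, k in zip(x, bin_index(x, M)))

def demand_w(x, p, w, M):
    """d_w(x, p): the personalized demand of the instance indexed by w."""
    base = 2 / 3 - p / 2
    if w[bin_index(x, M)] == 0:
        return base
    return base + (1 / 3 - p / 2) * dist_to_boundary(x, M)

def revenue_w(x, p, w, M):
    """f_w(x, p) = p * d_w(x, p)."""
    return p * demand_w(x, p, w, M)
```

A grid search over p then recovers p∗(x) = 2/3 on the w_j = 0 bins and p∗(x) = (2 + D)/(3(1 + D)) on the w_j = 1 bins.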

In the appendix, we show that the constructed C satisfies all the assumptions. The main theorem below gives the lower bound for the regret.


Theorem 2. For the constructed C, any non-anticipating policy π has regret

sup_{f ∈ C} R^π(T) ≥ c T^{(2+d)/(4+d)}

for a constant c > 0.

6. Future Research

As shown in Theorems 1 and 2, the best achievable regret of the problem is of order T^{(2+d)/(4+d)}. As a result, knowledge of the sparsity structure of the covariate is essential in designing pricing policies. More precisely, the provided customer covariate is of dimension d, while the personalized demand d(X, p) may depend on only d′ entries of the covariate, where d′ ≪ d. In this case, being able to identify the d′ relevant entries out of d significantly decreases the incurred regret from T^{(2+d)/(4+d)} to T^{(2+d′)/(4+d′)}. Indeed, in the ABE algorithm, if the sparsity structure is known, then a bin is split into 2^{d′} instead of 2^d child bins. This pools the observations that differ only in the dimensions corresponding to the redundant covariates, so that more observations are available in a bin, and thus substantially reduces the exploration cost. An important research question is then whether it is possible to design a binning algorithm that selects the dimension and position to split based on a certain criterion, like regression/classification decision trees (Hastie et al., 2001). This may significantly improve the regret in the presence of sparse covariates.
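The magnitude of this improvement is easy to quantify; the helper below (our own illustration) compares the regret exponents with and without sparsity.

```python
def regret_exponent(d):
    """Exponent of T in the minimax regret rate T^((2+d)/(4+d))."""
    return (2 + d) / (4 + d)

# With d = 20 observed covariates of which only d' = 2 are relevant, exploiting
# sparsity lowers the exponent from 22/24 ≈ 0.917 to 4/6 ≈ 0.667, a polynomial
# improvement in T.
```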

References

Agrawal, R. (1995). The continuum-armed bandit problem. SIAM Journal on Control and Optimization 33(6), 1926–1951.

Agrawal, S. and N. Goyal (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39-1.

Araman, V. F. and R. Caldentey (2009). Dynamic pricing for nonperishable products with demand learning. Operations Research 57(5), 1169–1188.

Auer, P., R. Ortner, and C. Szepesvári (2007). Improved Rates for the Stochastic Continuum-Armed Bandit Problem, pp. 454–468. Berlin, Heidelberg: Springer Berlin Heidelberg.

Ban, G. and N. B. Keskin (2017). Personalized dynamic pricing with machine learning. Working paper.

Bastani, H. and M. Bayati (2015). Online decision-making with high-dimensional covariates. Working paper.

Besbes, O. and A. Zeevi (2009). Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research 57(6), 1407–1420.

Besbes, O. and A. Zeevi (2012). Blind network revenue management. Operations Research 60(6), 1537–1550.

Broder, J. and P. Rusmevichientong (2012). Dynamic pricing under a general parametric choice model. Operations Research 60(4), 965–980.

Bubeck, S. and N. Cesa-Bianchi (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1), 1–122.

Bubeck, S., R. Munos, G. Stoltz, and C. Szepesvári (2011). X-armed bandits. Journal of Machine Learning Research 12(May), 1655–1695.

Cesa-Bianchi, N. and G. Lugosi (2006). Prediction, Learning, and Games. Cambridge University Press.

Chen, N. and G. Gallego (2018). A primal-dual learning algorithm for personalized dynamic pricing with an inventory constraint. Working paper.

Cheung, W. C., D. Simchi-Levi, and H. Wang (2017). Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6), 1722–1731.

Cohen, M. C., I. Lobel, and R. Paes Leme (2016). Feature-based dynamic pricing. Working paper.

den Boer, A. V. (2015). Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science 20(1), 1–18.

den Boer, A. V. and B. Zwart (2014). Simultaneously learning and optimizing using controlled variance pricing. Management Science 60(3), 770–783.

Elmachtoub, A. N., R. McNellis, S. Oh, and M. Petrik (2017). A practical method for solving contextual bandit problems using decision trees. Working paper.

Farias, V. F. and B. Van Roy (2010). Dynamic pricing with a prior on market response. Operations Research 58(1), 16–29.

Foucart, S. and H. Rauhut (2013). A Mathematical Introduction to Compressive Sensing, Volume 1. Birkhäuser Basel.

Gallego, G. and H. Topaloglu (2018). Revenue Management and Pricing Analytics. In preparation.

Goldenshluger, A. and A. Zeevi (2013). A linear response bandit problem. Stochastic Systems 3(1), 230–261.

Goldenshluger, A., A. Zeevi, et al. (2009). Woodroofe's one-armed bandit problem revisited. The Annals of Applied Probability 19(4), 1603–1633.

Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.

Javanmard, A. and H. Nazerzadeh (2016). Dynamic pricing in high-dimensions. Working paper.

Keskin, N. B. and A. Zeevi (2014). Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research 62(5), 1142–1167.

Kleinberg, R., A. Slivkins, and E. Upfal (2008). Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681–690. ACM.

Kleinberg, R. D. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pp. 697–704.

Kuleshov, V. and D. Precup (2014). Algorithms for multi-armed bandit problems. Working paper.

Langford, J. and T. Zhang (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824.

Lei, Y., S. Jasin, and A. Sinha (2017). Near-optimal bisection search for nonparametric dynamic pricing with inventory constraint. Working paper.

Nambiar, M., D. Simchi-Levi, and H. Wang (2016). Dynamic learning and price optimization with endogeneity effect. Working paper.

Perchet, V. and P. Rigollet (2013). The multi-armed bandit problem with covariates. The Annals of Statistics 41(2), 693–721.

Qiang, S. and M. Bayati (2016). Dynamic pricing with demand covariates. Working paper.

Rigollet, P. and A. Zeevi (2010). Nonparametric bandits with covariates. In A. T. Kalai and M. Mohri (Eds.), COLT, pp. 54–66. Omnipress.

Slivkins, A. (2014). Contextual bandits with similarity information. The Journal of Machine Learning Research 15(1), 2533–2568.

Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation (1st ed.). Springer-Verlag New York.

Tsybakov, A. B. et al. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32(1), 135–166.

Wang, Z., S. Deng, and Y. Ye (2014). Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research 62(2), 318–331.

Yang, Y., D. Zhu, et al. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics 30(1), 100–121.


Online Appendix for Nonparametric Pricing Analytics with Customer Covariates

A. Table of Notations

⌈x⌉  The smallest integer that is not less than x
(x)+  The positive part of x
#{·}  The cardinality of a set
Ft  The σ-algebra generated by (X1, π1, Z1, ..., X_{t−1}, π_{t−1}, Z_{t−1})
µX  The distribution of the covariate X over [0, 1)^d
p∗(B)  argmax_{p ∈ [0,1]} {E[f(X, p) | X ∈ B]}
p∗_B  The empirically-optimal decision in the decision set for B
∂B  The boundary of B

Table 1: A table of notations used in the paper.

B. Proofs

Proof of Proposition 1: Define, for p ∈ [0, 1],

g(p) = (f_B(p∗(B)) − f_B(p)) / (p∗(B) − p)^2    if p ≠ p∗(B),
g(p) = −f″_B(p∗(B))/2                            if p = p∗(B).

By L'Hôpital's rule, g(p) is continuous at p∗(B). In addition, because f_B(p) is continuous, g(p) is continuous for all p ∈ [0, 1]. By Weierstrass's extreme value theorem, we have g(p) ∈ [M2, M3] and both M2 and M3 are attained. By the definition and the uniqueness of the maximizer, g(p) > 0 for p ∈ [0, 1]. Therefore, we must have M2, M3 > 0. This establishes the result. □
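The continuity of g (and hence the existence of M2 and M3) can be checked numerically. The sketch below uses the quadratic revenue curve f_B(p) = p(2/3 − p/2) from the lower-bound construction, for which g is identically 1/2, so M2 = M3 = 1/2.

```python
def g(p, fB, p_star, f2_at_star, eps=1e-9):
    """The function g from the proof of Proposition 1."""
    if abs(p - p_star) < eps:
        return -f2_at_star / 2          # the L'Hôpital limit at p*(B)
    return (fB(p_star) - fB(p)) / (p_star - p) ** 2

fB = lambda p: p * (2 / 3 - p / 2)      # concave revenue with p*(B) = 2/3 and f'' ≡ -1
vals = [g(i / 100, fB, 2 / 3, -1.0) for i in range(101)]
```

For this quadratic f_B the ratio is constant, so its minimum M2 and maximum M3 coincide; for a general smooth f_B, g varies with p but stays positive and bounded, exactly as the proof asserts.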

B.1. Upper Bound in Section 4

We first introduce the following lemmas.

Lemma 1. A Bernoulli random variable X satisfies

E[exp(t(X − E[X]))] ≤ exp(t^2/8)  for all t ∈ ℝ.

This lemma follows directly from Hoeffding's lemma. It implies that a Bernoulli random variable is sub-Gaussian.
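Since the centered moment generating function of a Bernoulli variable is available in closed form, the bound of Lemma 1 can be verified directly on a grid:

```python
import math

def centered_mgf_bernoulli(q, t):
    """E[exp(t(X − E[X]))] for X ~ Bernoulli(q), in closed form."""
    return q * math.exp(t * (1 - q)) + (1 - q) * math.exp(-t * q)

# Hoeffding's lemma: for every q in [0, 1] and every real t, the centered MGF
# is at most exp(t^2 / 8), since X takes values in an interval of length 1.
```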


Lemma 2. Suppose that for given x and p, the random variable Z(x, p) is sub-Gaussian with parameter σ, i.e.,

E[exp(t(Z − E[Z]))] ≤ exp(σ t^2)

for all t ∈ ℝ. Then the distribution of Z(X, p) conditional on X ∈ B for a set B is still sub-Gaussian with the same parameter.

Proof. Let µX denote the distribution of X. We have that for all t ∈ ℝ,

E[exp(t(Z(X, p) − E[Z(X, p) | X ∈ B])) | X ∈ B]
= ( ∫_B E[exp(t(Z(x, p) − E[Z(X, p) | X ∈ B]))] dµX(x) ) / ( ∫_B dµX(x) )
= ( ∫_B E[exp(t(Z(x, p) − E[Z(x, p)]))] dµX(x) / ∫_B dµX(x) ) × ( ∫_B exp(t E[Z(x, p)]) dµX(x) / ( exp(t E[Z(X, p) | X ∈ B]) ∫_B dµX(x) ) )
≤ ( ∫_B exp(σ t^2) dµX(x) / ∫_B dµX(x) ) × 1 = exp(σ t^2),

where the last inequality is by the definition of conditional expectations. Hence the result is proved. □

Proof of Theorem 1: According to the algorithm (Step 7), let Pt denote the partition formed by the bins at time t when Xt is generated. The regret associated with Xt can be counted through the bin B ∈ Pt into which Xt falls. Meanwhile, the level of B is at most K. Therefore,

R^π_ABE(T) = E[ Σ_{t=1}^T (f∗(Xt) − f(Xt, pt)) ]
= E[ Σ_{t=1}^T Σ_{B ∈ Pt} (f∗(Xt) − f(Xt, pt)) I{Xt ∈ B} ]
= E[ Σ_{t=1}^T Σ_{k=0}^K Σ_{B: l(B)=k} (f∗(Xt) − f(Xt, pt)) I{Xt ∈ B, B ∈ Pt} ].

We will define the following random event for each bin B:

E_B = { p∗(B) ∈ [p^B_l, p^B_u] }.

Recall that p∗(B) is the unique maximizer of f_B(p) = E[f(X, p) | X ∈ B] by Assumption 3, and [p^B_l, p^B_u] is the range of the decision set to explore for B. According to Step 19 of the ABE algorithm, the interval [p^B_l, p^B_u] is constructed around p∗_{P(B)}, the empirically-optimal price of the parent bin P(B) that maximizes the empirical average Ȳ_{P(B),j}.


We will decompose the regret depending on whether E_B occurs:

E[R^π_ABE] = E[ Σ_{t=1}^T Σ_{k=0}^K Σ_{B: l(B)=k} (f∗(Xt) − f(Xt, pt)) I{Xt ∈ B, B ∈ Pt, E^c_B} ]   (term 1)
+ E[ Σ_{t=1}^T Σ_{k=0}^K Σ_{B: l(B)=k} (f∗(Xt) − f(Xt, pt)) I{Xt ∈ B, B ∈ Pt, E_B} ].   (term 2)   (1)

We first analyze term 1. Because E_{B∅} is always true ([p^{B∅}_l, p^{B∅}_u] = [0, 1] always encloses p∗(B∅) according to Step 4), we can find an ancestor of B, say Ba (which can be B itself), such that E^c_{Ba} ∩ E_{P(Ba)} ∩ E_{P(P(Ba))} ∩ ... ∩ E_{B∅} occurs. In other words, up until Ba, the algorithm always correctly encloses the optimal price p∗(P^{(k)}(Ba)) of each ancestor bin of Ba in the interval [p^{P^{(k)}(Ba)}_l, p^{P^{(k)}(Ba)}_u]. Therefore, when E^c_B occurs, we can rearrange the event by such Ba. Term 1 in (1) can be bounded by

E[ Σ_{t=1}^T Σ_{k=0}^K Σ_{B: l(B)=k} (f∗(Xt) − f(Xt, pt)) I{Xt ∈ B, B ∈ Pt, E^c_B} ]
≤ Σ_{t=1}^T Σ_{k=0}^K Σ_{B: l(B)=k} M1 P(Xt ∈ B, B ∈ Pt, E^c_B)
= Σ_{t=1}^T Σ_{k=1}^K Σ_{B: l(B)=k} M1 P(Xt ∈ B, B ∈ Pt, E^c_{Ba} ∩ E_{P(Ba)} ∩ E_{P(P(Ba))} ∩ ... ∩ E_{B∅})
≤ Σ_{t=1}^T Σ_{k=1}^K Σ_{B: l(B)=k} M1 P(Xt ∈ B, B ∈ Pt, E^c_{Ba} ∩ E_{P(Ba)})
= Σ_{t=1}^T Σ_{k=1}^K Σ_{B: l(B)=k} Σ_{k′=1}^K Σ_{B′: l(B′)=k′} M1 P(Xt ∈ B, B ∈ Pt, Ba = B′, E^c_{B′} ∩ E_{P(B′)}).   (2)

The first inequality is due to Assumption 2. In the second line, we start enumerating from k = 1 instead of k = 0 because E^c_{B∅} never occurs. In the last equality, we rearrange the probabilities by counting the deterministic bins B′ instead of the random bins Ba.

Now note that the events {Xt ∈ B, B ∈ Pt, Ba = B′, E^c_{B′} ∩ E_{P(B′)}} are mutually exclusive for different Bs, because Pt is a partition and Xt can only fall into one bin. Moreover, {Xt ∈ B, B ∈ Pt, Ba = B′, E^c_{B′} ∩ E_{P(B′)}} ⊂ {Xt ∈ B′, E^c_{B′} ∩ E_{P(B′)}} because B ⊂ Ba. Therefore,

Σ_{k=1}^K Σ_{B: l(B)=k} P(Xt ∈ B, B ∈ Pt, Ba = B′, E^c_{B′} ∩ E_{P(B′)}) ≤ P(Xt ∈ B′, E^c_{B′} ∩ E_{P(B′)}).

Thus, we can further simplify (2):

(2) ≤ Σ_{t=1}^T Σ_{k=1}^K Σ_{B′: l(B′)=k} M1 P(Xt ∈ B′, E^c_{B′} ∩ E_{P(B′)})
= Σ_{t=1}^T Σ_{k=1}^K Σ_{B: l(B)=k} M1 P(Xt ∈ B) P(E^c_B ∩ E_{P(B)}).   (3)

The last equality holds because, for given t and B, the event {Xt ∈ B} ∈ σ(Xt) and E^c_B ∩ E_{P(B)} ∈ F_{t−1}. Therefore, the two events are independent.

Next, we analyze the event E^c_B ∩ E_{P(B)} given l(B) = k in order to bound (3). This event implies that when the parent bin P(B) is created, its optimal price p∗(P(B)) is inside the interval [p^{P(B)}_l, p^{P(B)}_u]. At the end of Step 14, when n_{k−1} customers have been observed in P(B), it is split into 2^d children. The optimal price of its child bin B, that is, p∗(B), is no longer inside [p^B_l, p^B_u]. By Step 19, p^B_l = max{p∗_{P(B)} − Δ_k/2, 0} and p^B_u = min{p∗_{P(B)} + Δ_k/2, 1}, where p∗_{P(B)} is the empirically-optimal price for P(B). Therefore, E^c_B implies that p∗(B) ∉ [p∗_{P(B)} − Δ_k/2, p∗_{P(B)} + Δ_k/2]. Combined with Assumption 3 part two, which states that p∗(B) ∈ [inf{p∗(x) : x ∈ B}, sup{p∗(x) : x ∈ B}] ⊂ [inf{p∗(x) : x ∈ P(B)}, sup{p∗(x) : x ∈ P(B)}], we have

[inf{p∗(x) : x ∈ P(B)}, sup{p∗(x) : x ∈ P(B)}] ⊄ [p∗_{P(B)} − Δ_k/2, p∗_{P(B)} + Δ_k/2].

That is, either inf{p∗(x) : x ∈ P(B)} < p∗_{P(B)} − Δ_k/2 or sup{p∗(x) : x ∈ P(B)} > p∗_{P(B)} + Δ_k/2. By Assumption 3 part three, sup{p∗(x) : x ∈ P(B)} − inf{p∗(x) : x ∈ P(B)} ≤ M4 d_{P(B)} = M4 √d 2^{−(k−1)}, because the level of P(B) is k − 1. Hence, by Assumption 3 part two, inf{p∗(x) : x ∈ P(B)} ≥ p∗(P(B)) − M4 √d 2^{−(k−1)} and sup{p∗(x) : x ∈ P(B)} ≤ p∗(P(B)) + M4 √d 2^{−(k−1)}. Combining the above observations, E^c_B can only happen when |p∗(P(B)) − p∗_{P(B)}| > Δ_k/2 − M4 √d 2^{−(k−1)}. On the other hand, E_{P(B)} implies that p∗(P(B)) ∈ [p^{P(B)}_l, p^{P(B)}_u]. Therefore, E^c_B ∩ E_{P(B)} can occur only if there exist two grid points 0 ≤ j1, j2 ≤ N_{k−1} − 1 in Step 14 for bin P(B) such that

1. The j2-th grid point is the closest to the optimal price for the bin, p∗(P(B)). That is, |p^{P(B)}_l + j2 δ_{P(B)} − p∗(P(B))| ≤ δ_{P(B)}/2.


2. The $j_1$th grid point maximizes $Y_{P(B),j}$. That is, $p_l^{P(B)} + j_1 \delta_{P(B)} = p^*_{P(B)}$, which implies $Y_{P(B),j_1} \ge Y_{P(B),j_2}$ in Step 15.

3. $|p^*(P(B)) - p^*_{P(B)}| > \Delta_k/2 - M_4\sqrt{d}\, 2^{-(k-1)}$.

In other words, the empirically optimal price is the $j_1$th grid point, while the $j_2$th grid point is closest to the true revenue maximizer $p^*(P(B))$ in bin $P(B)$. Given that the two grid points are far apart (by point 3 above), the probability of this event should be small.

To further bound the probability, consider $Y_{P(B),j}$. In Step 15, it is the sum of $\lfloor n_{k-1}/N_{k-1} \rfloor$ or $\lceil n_{k-1}/N_{k-1} \rceil$ independent random variables with mean $E[f(X, p_l^{P(B)} + j\delta_{P(B)}) \mid X \in P(B)]$. By Lemmas 1 and 2, they are still sub-Gaussian with parameter $\sigma = 1/8$. This gives the following probabilistic bound (recall the definition of $f_B(p)$ in Section 2.2):

\[
\begin{aligned}
P\big(E_B^c \cap E_{P(B)}\big) &\le P\big(Y_{P(B),j_1} \ge Y_{P(B),j_2}\big) \\
&= P\Big( \frac{1}{t_1}\sum_{i=1}^{t_1} X_i^{(1)} - \frac{1}{t_2}\sum_{i=1}^{t_2} X_i^{(2)} \ge f_{P(B)}\big(p_l^{P(B)} + j_2\delta_{P(B)}\big) - f_{P(B)}\big(p_l^{P(B)} + j_1\delta_{P(B)}\big) \Big) \\
&= P\Big( \frac{1}{t_1}\sum_{i=1}^{t_1} X_i^{(1)} - \frac{1}{t_2}\sum_{i=1}^{t_2} X_i^{(2)} \ge f_{P(B)}\big(p_l^{P(B)} + j_2\delta_{P(B)}\big) - f_{P(B)}\big(p^*(P(B))\big) + f_{P(B)}\big(p^*(P(B))\big) - f_{P(B)}\big(p_l^{P(B)} + j_1\delta_{P(B)}\big) \Big) \\
&\le P\Big( \frac{1}{t_1}\sum_{i=1}^{t_1} X_i^{(1)} - \frac{1}{t_2}\sum_{i=1}^{t_2} X_i^{(2)} \ge M_2\big(p_l^{P(B)} + j_1\delta_{P(B)} - p^*(P(B))\big)^2 - M_3\big(p_l^{P(B)} + j_2\delta_{P(B)} - p^*(P(B))\big)^2 \Big) \\
&\le P\Big( \frac{1}{t_1}\sum_{i=1}^{t_1} X_i^{(1)} - \frac{1}{t_2}\sum_{i=1}^{t_2} X_i^{(2)} \ge M_2\Big(\big(\Delta_k/2 - M_4\sqrt{d}\, 2^{-(k-1)}\big)^+\Big)^2 - M_3\,\delta_{P(B)}^2/4 \Big).
\end{aligned}
\]
Here $t_1$ and $t_2$ can be either $\lfloor n_{k-1}/N_{k-1} \rfloor$ or $\lceil n_{k-1}/N_{k-1} \rceil$; $X_i^{(1)}$ and $X_i^{(2)}$ are independent mean-zero sub-Gaussian random variables with parameter $\sigma$. Their averages are the centered versions of $Y_{P(B),j_1}$ and $Y_{P(B),j_2}$, and thus their means are moved to the right-hand side. In the last inequality, $(\cdot)^+$ denotes the positive part. The inequality follows from Assumption 3 part one and the previously derived facts that $|p_l^{P(B)} + j_2\delta_{P(B)} - p^*(P(B))| \le \delta_{P(B)}/2$ and $|p_l^{P(B)} + j_1\delta_{P(B)} - p^*(P(B))| \ge \Delta_k/2 - M_4\sqrt{d}\, 2^{-(k-1)}$. By the property of sub-Gaussian random variables (see, for example, Theorem 7.27 in Foucart and Rauhut, 2013), the above probability is bounded by

\[
\begin{aligned}
P\big(E_B^c \cap E_{P(B)}\big) &\le \exp\left( - \frac{\Big(\Big(M_2\big(\big(\Delta_k/2 - M_4\sqrt{d}\,2^{-(k-1)}\big)^+\big)^2 - M_3\,\delta_{P(B)}^2/4\Big)^+\Big)^2}{4\sigma(1/t_1 + 1/t_2)} \right) \\
&\le \exp\left( - \frac{n_{k-1}\Big(\Big(M_2\big(\big(\Delta_k/2 - M_4\sqrt{d}\,2^{-(k-1)}\big)^+\big)^2 - M_3\,\delta_{P(B)}^2/4\Big)^+\Big)^2}{N_{k-1} + 1} \right).
\end{aligned}
\]
By our choice of parameters, $\Delta_k = 2^{-k}\log(T)$, $N_k \equiv \lceil \log(T) \rceil$, and $\delta_{P(B)} \le \Delta_{k-1}/N_{k-1} \le 2^{-(k-1)}$. Therefore, when $T \ge \max\{\exp(8M_4\sqrt{d}),\, \exp(4\sqrt{2M_3/M_2})\}$, we have
\[
\begin{aligned}
&\frac{\Delta_k}{4} - M_4\sqrt{d}\,2^{-(k-1)} = 2^{-(k+2)}\log(T) - M_4\sqrt{d}\,2^{-(k-1)} \ge 0
\;\Rightarrow\; \frac{\Delta_k}{2} - M_4\sqrt{d}\,2^{-(k-1)} \ge \frac{\Delta_k}{4}, \\
&\frac{M_2\Delta_k^2}{32} - \frac{M_3\,\delta_{P(B)}^2}{4} \ge \frac{M_2\, 2^{-2k}\log^2(T)}{32} - M_3\, 2^{-2k} \ge 0 \\
&\Rightarrow\; M_2\big(\Delta_k/2 - M_4\sqrt{d}\,2^{-(k-1)}\big)^2 - \frac{M_3\,\delta_{P(B)}^2}{4} \ge \frac{M_2\Delta_k^2}{16} - \frac{M_3\,\delta_{P(B)}^2}{4} \ge \frac{M_2\Delta_k^2}{32}.
\end{aligned}
\]
Therefore, there exists a constant $c_1 = M_2^2/1024$ such that
\[
P\big(E_B^c \cap E_{P(B)}\big) \le \exp\left( -c_1 \frac{\Delta_k^4\, n_{k-1}}{\log(T) + 1} \right).
\]
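The two conditions guaranteed by the threshold on $T$ hold for every level $k$ simultaneously, since the factor $2^{-k}$ cancels. A small numeric sketch of both implications (the values of $M_2$, $M_3$, $M_4$, and $d$ below are arbitrary illustrative choices, not from the paper):

```python
import math

# Illustrative constants (arbitrary choices for this sketch, not from the paper).
M2, M3, M4, d = 0.5, 2.0, 1.0, 3

# Threshold on T from the proof: T >= max{exp(8 M4 sqrt(d)), exp(4 sqrt(2 M3 / M2))}.
T = math.ceil(max(math.exp(8 * M4 * math.sqrt(d)), math.exp(4 * math.sqrt(2 * M3 / M2))))
log_T = math.log(T)

# First condition: Delta_k / 4 - M4 sqrt(d) 2^{-(k-1)} >= 0 for all levels k.
ok_first = all(
    2 ** -(k + 2) * log_T - M4 * math.sqrt(d) * 2 ** -(k - 1) >= 0 for k in range(1, 30)
)
# Second condition: M2 2^{-2k} log(T)^2 / 32 - M3 2^{-2k} >= 0 for all levels k.
ok_second = all(
    M2 * 2 ** (-2 * k) * log_T**2 / 32 - M3 * 2 ** (-2 * k) >= 0 for k in range(1, 30)
)
```

Both flags are true because each inequality reduces to a condition on $\log(T)$ alone.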

With this bound, we can proceed to provide an upper bound for (3). Because $\sum_{\{B: l(B)=k\}} P(X_t \in B) = 1$, we have
\[
\sum_{t=1}^{T}\sum_{k=1}^{K}\sum_{\{B: l(B)=k\}} M_1 P(X_t \in B)\, P\big(E_B^c \cap E_{P(B)}\big) \le M_1 T \sum_{k=1}^{K} \exp\left( -c_1 \frac{\Delta_k^4\, n_{k-1}}{\log(T) + 1} \right). \qquad (4)
\]

We next analyze term 2 of (1). By Assumption 3 part one, $f^*(X_t) - f(X_t, p_t) \le M_3(p^*(X_t) - p_t)^2 \le M_3(|p^*(B) - p_t| + |p^*(X_t) - p^*(B)|)^2$. By the design of the algorithm (Steps 12 and 25), $p_t \in [p_l^B, p_u^B]$; conditional on the event $E_B = \{p^*(B) \in [p_l^B, p_u^B]\}$, we have $|p^*(B) - p_t| \le p_u^B - p_l^B \le \Delta_k$ for $l(B) = k$. On the other hand, by Assumption 3 parts two and three, $|p^*(X_t) - p^*(B)| \le \sup\{p^*(x): x \in B\} - \inf\{p^*(x): x \in B\} \le M_4 d_B \le M_4\sqrt{d}\,2^{-k}$ for $l(B) = k$. Therefore, term 2 can be bounded by

\[
\begin{aligned}
E\left[ \sum_{t=1}^{T}\sum_{k=0}^{K}\sum_{\{B: l(B)=k\}} \big(f^*(X_t) - f(X_t, p_t)\big) I_{\{X_t \in B,\, B \in \mathcal{P}_t,\, E_B\}} \right]
&\le E\left[ \sum_{t=1}^{T}\sum_{k=0}^{K}\sum_{\{B: l(B)=k\}} M_3\big(\Delta_k + M_4\sqrt{d}\,2^{-k}\big)^2 I_{\{X_t \in B,\, B \in \mathcal{P}_t,\, E_B\}} \right] \\
&\le E\left[ \sum_{k=0}^{K-1}\sum_{\{B: l(B)=k\}}\sum_{t=1}^{T} M_3\big(\Delta_k + M_4\sqrt{d}\,2^{-k}\big)^2 I_{\{X_t \in B,\, B \in \mathcal{P}_t\}} \right] \\
&\quad + \sum_{t=1}^{T} M_3\big(\Delta_K + M_4\sqrt{d}\,2^{-K}\big)^2 \sum_{\{B: l(B)=K\}} P(X_t \in B). \qquad (5)
\end{aligned}
\]
For the first term in (5), note that $\{X_t \in B, B \in \mathcal{P}_t\}$ occurs at most $n_k$ times for a given $B$ with $l(B) = k$. Moreover, there are $2^{dk}$ bins of level $k$, i.e., $\#\{B: l(B) = k\} = 2^{dk}$. Therefore, substituting $\Delta_k = 2^{-k}\log(T)$ into the first term yields the upper bound $\sum_{k=0}^{K-1} M_3 n_k 2^{(d-2)k} (\log(T) + M_4\sqrt{d})^2$. For the second term in (5), $\sum_{\{B: l(B)=K\}} P(X_t \in B) = 1$ because $\{B: l(B) = K\}$ forms a partition of the covariate space and $X_t$ always falls into one of the bins. Therefore, (5) is bounded by
\[
M_3\left( \sum_{k=0}^{K-1} n_k 2^{(d-2)k} + T 2^{-2K} \right) \big(\log(T) + M_4\sqrt{d}\big)^2 \le c_3 \log(T)^2 \left( \sum_{k=0}^{K-1} n_k 2^{(d-2)k} + T 2^{-2K} \right) \qquad (6)
\]
for $c_3 = (1 + M_3)(\log(2) + M_4\sqrt{d})^2/\log(2)^2$ and $T \ge 2$.

Combining (4) and (6), we can find a constant $c_2 = c_1/2^5 = M_2^2/2^{15}$ such that
\[
\begin{aligned}
E[R_{\pi_{ABE}}] &\le \sum_{k=0}^{K-1} \Big( c_3 \log(T)^2 n_k 2^{(d-2)k} + M_1 T \exp\big(-c_1 2^{-4k-4}\log^3(T)\, n_k/2\big) \Big) + c_3 \log(T)^2\, T 2^{-2K} \\
&\le \sum_{k=0}^{K-1} \Big( c_3 \log(T)^2 n_k 2^{(d-2)k} + M_1 T \exp\big(-c_2 2^{-4k}\log^3(T)\, n_k\big) \Big) + c_3 \log(T)^2\, T 2^{-2K}. \qquad (7)
\end{aligned}
\]


We choose
\[
n_k = \max\left\{ 0,\; \left\lceil \frac{2^{4k+15}}{M_2^2 \log(T)^3} \big( \log(T) + \log(\log(T)) - (d+2)k\log(2) \big) \right\rceil \right\}
\]
to minimize $c_3\log(T)^2 n_k 2^{(d-2)k} + M_1 T \exp(-c_2 2^{-4k}\log(T)^3 n_k)$ in (7). More precisely,
\[
c_3\log(T)^2 n_k 2^{(d-2)k} \le c_3 M_2^{-2}\, 2^{15}\, \frac{\log(T)^2}{\log(T)^3}\, 2^{(d+2)k}\big(\log(T) + \log(\log(T))\big) \le c_4\, 2^{(d+2)k}
\]
for some constant $c_4 > 0$, and
\[
M_1 T \exp\big(-c_2 2^{-4k}\log^3(T)\, n_k\big) \le M_1 T \exp\big(-\log(T) - \log(\log(T)) + (d+2)k\log(2)\big) \le c_5\, 2^{(d+2)k}
\]
for a constant $c_5 > 0$. Therefore, (7) implies that we can find a constant $c_6 = c_4 + c_5$ such that
\[
E[R_{\pi_{ABE}}] \le \sum_{k=0}^{K-1} c_6\, 2^{(d+2)k} + c_3\log(T)^2\, T 2^{-2K} \le c_6\, 2^{(d+2)K} + c_3\log(T)^2\, T 2^{-2K}.
\]
Therefore, by our choice of $K = \lfloor \frac{\log(T)}{(d+4)\log(2)} \rfloor$, the regret is bounded by $c_7 \log^2(T)\, T^{\frac{d+2}{d+4}}$ for some constant $c_7$. This completes the proof. $\square$
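The final balancing step behind the choice of $K$ can be sanity-checked numerically. The sketch below uses illustrative values of $T$ and $d$ (not from the paper) and verifies that $2^{(d+2)K} \le T^{(d+2)/(d+4)}$ and $T\,2^{-2K} \le 4\,T^{(d+2)/(d+4)}$:

```python
import math

def regret_exponent_terms(T: int, d: int):
    """Evaluate the two terms balanced by K = floor(log(T) / ((d + 4) log 2))."""
    K = math.floor(math.log(T) / ((d + 4) * math.log(2)))
    return 2 ** ((d + 2) * K), T * 2 ** (-2 * K)

T, d = 10**6, 2  # illustrative horizon and covariate dimension
term1, term2 = regret_exponent_terms(T, d)
target = T ** ((d + 2) / (d + 4))
# K <= log(T)/((d+4) log 2) gives term1 <= target;
# K >= log(T)/((d+4) log 2) - 1 gives term2 <= 4 * target.
```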

B.2. Lower Bound in Section 5

Next we show that Assumptions 2 and 3 are satisfied by the construction in Section 5.

Proposition 2. The choice of $f \in \mathcal{C}$ satisfies Assumptions 2 and 3 with $M_1 = 4$, $M_2 = 1$, $M_3 = 2$, and $M_4 = 1$.

To give some intuition, note that by the construction of $f$, both $f_w(x, p)$ and $p^*(x)$ are Lipschitz continuous on $[0, 1)^d$. This continuity guarantees the desired properties.

Proof of Proposition 2: For Assumption 2, we discuss two cases. The first case is $x_1, x_2 \in B_j$, i.e., the two customers are in the same bin. In this case,
\[
|f_w(x_1, p_1) - f_w(x_2, p_2)| \le
\begin{cases}
\big| p_1\big(\tfrac{2}{3} - \tfrac{p_1}{2}\big) - p_2\big(\tfrac{2}{3} - \tfrac{p_2}{2}\big) \big| \le 2|p_1 - p_2| & w_j = 0, \\[4pt]
2|p_1 - p_2| + \big| p_1\big(\tfrac{1}{3} - \tfrac{p_1}{2}\big) D(x_1, \partial B_j) - p_2\big(\tfrac{1}{3} - \tfrac{p_2}{2}\big) D(x_2, \partial B_j) \big| & w_j = 1.
\end{cases}
\]
When $w_j = 0$, the assumption is already satisfied. When $w_j = 1$, by the triangle inequality we have
\[
\begin{aligned}
&\big| p_1\big(\tfrac{1}{3} - \tfrac{p_1}{2}\big) D(x_1, \partial B_j) - p_2\big(\tfrac{1}{3} - \tfrac{p_2}{2}\big) D(x_2, \partial B_j) \big| \\
&\quad\le 2|p_1 - p_2|\, D(x_1, \partial B_j) + p_2\big(\tfrac{1}{3} - \tfrac{p_2}{2}\big) \big| D(x_1, \partial B_j) - D(x_2, \partial B_j) \big| \\
&\quad\le \frac{1}{M}|p_1 - p_2| + \big| D(x_1, \partial B_j) - D(x_2, \partial B_j) \big| \\
&\quad\le |p_1 - p_2| + \|x_1 - x_2\|_2.
\end{aligned}
\]
The second inequality holds because $p_2 \ge 0$ and $D(x_1, \partial B_j) \le 1/2M \le 1/2$ when $x_1 \in B_j$. The third inequality holds because
\[
\|x_1 - x_2\|_2 + D(x_2, \partial B_j) = \min_{a \in \partial B_j}\{\|a - x_2\|_2 + \|x_1 - x_2\|_2\} \ge \min_{a \in \partial B_j}\{\|a - x_1\|_2\} = D(x_1, \partial B_j)
\]
and similarly $\|x_1 - x_2\|_2 + D(x_1, \partial B_j) \ge D(x_2, \partial B_j)$. Therefore, we have shown that $|f_w(x_1, p_1) - f_w(x_2, p_2)| \le 3|p_1 - p_2| + 2\|x_1 - x_2\|_2$ in case one.

The second case is $x_1 \in B_{j_1}$ and $x_2 \in B_{j_2}$ for $j_1 \neq j_2$. If $w_{j_1} = w_{j_2} = 0$, then by the previous analysis we already have $|f_w(x_1, p_1) - f_w(x_2, p_2)| \le 2|p_1 - p_2|$. If $w_{j_1} = 1$ and $w_{j_2} = 0$, then
\[
|f_w(x_1, p_1) - f_w(x_2, p_2)| \le 2|p_1 - p_2| + p_1\Big(\frac{1}{3} - \frac{p_1}{2}\Big) D(x_1, \partial B_{j_1}) \le 2|p_1 - p_2| + 2\|x_1 - x_2\|_2.
\]
The last inequality holds because the straight line connecting $x_1$ and $x_2$ must intersect $\partial B_{j_1}$, and the distance from $x_1$ to the intersection is no less than $D(x_1, \partial B_{j_1})$; therefore, $D(x_1, \partial B_{j_1}) \le \|x_1 - x_2\|_2$. If $w_{j_1} = 0$ and $w_{j_2} = 1$, then the result follows similarly. If $w_{j_1} = w_{j_2} = 1$, then by the same argument,
\[
\Big| p_1\Big(\frac{1}{3} - \frac{p_1}{2}\Big) D(x_1, \partial B_{j_1}) - p_2\Big(\frac{1}{3} - \frac{p_2}{2}\Big) D(x_2, \partial B_{j_2}) \Big| \le D(x_1, \partial B_{j_1}) + D(x_2, \partial B_{j_2}) \le 2\|x_1 - x_2\|_2.
\]
Therefore, combining both cases, we always have
\[
|f_w(x_1, p_1) - f_w(x_2, p_2)| \le 4|p_1 - p_2| + 4\|x_1 - x_2\|_2.
\]

For Assumption 3, note that
\[
\begin{aligned}
f_B(p) &= p\Big(\frac{2}{3} - \frac{p}{2}\Big) + p\Big(\frac{1}{3} - \frac{p}{2}\Big) E\left[ \sum_{j=1}^{M^d} w_j I_{\{X \in B_j\}} D(X, \partial B_j) \,\Big|\, X \in B \right] \\
&= p\Big(\frac{2}{3} - \frac{p}{2}\Big) + p\Big(\frac{1}{3} - \frac{p}{2}\Big) \sum_{j=1}^{M^d} w_j P(B_j \cap B)\, E[D(X, \partial B_j) \mid X \in B_j \cap B].
\end{aligned}
\]
For part one, because the second-order derivative of $f_B(p)$ is always bounded between $-2$ and $-1$,
\[
2(p - p^*(B))^2 \ge f_B(p^*(B)) - f_B(p) \ge (p - p^*(B))^2,
\]
and part one holds with $M_2 = 1$ and $M_3 = 2$. For part two, note that the maximizer is
\[
p^*(B) = \frac{2 + \sum_{j=1}^{M^d} w_j P(B_j \cap B)\, E[D(X, \partial B_j) \mid X \in B_j \cap B]}{3\Big(1 + \sum_{j=1}^{M^d} w_j P(B_j \cap B)\, E[D(X, \partial B_j) \mid X \in B_j \cap B]\Big)}.
\]
Because $p^*(B)$ is a monotone function of $\sum_{j=1}^{M^d} w_j P(B_j \cap B)\, E[D(X, \partial B_j) \mid X \in B_j \cap B]$, it is easy to check that part two of Assumption 3 holds. For part three, consider $x_1 \in B \cap B_{j_1}$ and $x_2 \in B \cap B_{j_2}$. If $w_{j_1} = 0$ and $w_{j_2} = 0$, then $p^*(x_1) - p^*(x_2) = 0$. If either $w_{j_1} = 0$ or $w_{j_2} = 0$, then
\[
|p^*(x_1) - p^*(x_2)| \le \frac{1}{3}\max\big\{D(x_1, \partial B_{j_1}),\, D(x_2, \partial B_{j_2})\big\} \le \|x_1 - x_2\|_2 \le d_B
\]
by the previous analysis. If $w_{j_1} = 1$ and $w_{j_2} = 1$, then similarly we have $|p^*(x_1) - p^*(x_2)| \le \frac{1}{3}|D(x_1, \partial B_{j_1}) - D(x_2, \partial B_{j_2})| \le \|x_1 - x_2\|_2 \le d_B$. Therefore, part three holds with $M_4 = 1$. $\square$

The proof of Theorem 2 uses the Kullback-Leibler (KL) divergence to measure the "distinguishability" of the underlying functions. Such information-theoretic approaches are a standard technique in the learning literature. The proof is outlined below.

Among all functions $f \in \mathcal{C}$, we focus on pairs $f_w$ and $f_{w'}$ that differ only in a single bin. For example, consider $w = (w_1, w_2, \ldots, w_{j-1}, 0, w_{j+1}, \ldots, w_{M^d})$ and $w' = (w_1, w_2, \ldots, w_{j-1}, 1, w_{j+1}, \ldots, w_{M^d})$ for some $j$. Because the indices of $w$ and $w'$ are identical except for the $j$th, $f_w$ and $f_{w'}$ differ only in bin $B_j$. Denote $w = (w_{-j}, 0)$ and $w' = (w_{-j}, 1)$ to highlight this fact. Distinguishing between $f_{w_{-j},0}$ and $f_{w_{-j},1}$ poses a challenge to any policy. In particular, for $x \in B_j$, the difference between the two functions, $|f_{w_{-j},0}(x, p) - f_{w_{-j},1}(x, p)| = p(\frac{1}{3} - \frac{p}{2}) D(x, \partial B_j)$, vanishes as $p \to 2/3$. Thus, charging a price different from $2/3$ makes the difference more visible and helps to distinguish $f_{w_{-j},0}$ from $f_{w_{-j},1}$. However, if $p$ deviates too much from the optimal decision, $p^*(x) = 2/3$ or $p^*(x) = (2 + D(x, \partial B_j))/(3(1 + D(x, \partial B_j)))$, then significant regret is incurred in that period.

To capture this trade-off, for a given $j = 1, \ldots, M^d$ and $w_{-j} \in \{0, 1\}^{M^d - 1}$, define the quantity
\[
z_{w_{-j}} = \sum_{t=1}^{T} \frac{9}{11 M^2}\, E^{\pi}_{f_{w_{-j},0}}\left[ \Big(\frac{2}{3} - p_t\Big)^2 I_{\{X_t \in B_j\}} \right], \qquad (8)
\]
where the expectation is taken with respect to a policy $\pi$ and the underlying function $f_{w_{-j},0}$. This quantity is crucial in analyzing the regret. More precisely, if $z_{w_{-j}}$ is large (which implies that $p_t$ tends to be far from $2/3$), then $f_{w_{-j},0}$ and $f_{w_{-j},1}$ are easy to distinguish, but the regret becomes uncontrollable.

Lemma 3.
\[
\sup_{f \in \mathcal{C}} R_{\pi} \ge \frac{11 M_2 M^2}{9 \times 2^{M^d}} \sum_{j=1}^{M^d} \sum_{w_{-j}} z_{w_{-j}}.
\]

On the other hand, if $z_{w_{-j}}$ is small, then the KL divergence of the measures associated with $f_{w_{-j},0}$ and $f_{w_{-j},1}$ is also small. In other words, the firm cannot easily distinguish between $f_{w_{-j},0}$ and $f_{w_{-j},1}$, which impedes learning and incurs substantial regret.

Lemma 4.
\[
\sup_{f \in \mathcal{C}} R_{\pi} \ge \frac{M_2 T}{9 \times 2^{M^d + 11} M^{d+2}} \sum_{j=1}^{M^d} \sum_{w_{-j}} \exp\big(-z_{w_{-j}}\big).
\]

Since the effects of $z_{w_{-j}}$ are opposite in Lemma 3 and Lemma 4, combining the two bounds, we can find a positive constant $c_1$ independent of $T$ and $M$ so that
\[
R_{\pi} \ge \frac{c_1}{2^{M^d}} \sum_{j=1}^{M^d} \sum_{w_{-j}} \left( \frac{T}{M^{d+2}} \exp\big(-z_{w_{-j}}\big) + M^2 z_{w_{-j}} \right) \ge \frac{c_1}{2^{M^d}} \sum_{j=1}^{M^d} \sum_{w_{-j}} M^2\left( 1 + \log\Big(\frac{T}{M^{d+4}}\Big) \right) \ge \frac{c_1 M^{d+2}}{2}\left( 1 + \log\Big(\frac{T}{M^{d+4}}\Big) \right).
\]
In the second inequality above, we minimize the expression over positive $z_{w_{-j}}$. Since $M$ can be an arbitrary positive integer, we set $M = \lceil T^{1/(d+4)} \rceil$ in the last quantity. A calculation shows that it is lower bounded by $c\, T^{(2+d)/(4+d)}$ for a constant $c > 0$.
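The minimization used in the second inequality, $\min_{z>0}\{A e^{-z} + Bz\} = B(1 + \log(A/B))$ attained at $z = \log(A/B)$ (for $A > B$), can be checked numerically with illustrative values of $A$ and $B$ (standing in for $T/M^{d+2}$ and $M^2$):

```python
import math

# min over z > 0 of A exp(-z) + B z, attained at z* = log(A / B) when A > B.
def objective(z: float, A: float, B: float) -> float:
    return A * math.exp(-z) + B * z

A, B = 1000.0, 4.0  # arbitrary illustrative values with A > B
z_star = math.log(A / B)
min_value = B * (1.0 + math.log(A / B))
grid_min = min(objective(k / 100.0, A, B) for k in range(1, 2000))
```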

To prove Theorem 2, i.e., Lemma 3 and Lemma 4, we introduce the following lemmas.

Lemma 5 (KL divergence for Bernoulli random variables). For two Bernoulli random variables $X_1$ and $X_2$ with means $\theta_1$ and $\theta_2$, we have
\[
K(\mu_{X_1}, \mu_{X_2}) \le \frac{(\theta_1 - \theta_2)^2}{\theta_2(1 - \theta_2)}.
\]

Proof. The proof can be found in, e.g., Rigollet and Zeevi (2010) and is thus omitted. $\square$

Lemma 6 (Chain rule of the KL divergence). Given joint distributions $p(x, y)$ and $q(x, y)$, we have
\[
K\big(p(x, y), q(x, y)\big) = K\big(p(x), q(x)\big) + E_{p(x)}\big[ K\big(p(y \mid x), q(y \mid x)\big) \big],
\]
where $p(\cdot)$ and $q(\cdot)$ denote the marginal distributions, and $p(\cdot \mid x)$ and $q(\cdot \mid x)$ denote the conditional distributions.

Proof. The proof can be found in standard textbooks and is thus omitted. $\square$
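Lemma 5 follows from bounding the KL divergence by the $\chi^2$ divergence, which for Bernoulli distributions equals exactly $(\theta_1 - \theta_2)^2/(\theta_2(1 - \theta_2))$. A grid check over Bernoulli means:

```python
import math

# Grid check of Lemma 5: KL(Ber(t1) || Ber(t2)) <= (t1 - t2)^2 / (t2 (1 - t2)).
def kl_bernoulli(t1: float, t2: float) -> float:
    return t1 * math.log(t1 / t2) + (1.0 - t1) * math.log((1.0 - t1) / (1.0 - t2))

thetas = [i / 100.0 for i in range(1, 100)]  # interior of (0, 1)
lemma5_ok = all(
    kl_bernoulli(t1, t2) <= (t1 - t2) ** 2 / (t2 * (1.0 - t2)) + 1e-12
    for t1 in thetas for t2 in thetas
)
```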

Proof of Lemma 3: We use $E^{\pi}_f$ to highlight the dependence of the expectation on the policy $\pi$ and the underlying function $f$. Note that
\[
\begin{aligned}
\sup_{f \in \mathcal{C}} R_{\pi} &= \sup_{f \in \mathcal{C}} \sum_{t=1}^{T} E\big[ f^*(X_t) - f(X_t, p_t) \big] = \sup_{f \in \mathcal{C}} \sum_{t=1}^{T} \sum_{j=1}^{M^d} E\big[ (f^*(X_t) - f(X_t, p_t))\, I_{\{X_t \in B_j\}} \big] \\
&\ge M_2 \sup_{f \in \mathcal{C}} \sum_{t=1}^{T} \sum_{j=1}^{M^d} E\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big] \ge \frac{M_2}{2^{M^d}} \sum_{w} \sum_{t=1}^{T} \sum_{j=1}^{M^d} E^{\pi}_{f_w}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big].
\end{aligned}
\]
In the last inequality, we have used the fact that $\#\{\mathcal{C}\} = 2^{M^d}$ and that the supremum is always no less than the average.

For a given bin $B_j$, we focus on $f_{w_{-j},0}$ and $f_{w_{-j},1}$, which differ only for $x \in B_j$. Therefore, we can rearrange $\sum_w$ as $\sum_{w_{-j} \in \{0,1\}^{M^d - 1}} \sum_{w_j \in \{0,1\}}$. We have the following lower bound for the regret:

\[
\begin{aligned}
\sup_{f \in \mathcal{C}} R_{\pi} &\ge \frac{M_2}{2^{M^d}} \sum_{j=1}^{M^d} \sum_{w_{-j} \in \{0,1\}^{M^d-1}} \sum_{w_j \in \{0,1\}} \sum_{t=1}^{T} E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big] \\
&\ge \frac{M_2}{2^{M^d}} \sum_{j=1}^{M^d} \sum_{w_{-j}} \sum_{t=1}^{T} E^{\pi}_{f_{w_{-j},0}}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big] \\
&= \frac{M_2}{2^{M^d}} \sum_{j=1}^{M^d} \sum_{w_{-j}} \sum_{t=1}^{T} E^{\pi}_{f_{w_{-j},0}}\left[ \Big(\frac{2}{3} - p_t\Big)^2 I_{\{X_t \in B_j\}} \right] = \frac{11 M_2 M^2}{9 \times 2^{M^d}} \sum_{j=1}^{M^d} \sum_{w_{-j}} z_{w_{-j}}.
\end{aligned}
\]
In the second inequality, we have dropped the regret for $f_{w_{-j},1}$. The last equality follows from the definition of $z_{w_{-j}}$ in (8). Hence we have proved the result. $\square$

Proof of Lemma 4: By the same argument as in the proof of Lemma 3, we have
\[
\sup_{f \in \mathcal{C}} R_{\pi} \ge \frac{M_2}{2^{M^d}} \sum_{j=1}^{M^d} \sum_{t=1}^{T} \sum_{w_{-j} \in \{0,1\}^{M^d-1}} \sum_{w_j \in \{0,1\}} E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big].
\]
Because $X_t$ is uniformly distributed in $[0, 1)^d$, $P(X_t \in B_j) = M^{-d}$. By conditioning on the event $X_t \in B_j$, we have
\[
E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big] = E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 \mid X_t \in B_j \big]\, P(X_t \in B_j) = \frac{1}{M^d}\, E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 \mid X_t \in B_j \big].
\]
Since $(p^*(X_t) - p_t)^2$ is measurable with respect to the $\sigma$-algebra generated by $\mathcal{F}_{t-1}$ and $X_t$, by the tower property, we have
\[
E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big] = \frac{1}{M^d}\, E^{\pi}_{f_{w_{-j},w_j}}\Big[ E\big[ (p^*(X_t) - p_t)^2 \mid \mathcal{F}_{t-1}, X_t \in B_j \big] \Big].
\]
Let $E^{\pi,t-1}_{f_{w_{-j},w_j}}[\cdot]$ denote $E^{\pi}_{f_{w_{-j},w_j}}[E[\cdot \mid \mathcal{F}_{t-1}]]$ and let $P^{B,t-1}_{X_t}(\cdot)$ denote the conditional probability $P(\cdot \mid \mathcal{F}_{t-1}, X_t \in B)$. By Markov's inequality, for any constant $s > 0$ we have
\[
\begin{aligned}
&\sum_{w_j \in \{0,1\}} E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big] \qquad (9) \\
&= \frac{1}{M^d} \sum_{w_j \in \{0,1\}} E^{\pi}_{f_{w_{-j},w_j}}\Big[ E\big[ (p^*(X_t) - p_t)^2 \mid X_t \in B_j, \mathcal{F}_{t-1} \big] \Big] \\
&\ge \frac{1}{M^d} \sum_{w_j \in \{0,1\}} \frac{s^2}{M^2}\, E^{\pi}_{f_{w_{-j},w_j}}\left[ P^{B_j,t-1}_{X_t}\Big( |p^*(X_t) - p_t| \ge \frac{s}{M} \Big) \right] \\
&= \frac{s^2}{M^{d+2}} \left( E^{\pi}_{f_{w_{-j},0}}\left[ P^{B_j,t-1}_{X_t}\Big( \Big|\frac{2}{3} - p_t\Big| \ge \frac{s}{M} \Big) \right] + E^{\pi}_{f_{w_{-j},1}}\left[ P^{B_j,t-1}_{X_t}\Big( \Big|\frac{2 + D(X_t, \partial B_j)}{3(1 + D(X_t, \partial B_j))} - p_t\Big| \ge \frac{s}{M} \Big) \right] \right) \\
&\ge \frac{s^2}{M^{d+2}} \left( E^{\pi}_{f_{w_{-j},0}}\left[ P^{B_j,t-1}_{X_t}\Big( \Big|\frac{2}{3} - p_t\Big| \ge \frac{s}{M},\, A \Big) \right] + E^{\pi}_{f_{w_{-j},1}}\left[ P^{B_j,t-1}_{X_t}\Big( \Big|\frac{2 + D(X_t, \partial B_j)}{3(1 + D(X_t, \partial B_j))} - p_t\Big| \ge \frac{s}{M},\, A \Big) \right] \right), \qquad (10)
\end{aligned}
\]
where we define the event $A = \{X_t \in B_j\} \cap \{D(X_t, \partial B_j) > 12s/M\}$. In the second equality, we have used the fact that for $X_t \in B_j$, when $w_j = 0$, $p^*(X_t) = \frac{2}{3}$; when $w_j = 1$, $p^*(X_t) = \frac{2 + D(X_t, \partial B_j)}{3(1 + D(X_t, \partial B_j))}$. The motivation for introducing $A$ is as follows: consider the classification rule $\Pi_t \mapsto \{0, 1\}$ associated with $p_t$ that tries to distinguish between $w_j = 0$ and $w_j = 1$. It is defined as
\[
\Pi_t =
\begin{cases}
0 & \big|\frac{2}{3} - p_t\big| \le \Big|\frac{2 + D(X_t, \partial B_j)}{3(1 + D(X_t, \partial B_j))} - p_t\Big|, \\[4pt]
1 & \text{otherwise}.
\end{cases}
\]
In other words, $\Pi_t$ classifies the underlying function as $f_{w_{-j},0}$ if $p_t$ is closer to the optimal price $p^*(X_t)$ of $f_{w_{-j},0}$, and as $f_{w_{-j},1}$ otherwise. For $f_{w_{-j},0}$, a misclassification on the event $A$ is $A \cap \{\Pi_t = 1\}$. It implies that
\[
\Big|\frac{2}{3} - p_t\Big| \ge \frac{1}{2} \times \left| \frac{2}{3} - \frac{2 + D(X_t, \partial B_j)}{3(1 + D(X_t, \partial B_j))} \right| = \frac{D(X_t, \partial B_j)}{6(1 + D(X_t, \partial B_j))},
\]
which, on $A$, implies $A \cap \{|2/3 - p_t| \ge s/M\}$ since $D(X_t, \partial B_j) \le 1$. Similarly, $A \cap \{\Pi_t = 0\} \subset A \cap \{|\frac{2 + D(X_t, \partial B_j)}{3(1 + D(X_t, \partial B_j))} - p_t| \ge s/M\}$. Therefore, by the fact that $P(X \in A) = (1 - 24s)^d/M^d$, we have

\[
\begin{aligned}
&E^{\pi}_{f_{w_{-j},0}}\left[ P^{B_j,t-1}_{X_t}\Big( \Big|\frac{2}{3} - p_t\Big| \ge \frac{s}{M},\, A \Big) \right] + E^{\pi}_{f_{w_{-j},1}}\left[ P^{B_j,t-1}_{X_t}\Big( \Big|\frac{2 + D(X_t, \partial B_j)}{3(1 + D(X_t, \partial B_j))} - p_t\Big| \ge \frac{s}{M},\, A \Big) \right] \\
&\ge E^{\pi}_{f_{w_{-j},0}}\Big[ P^{B_j,t-1}_{X_t}\big( A \cap \{\Pi_t = 1\} \big) \Big] + E^{\pi}_{f_{w_{-j},1}}\Big[ P^{B_j,t-1}_{X_t}\big( A \cap \{\Pi_t = 0\} \big) \Big] \\
&= (1 - 24s)^d \Big( P^{\pi}_{f_{w_{-j},0}}\big( \Pi_t = 1 \mid X_t \in A \big) + P^{\pi}_{f_{w_{-j},1}}\big( \Pi_t = 0 \mid X_t \in A \big) \Big). \qquad (11)
\end{aligned}
\]
Next we lower bound the misclassification error (11) by the Kullback-Leibler (KL) divergence between the two probability measures associated with $f_{w_{-j},0}$ and $f_{w_{-j},1}$. Intuitively, if the two probability measures are close, then no classifier (including $\Pi_t$) can achieve a very small misclassification error. Formally, introduce the KL divergence between two probability measures $P$ and $Q$ as
\[
K(P, Q) =
\begin{cases}
\int \log\frac{dP}{dQ}\, dP & \text{if } P \ll Q, \\
+\infty & \text{otherwise},
\end{cases}
\]
where $P \ll Q$ indicates that $P$ is absolutely continuous with respect to $Q$. By the independence of $\mathcal{F}_{t-1}$ and $X_t$, the two measures we want to distinguish in (11), $\mu^{\pi}_{f_{w_{-j},0}}(\cdot \mid X_t \in A)$ and $\mu^{\pi}_{f_{w_{-j},1}}(\cdot \mid X_t \in A)$, can be expressed as product measures
\[
\mu^{\pi}_{f_{w_{-j},0}}(\cdot \mid X_t \in A) = \mu^{\pi,t-1}_{f_{w_{-j},0}}(\cdot) \times \mu^{A}_{X_t}(\cdot), \qquad
\mu^{\pi}_{f_{w_{-j},1}}(\cdot \mid X_t \in A) = \mu^{\pi,t-1}_{f_{w_{-j},1}}(\cdot) \times \mu^{A}_{X_t}(\cdot),
\]
where $\mu^{\pi,t-1}_{f_{w_{-j},0}}(\cdot)$ is the measure of $(X_1, Z_1, \ldots, X_{t-1}, Z_{t-1})$ depending on $\pi$ and $f_{w_{-j},0}$, and $\mu^{A}_{X_t}(\cdot)$ is the measure of $X_t$ conditional on $X_t \in A$. By Theorem 2.2 (iii) in Tsybakov (2009),
\[
\begin{aligned}
(11) &\ge \frac{(1 - 24s)^d}{2} \exp\Big( -K\big( \mu^{\pi,t-1}_{f_{w_{-j},0}} \times \mu^{A}_{X_t},\; \mu^{\pi,t-1}_{f_{w_{-j},1}} \times \mu^{A}_{X_t} \big) \Big) \\
&= \frac{(1 - 24s)^d}{2} \exp\Big( -K\big( \mu^{\pi,t-1}_{f_{w_{-j},0}}, \mu^{\pi,t-1}_{f_{w_{-j},1}} \big) - E^{\pi,t-1}_{f_{w_{-j},0}}\big[ K\big( \mu^{A}_{X_t}, \mu^{A}_{X_t} \big) \big] \Big) \\
&= \frac{(1 - 24s)^d}{2} \exp\Big( -K\big( \mu^{\pi,t-1}_{f_{w_{-j},0}}, \mu^{\pi,t-1}_{f_{w_{-j},1}} \big) \Big). \qquad (12)
\end{aligned}
\]
The second line follows from Lemma 6; the third line follows from the fact that $\mu^{A}_{X_t}$ is the same distribution under $f_{w_{-j},0}$ and $f_{w_{-j},1}$, independent of $\mathcal{F}_{t-1}$.

To further simplify the expression, note that $\mu^{\pi,t}_{f_{w_{-j},0}}(\cdot)$ can be decomposed as
\[
\mu^{\pi,t}_{f_{w_{-j},0}}(\cdot) = \mu^{\pi,t-1}_{f_{w_{-j},0}}(\cdot) \times \mu_{X}(\cdot) \times \mu^{Z_t}_{f_{w_{-j},0}}(\cdot \mid \mathcal{F}_{t-1}, X_t),
\]
where $\mu_X$ is the measure (uniform distribution) of $X_t$ and $\mu^{Z_t}_{f_{w_{-j},0}}(\cdot \mid \mathcal{F}_{t-1}, X_t)$ is the measure of $Z_t$ conditional on $\mathcal{F}_{t-1}$ and $X_t$. We apply Lemma 6 again:
\[
\begin{aligned}
K\big( \mu^{\pi,t}_{f_{w_{-j},0}}, \mu^{\pi,t}_{f_{w_{-j},1}} \big) &= K\big( \mu^{\pi,t-1}_{f_{w_{-j},0}}, \mu^{\pi,t-1}_{f_{w_{-j},1}} \big) + E^{\pi,t-1}_{f_{w_{-j},0}}\big[ K(\mu_{X_t}, \mu_{X_t}) \big] \\
&\quad + E^{\pi,t-1}_{f_{w_{-j},0}}\Big[ E_X\Big[ K\big( \mu^{Z_t}_{f_{w_{-j},0}}(\cdot \mid \mathcal{F}_{t-1}, X_t),\, \mu^{Z_t}_{f_{w_{-j},1}}(\cdot \mid \mathcal{F}_{t-1}, X_t) \big) \Big] \Big].
\end{aligned}
\]
It is easy to see that the second term is zero. For the third term, we first condition on $\mathcal{F}_{t-1}$ and then on the covariate $X_t$. Because $p_t$ depends only on $\mathcal{F}_{t-1}$ and $X_t$, $p_t$ is the same for $f_{w_{-j},0}$ and $f_{w_{-j},1}$ conditional on $\mathcal{F}_{t-1}$ and $X_t$. Therefore, $\mu^{Z_t}_{f_{w_{-j},0}}(\cdot \mid \mathcal{F}_{t-1}, X_t)$ and $\mu^{Z_t}_{f_{w_{-j},1}}(\cdot \mid \mathcal{F}_{t-1}, X_t)$ are two Bernoulli distributions with means $d_{w_{-j},0}(X_t, p_t)$ and $d_{w_{-j},1}(X_t, p_t)$, respectively. By Lemma 5, we have

\[
\begin{aligned}
K\big( \mu^{Z_t}_{f_{w_{-j},0}}(\cdot \mid \mathcal{F}_{t-1}, X_t),\, \mu^{Z_t}_{f_{w_{-j},1}}(\cdot \mid \mathcal{F}_{t-1}, X_t) \big)
&\le \frac{\big( d_{w_{-j},0}(X_t, p_t) - d_{w_{-j},1}(X_t, p_t) \big)^2}{d_{w_{-j},1}(X_t, p_t)\big( 1 - d_{w_{-j},1}(X_t, p_t) \big)} \\
&\le \frac{144}{11} \big( d_{w_{-j},0}(X_t, p_t) - d_{w_{-j},1}(X_t, p_t) \big)^2 \\
&= \frac{144}{11} \Big( p_t\Big(\frac{1}{3} - \frac{p_t}{2}\Big) D(X_t, \partial B_j)\, I_{\{X_t \in B_j\}} \Big)^2 \\
&\le \frac{9}{11 M^2} \Big(\frac{2}{3} - p_t\Big)^2 I_{\{X_t \in B_j\}}.
\end{aligned}
\]
In the second inequality, we used the fact that $d_{w_{-j},1}(X_t, p_t) \in [1/12, 5/6]$ as long as we choose $M \ge 2$, so that $D(X_t, \partial B_j) \le 1/2$. In the last inequality, we used the fact that the distance from a point inside $B_j$ to the boundary of $B_j$ is at most $1/2M$. Therefore,

we can obtain an upper bound for $K(\mu^{\pi,t}_{f_{w_{-j},0}}, \mu^{\pi,t}_{f_{w_{-j},1}})$:
\[
\begin{aligned}
K\big( \mu^{\pi,t}_{f_{w_{-j},0}}, \mu^{\pi,t}_{f_{w_{-j},1}} \big)
&\le \sum_{i=1}^{t} E^{\pi,i-1}_{f_{w_{-j},0}}\Big[ E_X\Big[ K\big( \mu^{Z_i}_{f_{w_{-j},0}}(\cdot \mid \mathcal{F}_{i-1}, X_i),\, \mu^{Z_i}_{f_{w_{-j},1}}(\cdot \mid \mathcal{F}_{i-1}, X_i) \big) \Big] \Big] \\
&\le \sum_{i=1}^{t} E^{\pi,i-1}_{f_{w_{-j},0}}\left[ E_X\left[ \frac{9}{11 M^2} \Big(\frac{2}{3} - p_i\Big)^2 I_{\{X_i \in B_j\}} \right] \right] \\
&\le \sum_{t=1}^{T} E^{\pi}_{f_{w_{-j},0}}\left[ \frac{9}{11 M^2} \Big(\frac{2}{3} - p_t\Big)^2 I_{\{X_t \in B_j\}} \right] = z_{w_{-j}}.
\end{aligned}
\]

Therefore, combining it with (9), (11) and (12), we have shown the lemma:
\[
\begin{aligned}
\sup_{f \in \mathcal{C}} \sum_{t=1}^{T} E\big[ f^*(X_t) - f(X_t, p_t) \big]
&\ge \frac{M_2}{2^{M^d}} \sum_{t=1}^{T} \sum_{j=1}^{M^d} \sum_{w_{-j}} \sum_{w_j \in \{0,1\}} E^{\pi}_{f_{w_{-j},w_j}}\big[ (p^*(X_t) - p_t)^2 I_{\{X_t \in B_j\}} \big] \\
&\ge \frac{T M_2 s^2 (1 - 24s)^d}{2^{M^d+1} M^{d+2}} \sum_{j=1}^{M^d} \sum_{w_{-j}} \exp\big( -z_{w_{-j}} \big) = \frac{M_2 T}{9 \times 2^{M^d+11} M^{d+2}} \sum_{j=1}^{M^d} \sum_{w_{-j}} \exp\big( -z_{w_{-j}} \big),
\end{aligned}
\]
where in the last step we have set $s = 1/24$. $\square$