p´olya tree posterior distributions on densitiesgeorge polya in annales de l’institut henri...

34
olya tree posterior distributions on densities Isma¨ el Castillo Laboratoire de Probabilit´ es et Mod` eles Al´ eatoires (LPMA) UMR 7599, Universit´ es Paris 6 et 7, France [email protected] Abstract olya trees form a popular class of prior distributions used in Bayesian nonpara- metrics. For some choice of parameters, P´olya trees are prior distributions on density functions. In this paper we carry out a frequentist analysis of the induced posterior distributions. We investigate the contraction rate of P´ olya tree posterior densities in terms of the supremum loss and study the limiting shape distribution. A nonparamet- ric Bernstein–von Mises theorem is established, as well as a Bayesian Donsker theorem for the posterior cumulative distribution function. 1 Introduction olya trees are a class of random probability distributions that are commonly used as prior distributions in the Bayesian nonparametrics study of infinite- dimensional statistical models. The name ‘P´ olya tree’ appears in works by Mauldin et al. [34] and Lavine [30] and it has been used since then, although the object and related ones such as tail-free processes already appear in works by Freedman [13]-[14], Kraft [29], and Ferguson [12]. It should be noted that the name ‘P´ olya tree’ is also used for a dierent object, not considered in the present work, in the literature on trees, where it refers to a rooted unordered tree. The origin of the name in the statistical literature comes from a beautiful connection with P´ olya urns, themselves named after the 1930 article [37] by George P´ olya in Annales de l’Institut Henri Poincar´ e. It was indeed shown in [34] that P´ olya trees are de Finetti measures of certain exchangeable sampling schemes defined from a tree of P´ olya urns. In Bayesian nonparametric statistics, the starting point is the construction of a prior distribution, a probability measure that ‘samples at random’ the parameter to be estimated. If the parameter is itself a distribution, one needs to build a ‘distribution on distributions’. A popular distribution on probability measures is the Dirichlet process introduced by Ferguson [11], see [15] for a review of its use in the statistics literature. The Dirichlet process, as we recall below, is actually a special case of P´ olya tree for certain choices of parameters. However, the resulting measure is not directly suited for modelling a smooth 1

Upload: others

Post on 21-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Polya tree posterior distributions on densities

Ismael Castillo

Laboratoire de Probabilites et ModelesAleatoires (LPMA) UMR 7599, Universites

Paris 6 et 7, France

[email protected]

Abstract

Polya trees form a popular class of prior distributions used in Bayesian nonpara-

metrics. For some choice of parameters, Polya trees are prior distributions on density

functions. In this paper we carry out a frequentist analysis of the induced posterior

distributions. We investigate the contraction rate of Polya tree posterior densities in

terms of the supremum loss and study the limiting shape distribution. A nonparamet-

ric Bernstein–von Mises theorem is established, as well as a Bayesian Donsker theorem

for the posterior cumulative distribution function.

1 Introduction

Polya trees are a class of random probability distributions that are commonlyused as prior distributions in the Bayesian nonparametrics study of infinite-dimensional statistical models. The name ‘Polya tree’ appears in works byMauldin et al. [34] and Lavine [30] and it has been used since then, althoughthe object and related ones such as tail-free processes already appear in worksby Freedman [13]-[14], Kraft [29], and Ferguson [12]. It should be noted thatthe name ‘Polya tree’ is also used for a di↵erent object, not considered in thepresent work, in the literature on trees, where it refers to a rooted unorderedtree. The origin of the name in the statistical literature comes from a beautifulconnection with Polya urns, themselves named after the 1930 article [37] byGeorge Polya in Annales de l’Institut Henri Poincare. It was indeed shown in[34] that Polya trees are de Finetti measures of certain exchangeable samplingschemes defined from a tree of Polya urns.

In Bayesian nonparametric statistics, the starting point is the constructionof a prior distribution, a probability measure that ‘samples at random’ theparameter to be estimated. If the parameter is itself a distribution, one needsto build a ‘distribution on distributions’. A popular distribution on probabilitymeasures is the Dirichlet process introduced by Ferguson [11], see [15] for areview of its use in the statistics literature. The Dirichlet process, as we recallbelow, is actually a special case of Polya tree for certain choices of parameters.However, the resulting measure is not directly suited for modelling a smooth

1

Page 2: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

object such as a density function, as draws from the Dirichlet process are discretealmost surely. On the contrary, di↵erent choices of parameters of the Polya treelead to a probability measure that is absolutely continuous with respect toLebesgue measure, and hence admits a density. In this paper we focus on thistype of ‘density Polya trees’.

Let X = X(n) = (X1

, . . . , Xn) be a sample from an unknown distributionP on the interval [0, 1], and suppose that P admits a density f with respect toLebesgue measure, denoting P = Pf . Following a Bayesian approach, we viewP as random and put an a priori distribution ⇧ on P , equal to a Polya tree dis-tribution. The data X

1

, . . . , Xn are viewed as, given P , i.i.d. from P . From thisone forms the posterior distribution, the conditional law P |X

1

, . . . , Xn, that wedenote ⇧[· |X

1

, . . . , Xn] = ⇧[· |X]. To study the convergence of this random (itdepends on the data) distribution as n ! 1, we undertake a so-called frequen-tist analysis of the Bayesian procedure: we assume that the data has actuallybeen generated i.i.d. from a fixed distribution P

0

= Pf0

, for some density f0

on [0, 1], and are interested in the convergence of the posterior ⇧[· |X(n)] inprobability under Pf

0

, as n ! 1. A natural question is: does the posterior⇧[· |X(n)] converge to �P

0

, a Dirac mass at the ‘true’ distribution? If so, thisis the so-called consistency property of the posterior at P

0

. Further, what canbe said about the rate of convergence, and, perhaps, the form of the limit afterrescaling?

General conditions for consistency of posterior distributions were given inSchwartz [40], and the theory was further developed among others in [1]. ForPolya trees, posterior consistency in density estimation in the weak topologyfollows from results in [31], see also [16], and consistency in the Hellinger topol-ogy was obtained in [1], whose conditions where further refined in [43]. A nextnatural step once consistency is obtained is to investigate the convergence rateof the posterior distribution. This has been the object of much attention inthe last 15 years, with fundamental contributions such as [17], [41], [18], wheregeneral su�cient conditions on model and prior are given ensuring posteriorconvergence towards the true distribution at some rate.

Yet, to the best of our knowledge, there has been no study so far of posteriorconvergence rates when the prior distribution is a Polya tree. One reason may bethat density Polya trees are often perceived as relatively ‘rough’ objects: it canbe shown for instance that the corresponding density has jumps at a countablenumber of points almost surely. From this it could seem as if Polya trees arejust ‘smooth enough’ for consistency, not for rates. One first result in the paperimplies that for well-chosen parameters, Polya trees are able to model smoothfunctions and to induce posterior distributions with optimal convergence ratesin the minimax sense for a range of Holder regularities. Here we will follow amultiscale approach to obtaining rates and limiting shape results, introduced in[6], [7], [5], with connections to semiparametric functionals [8].

In the Bayesian nonparametrics literature, there has been a recent interestin Polya trees and related constructions. Wong and Ma [44] introduce optionalPolya trees, where the tree is cut using stopping times in a data-driven way.The work [36] studies Rubbery Polya trees, an extension of Polya trees thatenables some dependence in the tree while keeping its essential properties un-changed. Quantile pyramids [24] reverse the construction of the measure byfixing the probabilities but making the interval lengths random. As a way to‘smooth’ Polya trees, one can consider mixtures, as in [2], [22]. We also note that

2

Page 3: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Polya trees are particular cases of the more general class of tail-free processes,introduced in [13], [10]; mixtures of such processes were recently considered in[27].

An in-depth introduction to Polya trees, including their construction as wellas the proofs of many useful properties, can be found in the forthcoming bookby Ghosal and van der Vaart [19]. The present work directly benefited fromtheir exposition on the subject.

1.1 Definition

First let us introduce some notation relative to dyadic partitions. For any fixedindexes (k, l), 0 k < 2l, l � 0, the rational number r = k2�l can be writtenin a unique way as "(r) := "

1

(r) . . . "l(r), its finite expression of length l inbase 1/2 (note that it can end with one or more ‘0’). That is, "i 2 {0, 1} and

k2�l =Pl

i=1

"i(r)2�i. Let E := [l�0

{0, 1}l [ {?} be the set of finite binarysequences. We write |"| = l if " 2 {0, 1}l and |?| = 0.

Let us introduce a sequence of partitions I = {(I")": |"|=l, l � 0} of theunit interval. Here we will consider regular partitions, as defined below. Thisis mostly for simplicity of presentation, and other partitions, based for instanceon quantiles of a given distribution, could be considered as well. Set I? = [0, 1)and, for any " 2 E such that " = "(l, k) is the expression in base 1/2 of k2�l,set

I" :=

k

2l,k + 1

2l

=: I lk.

For any l � 0, the collection of all such dyadic intervals is a partition of [0, 1).A random probability measure P follows a Polya tree distribution PT (A)

with parameters A = {↵", " 2 E} on the sequence of partitions I if there existrandom variables 0 Y" 1 such that,

1. the variables Y"0 for " 2 E are mutually independent and Y"0 follows aBeta(↵"0,↵"1) distribution.

2. for any " 2 E , we have Y"1 = 1� Y"0

3. for any l � 0 and " = "1

. . . "l 2 {0, 1}l, we have

P (I") =l

Y

j=1

Y"1

..."j . (1)

This construction can be visualised using a tree representation, see Figure 1: tocompute the random mass that P assigns to the subset I" of [0, 1], one followsa dyadic tree along the expression of ": "

1

, "1

"2

, . . . , "1

"2

· · · "l = ". The massP (I") is a product of Beta variables whose parameters depend on whether onegoes ‘left’ ("j = 0) or ‘right’ ("j = 1) along the tree:

P (I") =l

Y

j=1; "j=0

Y"1

..."j�1

0

⇥l

Y

j=1; "j=1

(1� Y"1

..."j�1

0

). (2)

This construction uniquely defines a random probability distribution on distri-butions on [0, 1]. For details we refer to Ferguson [12] and Lavine [30].

3

Page 4: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

I?

I1

I11

I10

Y10

Y11

I0

I01

I00

Y00

Y01

Y0

Y1

Figure 1: Indexed binary tree with levels l 2 represented. The nodes indexthe intervals I". Edges are labelled with random variables Y".

The corresponding object, the class of Polya tree distributions, is quite flex-ible: as will be seen in the results below, di↵erent behaviours of the sequenceof parameters {↵"} give a Polya tree with di↵erent properties. A standard as-sumption is that the parameters {↵"} only depend on the depth |"|, so that

↵" = al, 8 " : |"| = l, (3)

for any l � 1 and a sequence (al)l�1

of positive numbers, which will be assumedhenceforth.

The class of Polya trees contains as special cases several important distri-butions used in Bayesian nonparametrics. A distinguished special case is theDirichlet process [11], which corresponds to the choice ↵" = 2�|"|, or more gen-erally MG(I") for some M > 0 and a given distribution function G. It can beshown that the corresponding random probability measure is discrete almostsurely and in particular does not provide a prior on densities. On the otherhand, if al goes to 1 fast enough with l, more precisely if

X

l�1

a�1

l < 1 (4)

one can show, see [29] or [35], that a Polya tree distribution on the canonicaldyadic partition has a.s. a density with respect to Lebesgue measure. As weare interested in random density priors, we shall work under that assumption.

The reader familiar with random multifractal structures will certainly havenoticed the similarity of the previous object with random multiplicative cas-cades, as introduced by Mandelbrot and further studied in [28] and many otherssince then. In random cascades, the variables Y" are assumed to be i.i.d., andconservative cascades are those for which Y"1 = 1� Y"0, as we assumed above.An important di↵erence between (density) Polya trees random measures andstandard cascades is that for Polya trees the variables Y"0 along the tree are notassumed i.i.d.: they are still independent, but their distribution may depend onthe level |"| = l in the tree: this is in fact necessary under the assumed (4). Notonly al ! 1 fast enough guarantees the existence of a density, but the faster

4

Page 5: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

al increases with l, the more ‘regular’ the corresponding density. This is partic-ularly important for the approximation ability of the prior and the statisticalproperties considered in the present paper.

Despite this di↵erences, and although we do not use properties of cascadesin the present paper, one may expect that some properties or techniques forcascades could be of interest in the study of Polya trees. We note that, for in-stance, the authors in [38] carry out a wavelet analysis of conservative cascades.In the present paper, such a multiscale analysis of the random measure at stakewill also be central, but with two important di↵erences: the variables Y"0 arenot i.i.d. and, more importantly, we do not study the Polya tree above in itself(we shall use it as prior distribution), but rather the posterior distribution, sothe random measure we analyse also depends on the data X

1

, . . . , Xn and thisdependence is crucial for the statistical properties considered here.

1.2 Function spaces and wavelets

We briefly introduce some standard notation appearing in the statements below.Haar basis. The Haar wavelet basis is {', lk, 0 k < 2l, l � 0}, where

' = 1l[0,1] and, for = �1l

(0,1/2] + 1l(1/2,1],

lk(·) = 2l/2 (2l ·�k), 0 k < 2l, l � 0.

In this paper our interest is in density functions, that is nonnegative functionsg with

´1

0

g' =´1

0

g = 1, so that their first Haar-coe�cient is always 1. So, wewill only need to consider the basis functions lk and will simply write slightlyinformally { lk} for the Haar basis.

Function classes. Let L2 = L2[0, 1] denote the space of square-integrablefunctions on [0, 1] relative to Lebesgue measure equipped with the k · k

2

-norm.

For f, g 2 L2, denote hf, gi2

=´1

0

fg. Let L1 = L1[0, 1] denote the space of allmeasurable functions on [0, 1] that are bounded up to a set of Lebesgue-measure0, equipped with the (essential) supremum norm k · k1.

The class C↵[0, 1], ↵ 2 (0, 1], of Holder functions on the interval [0, 1] is theset of functions g on [0, 1] such that supx 6=y2[0,1] |g(x)� g(y)|/|x� y|↵ is finite.Let us recall that if a function g belongs to C↵, ↵ 2 (0, 1], then the sequence ofits Haar-wavelet coe�cients hg, lki

2

satisfies

sup0k<2

l,l�0

2l(1

2

+↵)|hg, lki2

| < 1. (5)

For a given ↵ > 0, and n � 1, define

"⇤n,↵ =

log n

n

↵2↵+1

.

This is the minimax rate for estimating a density function in a ball of ↵-Holderfunctions, when the supremum-norm is considered as a loss, see [23] and [26].

1.3 Outline

In Section 2, we state our main results. Posterior rates of convergence for thedensity f are considered first. Next, a Donsker-type theorem is established for

5

Page 6: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

the cumulative distribution function, as well as a more general nonparametricBernstein-von Mises theorem. Section 3 gathers the proofs of Theorems 1 and3. The second proof uses some intermediate results obtained in the first one.Section 4 gives some technical results used in the proofs, including two lemmason Beta variables that are of independent interest.

Acknowledgement. The author would like to thank the Associate Editor andtwo referees for insightful comments and suggestions.

2 Main results

2.1 Posterior convergence rates

Our first result shows that, in the problem of density estimation, if a Polya tree isused as a prior distribution on the density, the optimal minimax convergence ratewith respect to the supremum norm is attained by the posterior distribution,provided the parameters of the tree are well chosen. Note that the choice (6)in the Theorem satisfies the summability condition

P

l a�1

l in (4) above, whichensures that the prior distribution has a density. The notation ⇧ is used for thedistribution on densities induced by the considered Polya tree.

We also note that the k · k1-norm in the result is, as defined above, theessential supremum with respect to Lebesgue measure on [0, 1]: the proof ofthe result is based on a Haar-wavelet analysis of the posterior density f , whichidentifies f Lebesgue-almost everywhere. Let, for any reals a, b, denote a ^ b =min(a, b) and a _ b = max(a, b).

Theorem 1. Let X(n) = (X1

, . . . , Xn) be i.i.d. from law P0

with density f0

.Let f

0

belong to C↵[0, 1], for ↵ 2 (0, 1] and suppose f0

is bounded away from0 on [0, 1]. Let ⇧ be the prior on densities generated by a Polya tree randommeasure with respect to the canonical dyadic partition of [0, 1] with parametersA = {↵", " 2 E} chosen as ↵" = a|"| _ 8 for any " 2 E, with

al = l22l↵, l � 0. (6)

Then as n ! 1, for any Mn ! 1, it holds

Enf0

⇧[f : kf � f0

k1 Mn"⇤n,↵ |X(n)] ! 1.

This result implies that for the considered prior, most of the mass of theposterior distribution concentrates in a k · k1 ball around f

0

of radius theminimax rate of convergence. It immediately implies rates for all Lq-norms,1 q < 1, that are minimax optimal up to a logarithmic factor. The choiceof parameters (6) realises an adequate ‘bias-variance’ trade-o↵ for which theoptimal minimax rate "⇤n,↵ is attained.

Theorem 1 assumes that log f0

is bounded. This is for simplicity of presen-tation and could be improved, though it would not add to the ideas we wantto expose here: we preferred to keep a simple condition to make proofs moretransparent.

We also have the following result.

6

Page 7: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Proposition 1. Under the same assumptions as in Theorem 1, let ⇧ a Polyatree prior ⇧ defined in the same way except that one now sets

al = l22l�, l � 0, (7)

for some � 2 (0, 1] possibly di↵erent from the Holder–regularity ↵ of f0

. Set

"⇤n,↵,� =

log n

n

↵^�2�+1

.

Then as n ! 1, for any Mn ! 1, it holds

Enf0

⇧[f : kf � f0

k1 Mn"⇤n,↵,� |X(n)] ! 1.

The proof of these results can be adapted to handle di↵erent choices ofparameters; we do not elaborate on this in details here but only note that

1. the presence of the factor l in (6) corresponds to the fact that we looked fora sharp optimal minimax rate (up to a constant) in the supremum norm.Removing this factor in the choice of al leads to a rate (log⌘ n)"⇤n,↵ for some⌘ > 0, with an extra logarithmic term, instead of "⇤n,↵ as above (somethingsimilar happens in the Gaussian white noise model with series priors, see[21] and [5]). On the other hand, the presence of an ‘l’ or not does nota↵ect results for most smooth functionals, as will be seen below.

2. results for truncated priors can be obtained similarly, for instance the choice

↵�1

" =

(

1 if |"| = l and l ln

0 if |"| = l and l > ln,(8)

where ln is defined in (12) below with � = ↵, and with the conventionthat Beta(1,1) is the Dirac mass �

1/2 distribution, leads to the sameposterior contraction rate as in Theorem 1. The proof is similar, thougheasier, as one truncates high frequencies. However, it is a n-dependentprior; in contrast the prior (6) is canonical, in the sense that it does notdepend on n. We discuss this further in Section 2.4 below.

So far, only a few results on posterior convergence in the supremum normhave been obtained, see e.g. [21], [5] and [25]. In [5] we suggested a possibleapproach to obtain such results. One of the starting points for the presentpaper is a question of a referee of [5], who asked whether some results fornon-n-dependent priors in density estimation could be obtained. The proof ofTheorem 1 gives another illustration of the approach in [5] and answers thequestion positively.

2.2 Donsker-type theorem

Let us now consider the behaviour of the cumulative distribution functionF (x) =

´ x0

f(t)dt induced by the posterior distribution when a Polya tree isused as prior. Given data X

1

, . . . , Xn, let Fn denote the empirical distributionfunction

Fn(t) =1

n

nX

i=1

1lXit.

7

Page 8: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

For � > 0, define a sequence ↵" = a|"| _ 8 for " 2 E , where

al = 22l�, l � 0 or al = l22l�, l � 0. (9)

For a prior ⇧ on densities induced by a Polya tree distribution with parame-ters as in (9), let fn denote the posterior mean

´fd⇧(f |X) and let Fn(t) =´ t

0

fn(u)du denote its distribution function. In the next result, L(G) denotes thelaw of a process G, and L(F |X) denotes the induced posterior distribution onF . Also, on a metric space S, such as the space of C[0, 1] continuous functionson [0, 1] equipped with the supremum norm, we denote by �S the bounded-Lipschitz metric on S, which metrises weak convergence on S. The definitionof �S is recalled in (27) in the Appendix, where more details can be found.

Theorem 2 (Donsker’s theorem for Polya tree posteriors). Let X = (X1

, . . . , Xn)be i.i.d. from law P

0

with density f0

. Let f0

belong to C↵[0, 1], for some ↵ 2 (0, 1]and suppose f

0

is bounded away from 0 on [0, 1]. Let ⇧ be a Polya tree PT(A)with parameters A = {↵", " 2 E} such that ↵" = a|"| for all " 2 E, with (al) isas in (9) for some � > 0.

Let GP0

be a P0

-Brownian bridge GP0

(t), t 2 [0, 1]. For any parameters↵ 2 (0, 1], � > 0, as n ! 1,

�C[0,1]

L(pn(F � Fn) |X),L(GP

0

)�

!Pf0 0.

Furthermore, for any ↵ 2 (0, 1] and � such that � < 1/2 + ↵, as n ! 1,

�L1[0,1]

L(pn(F � Fn) |X),L(GP

0

)�

!Pf0 0.

In particular, the last display holds true if � 1/2, regardless of the value of ↵.

This result parallels Lo’s result [33] for the Dirichlet process, here in a regimewhere the Polya tree as well as the true law P

0

have a density. A few results ofthis type have been obtained in the literature since then, mostly for priors whoserealisations are discrete measures, like the Dirichlet process, see the introductionof [7] for some references.

A possible route for proving such a result is a direct analysis of the inducedposterior on F (·). Here we use the approach proposed in [7] and obtain it as afairly direct consequence of a more general result on the shape of the posteriordistribution stated in the next section.

2.3 Limiting shape of the posterior distribution

We focus now on limiting shape results for (aspects of) the posterior density f .For this we follow the approach to nonparametric Bernstein–von Mises (BvM)theorems introduced in [7]. The idea is to formulate convergence in distributionof the posterior density to a Gaussian process in a large enough spaceM

0

definedbelow that enables convergence at rate

pn. Once convergence in distribution

on M0

is obtained, it will typically be possible to deduce results via continuousmapping for continuous functionals : M

0

! Y for some given space Y.First we need a sequence of ‘weights’ w := {wl}l�0

, such that wl/pl " 1.

The space M0

= M0

(w) is defined as the multiscale sequence space

M0

=n

x = {xlk} : liml!1

maxk

|xlk|wl

= 0o

, (10)

8

Page 9: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

equipped with the norm kxklk := supl maxk |xlk|/wl. It is a separable Banachspace. A (possibly generalised) function f is said to belong toM

0

if the sequenceof its wavelet coe�cients hf, lki over the Haar basis { lk} belongs to M

0

.Now we define the limiting process. For P a given probability distribution

on [0, 1], let GP be the Gaussian process indexed by the Hilbert space L2(P ) ⌘{f : [0, 1] ! R :

´1

0

f2dP < 1} with covariance function

E [GP (g)GP (h)] =

ˆ1

0

(g � Pg)(h� Ph)dP.

We call GP the P -white bridge process. It can be checked, see [7], that GP ,provided wl/

pl " 1, is a tight Borel Gaussian variable in M

0

.The first statement in Theorem 3 automatically recenters the posterior dis-

tribution around the posterior mean. Typically, one may wish to center in-stead around a ‘canonical’ centering, in that its definition does not dependon the posterior. This can be achieved by comparing fn, for instance, to asmoothed version of the empirical measure. Let Pn denote the empirical mea-sure n�1

Pni=1

�Xi associated to the observed data X and let, for Ln defined in(13) below,

hTn, lki =(

hPn, lki if l Ln

0 if l > Ln,(11)

and Tn is a tight random variable in M0

. For a given � > 0, let jn = jn(�) andln = ln(�) be the largest integers such that

2jn n1

2�+1 , 2ln ✓

n

log n

1

2�+1

. (12)

and set, in slight abuse of notation, either

Ln = jn (8n � 1) or Ln = ln (8n � 1). (13)

As before, let fn denote the posterior mean´fd⇧(f |X).

In the next result, �M0

(w)

denotes the bounded-Lipschitz metric on M0

(w),see (27) in the Appendix for a definition.

Theorem 3. Let X = (X1

, . . . , Xn) be i.i.d. from law P0

with density f0

. Letf0

belong to C↵[0, 1], for some ↵ 2 (0, 1] and suppose f0

is bounded away from 0on [0, 1]. Let ⇧ be a Polya tree PT(A) with parameters A = {al, l � 1}, whereal is as in (9) for some � > 0.

Let ⌧¯fn : f !

pn(f � fn) and let z = {zl}l be a weighting sequence such

that zl/pl " 1. For any parameters ↵ 2 (0, 1], � > 0, as n ! 1,

�M0

(z)(⇧(· |X) � ⌧�1

¯fn,GP

0

) !Pf0 0.

If � ↵, if Tn is given by (11) and ⌧Tn : f !pn(f � Tn), then as n ! 1,

�M0

(z)(⇧(· |X) � ⌧�1

Tn,GP

0

) !Pf0 0.

This result has several applications relevant for statistics. One applica-tion, that we only mention, is the construction of confident credible bands for

9

Page 10: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

fixed regularities, see [7] Section 4.2. Another application is the derivation ofBernstein-von Mises theorems for semiparametric functionals via the continu-ous mapping theorem. A prototypical example is the map f !

´ ·0

f = F (·),leading to a Donsker-type result for the distribution function F , as stated inthe previous section. Indeed, it has been shown in [7] that the map f ! Fis continuous from M

0

(w) to C[0, 1]. Theorem 2 then essentially follows fromTheorem 3 combined with Theorem 4 in [7], see Section 4 for a detailed proof.Results for smooth linear functionals also fairly directly follow from Theorem 3.For details on this and several other examples of functionals, we refer to [6]-[7].

2.4 Discussion

First let us address two natural questions about the results.Are Polya trees not too ‘rough’ as a prior to obtain nontrivial posterior rates

of contraction? A reason to ask is that one can show that any version of theposterior density Polya tree has a jump at any point of the subdivisions of [0, 1]corresponding to the successive partitions, that is at all the dyadic rationals forregular dyadic partitions. But this of course does not prevent the object to havegood approximation properties, similar to the fact that histograms can be usedto approximate e.g. Lipschitz functions (note that Polya tree densities are nothistograms though), and our results show that this is indeed the case. Evenmore, it can be checked that at any non-dyadic point x

0

of [0, 1], the densityinduced by a Polya tree with parameters is locally ↵–Holder at point x

0

. Thisis of course in line with the result of Theorem 1 that for such choice of al theposterior has optimal concentration around ↵–Holder functions.

Is the choice al = 22l↵, or l22l↵, reasonable in practice? Indeed, one maythink that such an increase in the Beta-parameters, that is exponential, couldbe ‘hard to fit’ in practice. The theoretical results show good behaviour of theposterior for this choice though, and we claim that this exponential behaviour isthe ‘correct one’ if one whishes to model all frequencies. Indeed, the exponentialgrowth corresponds to the exponentially fast decrease of the width of dyadicintervals I". It is simply that wavelet coe�cients, which here are modelledthrough products of Beta variables, naturally decrease exponentially fast forHolder classes, see Eq. (5). Similarly, in regression, typical Gaussian processesused as prior distributions have variances decreasing as a power of 2�l, whichis exponentially fast, too. Of course, one may also consider priors that truncatehigh frequencies as in Eq. (8), in which case the contraction rate is essentiallydriven by the cut-o↵ point, not so much by the individual variance parameters,similar to what has been noted e.g. in [39].

The results are also part of a more general programme linked to obtainingBernstein-von Mises results, as well as posterior contraction in strong lossessuch as the supremum norm. For instance, the results are of interest for

1. Bernstein-von Mises theorems for ‘smooth’ functionals. In [8], we obtainedlimiting posterior shape results on a family of random histograms. Onemay note that the truncated version of the Polya tree defined by (8) is alsoa random histogram, but with a quite di↵erent randomness in the weights.Similar to what is noted below Theorem 1, Theorems 2 and 3 above canbe checked to hold for this prior as well. The histograms in [8] can be seenas histograms-projections of the Dirichlet process, giving Dirichlet weights.

10

Page 11: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Here the weights are not Dirichlet, but correspond to a tree-type productof Beta variables.

2. non-parametric BvM and posterior contraction in supremum norm. In [6],[7], [5], a multiscale approach was developed and a programme to obtainresults of this type was proposed. Only a few examples of priors have beeninvestigated within this framework so far, and investigating other classesof prior distributions is of great interest. One may note for instance thatin the density estimation model, the priors considered in [7], [5] were alln-dependent (note that, in fact, there cannot be a non-n-dependent versionof the random histograms as in [7], [5], as the corresponding underlyinginfinite dimensional prior would be a Dirichlet process, which has no den-sity). The Polya-tree class of priors in the present paper precisely providesan example of such canonical prior.

We plan to study further properties of Polya trees in future work. Amongothers one can mention two natural questions. First, adaptation: here we havestudied the case where the Holder regularity parameter � of f

0

is given. Severalconstructions can be considered to build an adaptive prior, that automaticallyadapts to the unknown regularity �. Second, results for higher regularities, thatis � � 1, would also be of interest.

Extensions to higher regularities. A first question is whether the analysis ofthe present paper could be carried out, for the same Polya tree prior, to higherregularities ↵ > 1 using higher order wavelets. Indeed, one may think thatchoosing al = l22l↵ may continue to lead to optimal rates even when ↵ > 1.We believe this is not the case. One indication why this is presumably not thecase is that under the prior distribution, it can be checked that locally aroundany point x

0

2 [0, 1], the density is not more than locally C1 even if ↵ > 1 (asnoted above, when ↵ 1 the prior density when al = l22l↵ is locally C↵ at anynon-dyadic point). The intuition is that having all Y" independent at a samelevel l = |"| creates ‘too much independence’ between values at di↵erent pointsto produce a highly smooth density.

The second question is whether a di↵erent, cascade–like tree–induced schemefor sequentially defining random masses P (I") of intervals could produce a ran-dom density with a given arbitrary smoothness level ↵ possibly larger than1. Such a construction could for instance be inspired by the schemes defin-ing wavelets bases { lk} that enable to capture smoother regularities (e.g.Daubechies or boundary-corrected wavelets) compared to the Haar basis H

lk .The point is to understand whether this is could be done while still preservinga form of conjugacy: here conjugacy is obtained at a given level with a multi-nomial likelihood on the one hand (the data produce counts NX(I") on eachdyadic I") and a finite-tree prior of Beta distributions for interval probabilitieson the other hand. It is thus quite directly related to a definition of f viainner-products with indicators hf, 1lI"i, which naturally leads to the Haar ba-sis. Would there exist a conjugate structure that would enable to define hf, lkialong a tree–like scheme ? constructions instead. This will be studied elsewhere.

11

Page 12: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

3 Proofs

3.1 Preliminaries and notation

By the standard conjugacy property of Polya trees, see [12], [31], if P followsa PT (A) distribution, the posterior distribution P |X

1

, . . . , Xn follows a Polyatree distribution PT (A⇤) with respect to the same partition and with updatedparameters A⇤ = {↵⇤

" , " 2 E}, where

↵⇤" = ↵" +NX(I"), (14)

with NX(I") =Pn

i=1

I{Xi 2 I"}.The following sets of notation will be used throughout the proofs.

1. Tilded notation, posterior distribution. We denote by P a distribution sam-pled from the posterior distribution and by Y the corresponding variablesY in (1). In particular, the variable Y"0 is Beta(↵⇤

"0,↵⇤"1

) distributed.

2. Bar notation, posterior mean. Let f =´fd⇧(f |X) denote the posterior

mean density and P the corresponding probability measure. We use thenotation Y for the variables defining P via (1).

3. Paths along the tree. A given " = "1

· · · "l 2 E gives rise to a ‘path’"1

! "1

"2

! "1

"2

· · · "l. We denote

I [i]" := I"1

..."i ,

for any i in {1, . . . , l}. Similarly, denote, with EX the expectation underthe posterior distribution,

Y [i]" = Y"

1

..."i , Y [i]" = EX [Y [i]

" ].

Conversely, any pair (l, k) with l � 0 and k 2 {0, . . . , 2l � 1} is associatedwith a unique " = "(l, k), the expression of length l in base 1/2 of k2�l.

For a given distribution P with distribution function F and density f on[0, 1], denote P (B) = F (B) =

´Bf , for any measurable subset B of [0, 1]. In

particular under the ‘true’ distribution, we denote P0

(B) = F0

(B) =´Bf0

. Inthe sequel C denotes a universal constant whose value only depends on otherfixed quantities of the problem.

For a function f in L2, and Ln an integer, denote by fLn the L2-projection off onto the linear span of all elements of the basis { lk} up to level l = Ln. Also,denote fLc

n the projection of f onto the orthocomplement Vect{ lk, l > Ln}. Inthe proofs, we shall use the decomposition f = fLn+fLc

n , which holds in L2 andL1 under prior and posterior: this follows from Lemma 7, which gives su�cientconditions in terms of the sequence (al) for both L2 and L1 statements. Bothconditions of the Lemma are satisfied for (al) of the form (6) or (9).

3.2 Proof of Theorem 1

Proof. Define Ln to be the integer such that

2Ln = bc0

n

log n

1

1+2↵

c, (15)

12

Page 13: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

for c0

a small enough constant to be chosen below.Step 0. Haar decomposition and an event B. First, we define an event B

on the data space. For any integer l, set ⇤n(l)2 := (l + Ln)n/2l. Recall thenotation I lk from Section 1.1. Define B as the event on which, simultaneouslyfor the countable family of indexes l � 1, 0 k < 2l, for M large enough to bechosen,

M�1|NX(I lk)� nF0

(I lk)| ⇤n(l) _ (l + Ln), (16)

where as before NX(I) is the number of data points in I. By Lemma 4, we have

Pnf0

(Bc) = o(1). (17)

Let us now decompose, using the notation above for the projection,

f � f0

= (fLn � fLn) + (fLn � fLn0

) + fLcn � f

Lcn

0

, (18)

which holds in L1 (Lebesgue-almost surely) and in L2.For any given l, k, for " = "(l, k) the expression in base 1/2 of k2�l, let us

write I"(l,k) = I"0 [ I"1. For P a probability measure of density f and { lk}the Haar basis, by definition hf, lki

2

= 2l/2(P (I"1) � P (I"0)). For a functiong, denote by glk its coe�cients onto the Haar basis. If P follows a Polya treedistribution with density f , we thus have the equality in law

flk := hf, lki2

= 2l/2P (I")(1� 2Y"0). (19)

To start with, let us note that

kfLcn

0

k1 =�

X

l>Ln,k

f0,lk lk

1

X

l>Ln

maxk

|f0,lk|

X

k

| lk|�

1.

X

l>Ln

2�l↵ . "⇤n,↵,

using that f0

is Holder and the definition of Ln. We now focus successivelyon each of the remaining terms in the decomposition (18), before putting thebounds together and concluding.

Step 1, term fLn � fLn0

in (18). Given indexes l, k, and " = "(l, k),

flk = 2l/2P (I")(1� 2Y"0)

as well as, with y"0 := F0

(I"0)/F0

(I"),

f0,lk = 2l/2P

0

(I")(1� 2y"0).

This leads to the expression

flk � f0,lk = f

0,lk

P (I")

P0

(I")� 1

+ 2l2

+1P (I")(y"0 � Y"0).

Combining Lemmas 1 and 2 and the fact that F0

(I") . 2�l leads to, on B,

|flk � f0,lk| . |f

0,lk|(

lX

i=1

ai2i

n+

r

Ln2l

n

)

+ (2laln

|f0,lk|+

r

Ln

n)

. |f0,lk|

(

al2l

n+

r

Ln2l

n

)

+

r

Ln

n.

13

Page 14: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

From this deduce that, on the event B,

kfLn � fLn0

k1 .LnX

l=0

2l/2 max0k<2

l�1

|flk � f0,lk|

.LnX

l=0

l2l(↵+1)n�1 +

r

Ln2Ln

n

. 2�Ln↵ +

r

Ln2Ln

n. "⇤n,↵,

where we have used al l22l↵ and |f0,lk| . 2�l(1/2+↵).

Step 2, term fLn�fLn in (18). We define an eventA for which⇧[Ac |X(n)] =o(1). We aim at having each variable Y" defining the posterior law not too farfrom its expectation. In terms of f , this means a control on

´I[i+1]

"f/´I[i]"f for

all admissible i, " = "(l, k) with l Ln. Let A be the measurable set of densitiesf on which, simultaneously for all possible ", i,

|Y [i]" � Y [i]

" | = |ˆI[i]"

f/

ˆI[i�1]

"

f �ˆ ˆ

I[i]"

f/

ˆI[i�1]

"

f

d⇧(f |X(n))|

M

s

Ln

nF0

(I [i]" )=: r[i]" . (20)

Let us check that the complement of A has small posterior probability. By

definition Y[i]" follows a Beta distribution of parameters 'i = ai + NX(I"

1

..."i)

and i = ai+NX(I"1

...(1�"i)). So 'i^ i � ai � 8 and 'i+ i = 2ai+NX(I [i�1]

" ).

Also, one has Y [i]" = 'i/('i + i). By Lemma 2, this ratio is bounded below by

a constant times F0

(I [i]" )/F0

(I [i�1]

" ), for n large enough. Indeed, the remainderterm in Lemma 2 is a o(1) for our choices of al, Ln and using the regularity off0

. So 'i/('i + i) is bounded away from 0 and 1.

Now one can apply Lemma 6, with x = ML1/2n /2 and M a constant to be

chosen below. First one checks that

'i + i � NX(I [i�1]

" ) � NX(I [i]" )

� nF0

(I [i]" )�p

2Lnn2�i

� nF0

(I [i]" )/2,

as nF0

(I [i]" )/2 ⇣ n2�i andp2Lnn2�i = o(n2�i) uniformly in i Ln. Thus for

n large enough, for x = ML1/2n /2, using Remark 1,

P

2

4|Y [i]" � Y [i]

" | > xq

nF0

(I [i]" )

3

5 De�x2/4.

A union bound leads to, for A defined by (20) and d a small enough constant,

⇧[Ac |X(n)] .X

lLn

2le�dM2

logn,

14

Page 15: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

which tends to 0 for M a large enough constant. Now

flk � flk = 2l/2P (I")

"

P (I")

P (I")(1� 2Y"0)� (1� 2Y"0)

#

= 22l/2P (I")(Y"0 � Y"0) + 2l/2P (I")

"

P (I")

P (I")� 1

#

h

1� 2Y"0 + 2(Y"0 � Y"0)i

= 22l/2P (I")(Y"0 � Y"0) +

"

P (I")

P (I")� 1

#

h

flk + 22l/2P (I")(Y"0 � Y"0)i

,

where by definition

P (I")

P (I")� 1 =

lY

j=1

Y[i]"

Y[i]"

� 1.

By Lemma 2, the mean Y"0 is close to F0

(I"0)/F0

(I") when l Ln, and similarlyfor Y"1. In particular it is bounded away from 0 and 1. So on B, one can replace

|Y [i]" � Y

[i]" | in (20) by |Y [i]

" /Y[i]" � 1| up to multiplying the upper bound r

[i]"

in (20) by a universal constant. The conditions of Lemma 3 are satisfied, asLn2Ln/n is a o(1). Deduce that on A and on the event B,

P (I")

P (I")� 1

.l�1

X

i=0

r[i]" .r

Ln2l

n.

On the other hand, we directly have with (20) that |Y"0 � Y"0| .p

Ln2l/n.Conclude that for any f in A and on the event B,

|flk � flk| . |flk|r

Ln2l

n+ 2l/2P (I")

"

r

Ln2l

n+

Ln2l

n

#

. |flk|r

Ln2l

n+

r

Ln

n,

using that, on B, the mean P (I") is within a constant of 2�l. By the triangleinequality |flk| |flk � f

0,lk| + |f0,lk|. Now it su�ces to notice that the terms

induced by flk�f0,lk and f

0,lk respectively have already been dealt with in Step1 above. Deduce that, on B, for f in A,

kfLn � fLnk1 . "⇤n,↵ +

r

Ln2Ln

n. "⇤n,↵.

Step 3, term fLcn in (18). For any R > 1 to be chosen, denoting by EX the

expectation under the posterior distribution,

EXkfLcnk1

X

l>Ln

2l/2EX

maxk

|flk|�

X

l>Ln

2l/2

2

4

2

l�1

X

k=0

EX |flk|R3

5

1/R

,

15

Page 16: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

where we have used Holder’s inequality and bounded the max by the sum. From(19), with " = "(l, k), using independence of Y s given the data along a path inthe tree,

EX |flk|R = 2lR/2EX P (I")REX |1� 2Y"0|R. (21)

First we deal with the last expectation in (21). We apply Lemma 5 with a =al +NX(I"0), d = NX(I"0)�NX(I"1) and

R = Rn := log2

n.

To do so, we check that the condition 2|d| a is satisfied on the event B. LetMl = M((ln/2l)1/2 _ l) be the constant appearing in Lemma 4 when l � Ln.Since Y"0 ⇠ Beta(al +NX(I"0), al +NX(I"1)), we note that, on B,

|NX(I"0)�NX(I"1)| n|F0

(I"0)� F0

(I"1)|+Ml

. n2�l(1+↵) +Ml . Ml,

where we have used that f0

is ↵-Holder, that l > Ln and that

al +NX(I"0) � al + nF0

(I"0)�Ml

& al + n2�l �Ml,

so the condition is satisfied on B for l > Ln. Lemma 5 now implies that

EX [|1� 2Y"0|R] (C|NX(I"0)�NX(I"1)|/al)R + (CR/al)R/2,

(CMl/al)R + (CR/al)

R/2 . (CR/al)R/2

for n large enough so that R = Rn � R0

_M2. For the first expectation termin (21), the formula for the R-th moment of a Beta variable leads to

EX [P (I")R] =

l�1

Y

i=0

R�1

Y

r=0

QX,"(i, r),

QX,"(i, r) =ai +NX(I [i+1]

" ) + r

2ai +NX(I [i]" ) + r.

Let us distinguish the two regimes i Ln and Ln i l. Let us write

NX(I [i]" ) = nF0

(I [i]" ) +M"(i). When i Ln

QX,"(i, r) =F0

(Ii+1

" )

F0

(Ii")

1 + n�1(ai +M"(i+ 1) + r)/F0

(Ii+1

" )

1 + n�1(2ai �M"(i) + r)/F0

(Ii")

F0

(Ii+1

" )

F0

(Ii")

h

1 +ai +M"(i+ 1) + r

nF0

(Ii+1

" )

ih

1 +M"(i)

nF0

(Ii")

i

,

where to obtain the last inequality we have bounded the denominator from belowand used the inequality 1/(1� x) 1 + x for x < 1, as well as nF

0

(Ii") & n2�i

and, for any i Ln, by definition of Ln,

M"(i)2i

n M

h

r

i2i

n_ i2i

n

i

.r

Ln2Ln

n= o(1).

16

Page 17: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Deduce, for C3

, C4

large enough constants,

QX,"(i, r) F0

(Ii+1

" )

F0

(Ii")

h

1 + C3

2iai +pin2i/2 + 2iRn

n

ih

1 + C3

pin2i/2

n

i

,

F0

(Ii+1

" )

F0

(Ii")

h

1 + C4

2iai +pin2i/2 + 2iRn

n

i

.

This implies that, for some C5

> 0,

LnY

i=0

QX,"(i, r) F0

(ILn+1

" )LnY

i=0

h

1 + C4

2iai +pin2i/2 + 2iRn

n

i

. 2�Ln exp

(

C4

LnX

i=0

2iai +pin2i/2 + 2iRn

n

)

. 2�Ln exp

C5

2LnaLn +pLnn2Ln/2 + 2LnRn

n

. 2�Ln ,

where the last exponential term is bounded due to the definitions of Ln,↵Ln

and Rn. Now in the regime Ln i l,

QX,"(i, r) =ai

1 + NX(I[i+1]

" )

ai+ r

ai

2ai⇣

1 + NX(I[i]" )

2ai+ r

2ai

1

2

1 +NX(I [i+1]

" )

ai+

r

ai

.

This implies that, on B, for some C6

> 0, with M"(i) pin2�i/2 + i,

lY

i=Ln+1

QX,"(i, r) 2�(l�Ln) exp

(

lX

i=Ln+1

nF0

(Ii+1

" ) +M"(i) +Rn

ai

)

2�(l�Ln) exp

(

C6

lX

i=Ln+1

n2�i +pin2�i/2 + i+Rn

ai

)

2�(l�Ln) exp

C6

⇣ n

Ln2Ln(1+2↵)+

pn

pLn2Ln(

1

2

+2↵)+

2Rn

22Ln↵

. 2�(l�Ln),

using again the definition of Ln. This leads to the bound

EX [P (I")R] .

R�1

Y

r=0

(C2�Ln2Ln�l) . (C2�l)R.

Conclude that, with R = Rn = log2

n and some c > 0,

EXkfLcnk1

X

l>Ln

2l/2h

2l2lR/2CR2�lR(CR/al)R/2

i

1/R

.X

l>Ln

2cl/RnR1/2n a

�1/2l .

X

l>Ln

2cl

log

2

n

r

log n

l2�l↵

.X

Ln<l<log

2

n

C2�l↵ +X

l>log

2

n

2(c

log

2

n�↵)l

17

Page 18: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

For n large we have c(log2

n)�1 ⌫↵, for any fixed ⌫ > 0. The first sum in thelast display is less than a constant times "⇤n,↵ and the second sum is less than

n�(1�⌫)↵. By choosing ⌫ < (2↵)/(2↵ + 1), the second sum is thus of smallerorder. Conclude that EXkfLc

nk1 . "⇤n,↵.Now putting together the di↵erent bounds obtained, for any Mn ! 1,

setting Tn := {f : kf � f0

k1 Mn"⇤n,↵} and using Markov’s inequality,

Enf0

⇧[T cn |X] En

f0

⇧[T cn |X]1lB + En

f0

⇧[T cn |X]1lBc

Enf0

⇧[T cn \A |X]1lB + En

f0

⇧[Ac |X] + o(1)

M�1

n "⇤n,↵�1En

f0

EX [1lf2Akf � f0

k1]1lB + o(1)

. M�1

n + o(1) = o(1).

This concludes the proof of Theorem 1.

Lemma 1. Let " 2 E with |"| = l, for some l Ln and Ln defined by (15).Suppose, for any i l Ln, that ai i22↵i. Then, on the event B defined by(16), for c

0

in (15) small enough, for n large enough,

P (I")

P0

(I")� 1

C

"

lX

i=1

ai2i

n+

r

Ln2l

n

#

.

Proof. Notice that P (I") and P0

(I") can be written as the productsQl

i=1

wi

andQl

i=1

yi respectively, with

wi =ai +NX(I [i]" )

2ai +NX(I [i�1]

" ), yi =

F0

(I [i]" )

F0

(I [i�1]

" ).

On the event B, we have NX(I [i]" ) = nF0

(I [i]" ) + �i," where �i," is controlledbelow. That is,

wi = yi1 + n�1(ai + �i,")/F0

(I [i]" )

1 + n�1(2ai + �i�1,")/F0

(I [i�1]

" ).

By definition of B, for l Ln we have |�i,"| CpnLn2�i. Since Ln2Ln =

o(n), this bound is always of smaller order than n2�i . nF0

(I [i]" ), since f0

isbounded away from 0. So the denominator of the last expression is boundedaway from 0. Deduce, for any 1 i Ln,

wi

yi� 1

C

"

ai2i

n+

r

Ln2i

n

#

.

For large n and c0

in (15) small enough, the last display is smaller than 1. Also,

lX

i=1

wi

yi� 1

C

"

lX

i=1

ai2i

n+

r

Ln2l

n

#

,

which remains bounded. An application of Lemma 3 completes the proof.

18

Page 19: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Lemma 2. Let " 2 E with |"| = l and I" = I lk, for some admissible indexes l, k.Then, on the event B defined by (16), for any l Ln,

Y"0 �F0

(I"0)

F0

(I")

C2l/2

n

2lal+1

|f0,lk|+

p

nLn

.

Proof. Similar to the proof of Lemma 1, let NX(I"0) = nF0

(I"0)+�l+1,", so that

Y"0y"0

� 1�

=�

1 + (al+1

+ �l+1,")/nF0

(I"0)

1 + (2al+1

+ �l,")/nF0

(I")� 1

Cal+1

n|F

0

(I"0)�1 � 2F

0

(I")�1|+ C

n

⇣ |�l,"|F0

(I")+

|�l+1,"|F0

(I"0)

C22lal+1

n2�l/2|f

0,lk|+ C2l/2(nLn)1/2

n,

on B, where we have used the bound |�l,"|+ |�l+1,"| C(nLn2�l)1/2.

Lemma 3. Let {yi}1iL, {wi}1iL be two sequences of positive real numberssuch that there are constants c

1

, c2

with

max1iL

wi

yi� 1

c1

< 1,LX

i=1

wi

yi� 1

c2

< 1.

Then there exists c3

depending on c1

, c2

only such that

LY

i=1

wi

yi� 1

c3

LX

i=1

wi

yi� 1

.

Proof. It su�ces to bound e⇣�1 from above and below, where ⇣ =P

log(wi/yi).For the upper bound, one uses log(1 + u) |u| followed by e|v| � 1 ec2 |v| for|v| c

2

. For the lower bound, one uses log(1+u) � �(1�c1

)�1|u| if |u| c1

< 1followed by e�C|u| � 1 � �C|u|.

3.3 Proof of Theorem 3

Proof. By Lemma 8, it is enough to check that finite-dimensional projectionsconverge (28), and the tightness-type property (29) at rate 1/

pn.

Finite-dimensional projections. First, let us formulate the problem in termsof convergence for histograms.

The finite-dimensional subspace VJ is Vect{', lk, 0 k < 2l, l J}. Notethat, if K = J + 1, it coincides with the space of all histograms on the dyadicregular grid of [0, 1] of meshwidth 2�K . So, if Ji = ((i� 1)2�K , i2�K), one alsohas VJ = Vect{2K1lJi , 1 i 2K} and ⇡VJ f has the explicit expression

⇡VJ f = 2K2

KX

i=1

ˆJi

f⌘

1lJi = 2K2

KX

i=1

F (Ji)1lJi .

The distribution of ⇡VJ f can be equivalently specified by the joint distributionof (F (J

1

), . . . , F (J2

K )).

19

Page 20: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Below we show that if f is a draw from the posterior distribution ⇧[· |X],

⇣pn(

ˆJi

f � Pn(Ji))⌘

1i2

K! (GP

0

1lJi))1i2

K , (22)

where the convergence is in distribution in R2

K

, in probability under P0

, andPn(I) = NX(I)/n is the mass the empirical measure associated to the dataX

1

, . . . , Xn puts on an interval I.To prove (22), we exhibit a parametric model where the same distributions

as in (22) arise, and where the convergence holds under P0

. Set ⇥ ⌘ S2

K =

{(✓1

, . . . , ✓2

K ) 2 (0, 1)2K

,P

i ✓i = 1} the interior of the unit simplex in R2

K

.Consider the parametric model

P ⌘ PK =n

P = Pg, g = 2K2

KX

i=1

gi1lJi , (g1

, . . . , g2

K ) ⌘ ✓ 2 ⇥o

.

It consists of positive densities that are regular histograms with 2K bins. Asusual the unit simplex ⇥ can be identified to the subset of R2

K�1 consisting of(✓

1

, . . . , ✓2

K�1

) such that 0 < ✓i < 1 for all 1 i 2K � 1 and

2

K�1

X

i=1

✓i < 1.

Define a prior distribution ⇧K on ⇥ viewed as a subset of R2

K�1 by ‘cutting’the Polya tree distribution at level K. That is, define the joint law of (gi)

1i2

K

as the joint law of (F (Ji))1i2

K , where F is sampled from a Polya tree withthe prescribed parameters.

The algebraic expression, given data X1

, . . . , Xn, of the induced posteriordistribution on (F (J

1

), . . . , F (J2

K )) in the original model with the originalPolya tree prior, and the posterior distribution of (g

1

, . . . , g2

K ) in model PK

with prior ⇧K , are the same: this follows by the conjugacy properties of thebeta-distributions with respect to the likelihood, which is of multinomial type.The posterior distribution has, under both models, a tree-type structure: the

posterior of F (I") has same law as a product of Y [i]" ’s, i K, which are Beta

variables with updated parameters ↵⇤" = ↵" +NX(I"). In particular, note that

the joint posterior distribution of (F (J1

), . . . , F (J2

K )) only depends on the datathrough the counts NX(Ji), for 1 i 2K .

Now, the counts (NX(Ji), 1 i 2K), have the same distribution underX ⇠ Pf

0

, the original true model, and under X ⇠ Pf[K]

0

, where

f[K]

0

= 2K2

KX

i=1

F0

(Ji)1lJi

is the L2-projection of f onto PK . This is because the counts are multinomially

distributed with parameters F0

(Ji) = F[K]

0

(Ji), both under Pf0

and Pf[K]

0

.

Deduce that the induced posterior distribution on PK has same law underPf

0

as the posterior in model PK with prior ⇧K and under Pf[K]

0

. To the

latter distribution one can apply the parametric Bernstein-von Mises theorem, as

20

Page 21: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

stated e.g. in [42] Chapter 10, and we now check the corresponding assumptions.The model PK is smoothly parameterised, and in particular di↵erentiable inquadratic mean. The testing condition (10.3) from [42] is easily verified usingHellinger-type tests: denoting by P✓ an arbitrary element of PK , one checks thatfor any ✓, ✓0 2 ⇥, the squared-Hellinger distance between P✓ and P✓0 verifies(✓� ✓0)2 . h2(P✓, P✓0) . (✓� ✓0)2. In this context, the existence of appropriateHellinger tests follows from the works by Le Cam [32] and Birge [3], see e.g. [4]Corollary 1. Finally, the prior ⇧K has a positive density in a neighborhood of

✓0

, the element of the simplex corresponding to f[K]

0

, since all parameters ↵"are strictly positive. Conclude that the posterior in model PK converges in thetotal variation distance

k⇧K [· |X] � ⌧�1 �N(0,�K)kTV !P

f[K]

0 0,

where �K is the inverse Fisher information matrix in the model PK at ✓0

,the element of the simplex in R2

K

with coordinates ✓0,j = F

0

(Ji) (recall that

one may identify the simplex ⇥ with a subset of R2

K�1, dropping out the lastcoordinate, as usual; also, (�K)i,j = ✓

0,i1li=j � ✓0,i✓0,j) and

⌧ : ✓ !pn(✓ � ✓K),

with ✓K := (NX(J1

)/n, . . . , NX(J2

K�1

)/n) = (Pn(J1), . . . ,Pn(J2

K�1

)).Deduce that, in the model PK , for any real numbers b

1

, . . . , b2

K , the posteriordistribution of

pn

2

KX

i=1

bi(✓i � ✓i) =pn

2

K�1

X

i=1

(bi � b2

K )(✓i � ✓i)

converges to (b· � b2

K )T�K(b· � b2

K ). This coincides with

EP0

(GP0

2

KX

i=1

bi1lJi)2 = VarP

0

(2

KX

i=1

bi1lJi).

Indeed, rewriting the expression using 1 =P

2

K

i=1

1lJi ,

VarP0

(2

KX

i=1

bi1lJi) = VarP0

(2

K�1

X

i=1

bi1lJi + b2

K (1�2

K�1

X

i=1

1lJi))

=X

1i,j2

K�1

(bi � b2

K )(bj � b2

K )CovP0

(1lJi , 1lJj ),

where

CovP0

(1lJi , 1lJj ) =

ˆ(1lJi � P

0

(Ji))(1lJi � P0

(Ji))dP0

= 1li=j✓0,i � ✓0,i✓0,j = (�K)i,j ,

recalling that here ✓0

is the vector of ✓0,i = F

0

(Ji) = P0

1lJi , which leads to

VarP0

(2

KX

i=1

bi1lJi) = (b· � b2

K )T�K(b· � b2

K ).

21

Page 22: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

By Cramer-Wold, this shows that the left hand-side of (22) converges in distri-bution to a centered normal limit, with the same covariance structure as that ofthe right-hand-side of (22), in P

0

-probability. This establishes (28), with cen-tering Tn given by (11). Instead of centering at the empirical counts, one canalso center at the posterior mean, as can be checked by a simple computation(this also follows from the bound on flk � flk obtained below and applied forfinite l J).

Tightness. One now needs to check (29). We will exploit several intermediateresults obtained along the proof of Theorem 1. Those are obtained under aspecific choice of al that depends on the regularity of f

0

. Nevertheless, it is easyto check that most statements in that proof remain valid when al is one of thetwo sequences in (9), provided the cut-o↵ level Ln is redefined, for each choiceas in (9), as in (12)-(13), namely

2Ln := Jn := bn1

1+2� c or 2Ln := Jn := b⇣ n

log n

1

1+2� c

respectively. The events B on the data-space and A from the proof of Theorem1 are redefined accordingly, using this definition of Ln.

In particular, we will repeatedly use the bound, established in the proof ofTheorem 1, on the set A and on B, for any l Ln,

P (I")

P (I")� 1

.r

Ln2l

n.

r

Ln2Ln

n= o(1). (23)

Note that establishing this bound did not require any specific smoothness con-ditions on f

0

.We apply Lemma 8 below. First we note that one can work with the posterior

conditioned to the set A, that is ⇧[· |X,A]. This is allowed thanks to Remark 2below Lemma 8. Along the proof, one can go back to the original posterior byusing ⇧[· |X,A] = ⇧[· \ A |X]⇧[A |X]�1. As ⇧[A |X] = 1 + oP (1), this doesnot a↵ect the following argument. To simplify the notation, in the sequel weomit the conditioning on A when writing posterior quantities.

The quantity under expectation on the left-hand side of (29) in Lemma 8is kf � fLnkM

0

(z), that we split into kfLn � fLnkM0

(z) and kfLcnkM

0

(z). Wehave, for say M > 1, and EX denoting expectation under the posterior,

EX

pnmax

lLn

z�1

l maxk

|flk � flk|�

M +

ˆ 1

M

pnmax

lLn

z�1

l maxk

|flk � flk| > u |X�

du

M +X

lLn,k

ˆ 1

M

⇧⇥

|flk � flk| > uzl/pn |X

du.

The di↵erence flk � flk can be bounded in terms of P (I") and Y s: using theidentities linking the function f to the variables Y s obtained in the proof of

22

Page 23: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Theorem 1, one obtains

|flk � flk|

1 +

"

P (I")

P (I")� 1

#

2l/2+1P (I")|Y"0 � Y"0|+

P (I")

P (I")� 1

|flk|

. 2�l/2|Y"0 � Y"0|+

P (I")

P (I")� 1

|flk|

=: (a) + (b),

where to bound the first term we have used (23) and the fact that P (I") . 2�l

holds on B thanks to Lemma 1 (which, again, holds for the adapted choices of(al) and Ln as above, and ↵ in the statement replaced by �). We now bound(a) and (b) successively.

On the event B, the variable Y"0 is Beta-distributed, with parameters that arewithin constants of nF

0

(I"0) and nF0

(I"1) respectively. Both are thus boundedabove and below by multiples of n2�|"| = n2�l. It now follows from Lemma 6that, for u � M > 1,

(a) � zl2pnu |X

"

|Y"0 � Y"0| �2

l2 zl

4pnu+

2l2 zl

4pn

|X#

"

|Y"0 � Y"0| �2

l2 zl

4pnu+

2

n2�l|X

#

. e�Cz2

l u2

,

where the second inequality holds because zl & 8/(n2�l)1/2 uniformly in l Ln,for n large enough.

We now deal with the term (b) and use the following intermediate boundobtained in the proof of Theorem 1, on B,

|flk| |f0,lk|

(

2lal+1

n+

r

Ln2l

n

)

+

r

Ln

n.

Note that the last bound is always smaller than C2�l/2 for l Ln. So, whenwe evaluate the posterior probability

"

P (I")

P (I")� 1

|flk| >zl

2pnu |X

#

,

one can assume that u pnc

0

2�l/2z�1

l , for c0

an arbitrarily small fixed con-stant, otherwise the posterior probability in the last display is 0 (recall that (23)implies that the term in factor of |flk| in the last display goes to 0). Denote

Ai(u) =

(

f : |Y [i]" (f)� Y [i]

" | uzl

r

2i

n

)

.

23

Page 24: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Denoting Pr as a shorthand for ⇧[· |X],

Pr⇣

\

il

Ai(u)⌘

Pr⇣

f 2\

il

Ai(u),X

il

|Y [i]" (f)� Y [i]

" | C2

l2 zl

2pnu⌘

Pr⇣

f 2\

il

Ai(u),X

il

Y[i]" (f)

Y[i]"

� 1�

C 0 2l2 zl

2pnu⌘

Pr⇣

lY

i=1

Y[i]" (f)

Y[i]"

� 1

C 0 2l2 zl

2pnu⌘

,

where we have used that Y[i]" is bounded away from 0 and 1, as follows from

Lemma 2, and for the last inequality we have used Lemma 3 together with thefact that uzl(2i/n)1/2 c

0

can be made as small as desired for c0

small enough.This implies, using again that |flk| . 2�l/2, that

(b) >zl

2pnu |X

"

P (I")

P (I")� 1

>2

l2 zl

2pnu |X

#

l

X

i=1

⇧[Ai(u)c |X] le�cz2

l u2

,

as for the term (a) above.Combining the obtained bounds on (a) and (b) leads to, for some c > 0,

EX

pnmax

lLn

z�1

l maxk

|flk � flk|�

. M +X

lLn,k

l

ˆ 1

M

e�cz2

l u2

du

. M +X

lLn

l2le�cz2

l M2

.

The last quantity is bounded as soon as M is chosen large enough (note thatthe previous bound holds in probability, since we work on the event B and haverestricted to the set A with ⇧[A |X] = 1 + oP (1)).

Now we focus on bounding the part l > Ln. Proceeding as in the proof ofTheorem 1, one can obtain bounds for EX [maxk |flk|]. First, using Lemma 5,

EX [|1� 2Y"0|R .✓

CR

al

R2

+

Cn2�l(1+↵) +Ml

al

◆R

.

From this via (21) and as in the proof of Theorem 1 it follows

EX [maxk

|flk|] . 2�l/2

"

l

al

1/2

+n2�l(1+↵) +Ml

al

#

2cl/ log

2

n. (24)

We distinguish the two cases � ↵ and � > ↵. In the undersmoothing case� ↵, the first term in the last bracket dominates, as n2�l(1+↵) . Ml and

24

Page 25: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Ml (lal)1/2, since both � ↵ and l > Ln. From this we directly deduce that,on the event B, when � ↵, and for any {zl} with zl �

pl,

EX

pnmax

l>Ln

z�1

l maxk

|flk|�

.pn

X

l>Ln

z�1

l

pl2�l/2a

�1/2l 2cl/ log

2

n

.pn

X

l>Ln

2�l/2a�1/2l ,

which is bounded by 1 for both choices of (al) in (9) given our choice of Ln.Also, as EXflk = flk and

pnmax

l>Ln

z�1

l maxk

|flk| EX

pnmax

l>Ln

z�1

l maxk

|flk|�

,

the bound in the last but one display also holds with a di↵erent constant whenflk is replaced by flk � flk.

In the oversmoothing case � > ↵, the first term in the bracket in (24) domi-

nates if n2�l(1+↵)+Ml = o(a1/2l ). This is the case for l � �n, for �n := C log2

nwith C large enough. So for l � �n, one can use the same argument as in thecase � ↵. For Ln l �n, one should work with flk � flk as a whole insteadof separating both terms. It follows from (19) that

flk � flk = �2l/2+1P (I")(Y"0 � Y"0) + 2l/2(P (I")� P (I"))(1� 2Y"0)

= �2l/2+1P (I")(Y"0 � Y"0) +

"

P (I")

P (I")� 1

#

flk

= (i) + (ii).

As Y"0 = EX [Y"0], the term (i) nearly coincides with flk, except that the biashas been substracted from Y"0, so one can use (21) combined with the estimateof E|Y �EY |R obtained in Lemma 5. This leads to the same estimate as in theundersmoothing case, that is

EX

pnmax

l>Ln

z�1

l maxk

|(i)|�

.pn

X

l>Ln

2�l/2a�1/2l . 1.

For the term (ii), to control the term P (I")/P (I") � 1, one proceeds as in theproof of Theorem 1, extending the event A to an event A0 on which, for T largeenough to be chosen below,

|Y [i]" � Y [i]

" | TL1/2n

q

nF0

(I [i]" ) + ai

=: ⇢[i]" ,

for any i |"| and any " such that |"| �n. As in the proof of Theorem 1, onechecks that, on the event B,

⇧[(A0)c |X] .X

l�n

2le�CT 2

logn = o(e�cT 2

logn)

25

Page 26: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

for T large enough, as well as the fact that on A0 and on the event B,�

P (I")

P (I")� 1

.l�1

X

i=0

⇢[i]" .l�1

X

i=0

L1/2np

n2�i + ai.

r

Ln2Ln

n,

since we are in the regime l > Ln (one also uses the fact that Y[i]" is bounded

away from 0 and 1 for both i Ln and i > Ln). Using the expression (ii) onededuces

EX

pn max

Ln<l<�n

z�1

l maxk

|(ii)|1lA0

.pn

r

Ln2Ln

nmax

Ln<l<�n

z�1

l maxk

|flk|

p

Ln2LnEX

maxLn<l<�n

z�1

l maxk

|flk|�

Next one bounds the maximum in l by the sum and uses the general bound(24). This shows that the display is a o(1). Finally, using the rough bound(ii) C2l C2�n and bounding probabilities and Y" by 1, one gets on B

EX

pn max

Ln<l<�n

z�1

l maxk

|(ii)|1lA0c

.pn2�n⇧[(A0)c |X] = o(1)

as ⇧[(A0)c |X] can be made an arbitrary large power of n�1 by choosing T largeenough.

Gathering the bounds obtained for l Ln and l > Ln leads to (29) withcentering at the posterior mean. This proves the first assertion of Theorem 3.

To derive the second assertion of Theorem 3, first note that hTn, lki =hPn, lki equals flk := 2l/2(F (I"1) � F (I"0)) := 2l/2(NX(I"1) � NX(I"0)), for" = "(l, k) and l Ln. It su�ces to show that kfLn � fLnkM

0

(z) is a oP (n�1/2).

Given indexes l, k, and " = "(l, k), we have flk = 2l/2P (I")(1 � 2Y"0). Onthe other hand,

flk = 2l/2(F (I"1)� F (I"0)) = 2l/2F (I")(1� 2F (I"0)

F (I")).

One controls the di↵erence flk � flk in a similar way as for flk � f0,lk in the

proof of Theorem 1. Similar considerations as in Lemmas 1 and 2, but this timewith F (I") playing the role of P

0

(I"), lead to, on the event B as before,

|flk � flk| .2l

nal|flk|+ 2

l2 F (I lk)2

3l2 nal+1

r

Ln

n

. 2l

nal|flk|+ (2�l +

r

Ln

n)22l

nal+1

r

Ln

n

. 2l

nal|f0,lk|+

2l

nal+1

r

Ln

n,

where we have used that |flk| . (2lal/n)|f0,lk| +p

Ln/n on B, as in the proofof Theorem 1.

Consider the case al = 22l�, the case al = l22l� being dealt with similarly.The last term in the above display verifies,

maxlLn

h

z�1

l

2l

nal+1

r

Ln

n

i

z�1

Ln

2Ln

naLn+1

r

Ln

n= o(n�1/2).

26

Page 27: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

As |f0,lk| . 2�l(1/2+↵), deduce, when � ↵, on the event B,

kfLn � fLnkM0

(z) . maxlLn

h

z�1

l

2l

n22l�2�l( 1

2

+�)i

. 2Ln(1

2

+�)

pLnn

= o(n�1/2).

4 Appendix: remaining proofs

4.1 Proof of Proposition 1

One follows the steps of the proof of Theorem 1. The cut-o↵ is taken equal tothe cut-o↵ ln in (12). For low frequencies l ln, one uses the same argumentsas in the proof of Theorem 1 with the new cut-o↵, similar to what is done in

the proof of Theorem 3. For high-frequencies, one separates flcn0

and f lcn . Forthe latter, one uses Lemma 5 and this time both terms on the right-hand sideof the inequality in Lemma 5 matter, depending on how large l is. The proof islargely similar to that of Theorem 1, so details are left to the reader.

4.2 Proof of Theorem 2

Proof. One applies Theorem 4 in [7], in the space M0

(z), where we take thesequence (zl) to be zl = 2l/2/l2, and the centering Tn = fLn , with f definedin the proof of Theorem 3 above. To do so, let us check that the conditions ofTheorem 4 in [7] are satisfied. By definition

P

l zl2�l/2 is finite. Also, it follows

from (the proof of) Theorem 3 that the posterior recentered at fLnn satisfies the

Bernstein-von Mises theorem in M0

(z). One now checks that, for the abovechoice of zl, one has kfLn � fLnkM

0

(z) = oP (1). This is done in a similar wayas in the proof of Theorem 3. The di↵erence is in the estimate involving f

0

: thelast estimate in that proof then becomes

maxlLn

h

z�1

l

2l

n22l�2�l( 1

2

+↵)i

. 2Ln(1

2

+�)

nL2

n2Ln(��↵� 1

2

).

As � < ↵ + 1/2, this bound is a oP (n�1/2), which leads to the first statementof Theorem 2. The second statement follows from the fact that, by a directcomputation one can show that

pnkFLn

n �Fnk1 is a oP (1) whenever � < ↵+1/2,as in the proof leading to Remark 9 in [20].

4.3 An event of small probability

Let Bl be the collection of events defined by, for l � 0 and a sequence Ln ! 1,

Bl =n

max0k<2

l|NX(I lk)� nF

0

(I lk)| M(p

l + Ln

r

n

2l_ (l + Ln))

o

Lemma 4. Let X1

, . . . , Xn be i.i.d. of density f0

on [0, 1], with f0

bounded awayfrom 0 and infinity. Then for M large enough and any Ln ! 1, as n ! 1,

Pnf0

h

[

l�0

Bcl

i

= o(1).

27

Page 28: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Proof. For given fixed indexes k, l � 1, Bernstein’s inequality applied to thevariables 1l{Xi2Il

k} gives the bound, for y > 0,

Pnf0

|NX(I lk)� nF0

(I lk)| > ny⇤

2 exp

� n2y2/2

nF0

(I lk)(1� F0

(I lk)) + ny/3

2 exp

� Cny2

2�l + y

.

Set ny = nyn to be the term appearing on the right hand side of the sign in thedefinition of Bl. There are two regimes for the index l, depending on which oneof the quantities in the maximum on the right hand side of the definition of Bl

dominates. In each of the regimes, the last display is bounded by C 0e�CM(l+Ln),for any k. Conclude using that

P

l 2le�CM(l+Ln) = o(1) for large M .

4.4 Lemmas on Beta variables

Lemma 5. There exist universal constants a0

, R0

, c1

such that, for any a � a0

,any integer R � R

0

and d a real number such that |d| a/2, if Y follows aBeta(a, a+ d) distribution,

E|1� 2Y |R 2R�1

|EY � 1

2|R + E|Y � EY |R

(c1

d/a)R + (c1

R/a)R/2

Proof. The first inequality follows by convexity of u ! uR on R+. For thefirst term on the right hand side one uses that by definition of Y , |2EY � 1| =d/(2a+ d) d/(3a). For the second term, one writes

E|Y � EY |R = R

ˆ 1

0

P[|Y � EY |R � u]up�1du

R

2

2a+ d

◆R

+R

ˆ 1

2

2a+d

P[|Y � EY |R � u]up�1du.

The last integral is bounded by, using Lemma 6 below and 1/3 a/(2a+ d) 1/2, and denoting s = 2a+ d,

ˆ 1

0

P

|Y � EY |R � wps+

2

s

�✓

wps+

2

s

◆R�1

dwps

1ps

ˆ 1

0

wps+

2

s

◆R�1

De�w2

4 dw

C2R

s�R2

ˆ w

0

wR�1e�w2

4 dw + s1

2

�R

.

By the standard formula on absolute moments of normal variables,

ˆ w

0

wR�1e�w2

4 dw . �

R� 1

2

CR2 . (CR)

R2 .

Combining the previous bounds leads to the result.

28

Page 29: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

Lemma 6. Let ', belong to (0,1). Let Z follow a Beta(', ) distribution.Suppose, for some reals c

0

, c1

,

0 < c0

'/('+ ) c1

< 1 (25)

' ^ > 8. (26)

Then there exists D > 0 depending on c0

, c1

only such that for any x > 0,

P

|Z � E[Z]| > xp'+

+2

'+

De�x2/4.

Remark 1. The bound in Lemma 6 can be read as a sub-Gaussian bound onBeta variables with ‘balanced’ (' and are roughly of the same order via (25))and ‘large enough’ (via (26)) parameters. Under (26) and if x � 1, which isthe case for the applications considered here, the term 2/('+ ) can always beabsorbed in the first term, up to a change in the constants.

Proof. The density of Z can be written eg/´1

0

eg, where for u 2 (0, 1),

g(u) = ('� 1) log u+ ( � 1) log(1� u).

The mean of Z is E[Z] = '/(' + ) and its mode m is the mode of g on(0, 1), noting that g has a unique maximum on (0, 1) if (26) is assumed. Solvingg0(m) = 0 and simple algebra reveal that

|E[Z]�m| =�

'�

('+ � 2)('+ )

2

'+

|'� |'+

2

'+ .

That is, to prove the inequality, it is enough to bound

P [|Z �m| > B] =

´|u�m|>B

eg(u)du´1

0

eg(u)du,

where B ⌘ B(x) = x/p'+ . To do so, we bound numerator and denominator

in the last expression by deriving two bounds on g.The first bound is g(u) � g(m) < �(' + )(u �m)2/4, for any u in (0, 1),

which follows from Taylor’s formula together with the fact that

�g00(u) ='� 1

u2

+ � 1

(1� u)2>

('+ )

2(u�2 ^ (1� u)2) >

'+

2.

The second bound controls g close to m. First suppose m 1/2 and let usbound g on J := (m,m+1/

p'+ ). We claim that J is contained in (c

0

/(1 +8�1), 3/4). The right boundary follows from combining m 1/2 and (26). Theleft boundary is obtained from

m ='+ 1

'+ + 2� '

'+

1

1 + 8�1

� c0

1 + 8�1

,

29

Page 30: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

where the first bound uses (26) and the second bound uses (25). Now for any uin J , we may write g(u) = g(m) + g00(⇣)(u�m)2/2, for some ⇣ 2 J . But

|g00(⇣)| ('+ )(⇣�2 + (1� ⇣)�2).

Using the previous bounds on the endpoints of J , one deduces that

|g(u)� g(m)| c, 8u 2 J,

where c depends on c0

only. In the case that m > 1/2, we instead bound g onJ 0 := (m� 1/

p'+ ,m). Using the symmetric bound

m 1�

'+

1

1 + 8�1

1� 1� c1

1 + 8�1

,

one has J 0 ⇢ (1/4, 1� (1� c1

)/(1 + 8�1)) from which we deduce as before that|g(u)� g(m)| c0 for any u 2 J 0, where c0 depends on c

1

only.Combining the previous bounds, one obtains, in the case m 1/2,

P [|Z �m| > B]

´|u�m|>B

e�('+ )(u�m)

2/4du´Je�cdu

ecˆ|v|>x

e�v2/4dv De�x2/4,

for D large enough, using the standard bound´1x

e�u2

du Ae�x2

, for anyx > 0 and a universal constant A. The case m > 1/2 follows similarly.

4.5 Membership to L2 and L1

Lemma 7. Let ⇧ be the distribution on densities generated by a Polya treewith parameters ↵" = al, for any |"| = l and l � 0, and some sequence (al)l�0

.Under condition (4), that is

1X

l=0

a�1

l < 1,

a density f drawn from ⇧ belongs to L2[0, 1], ⇧-almost surely. In other words

⇧[f :´1

0

f2 < 1] = 1. Under the stronger condition that for some � > 1/2,

1X

l=0

2l�

a�1/2l < 1,

a density f drawn from ⇧ belongs to L1[0, 1], ⇧-almost surely. Moreover, inthis case, ⇧-almost surely, f is also (Lebesgue)-almost everywhere the sum ofits Haar wavelet series. Also, all these statements hold under the posteriordistribution ⇧[· |X] as well.

Proof. For convergence in L2, it is enough to check that the sequence (flk),with flk = hf, lki the Haar wavelet coe�cients of f , is square integrable ⇧-a.s., which is implied if

Eh

X

l,k

f2

lk

i

< 1,

30

Page 31: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

where E denotes the expectation under the prior distribution. From the expres-sion (19), for I" = I"(l,k) one gets E(f2

lk) = 2lE[P (I")2]E(1� 2Y"0)2, and

E[P (I")2] =

l�1

Y

i=0

1

2

ai + 1

2ai + 1= 2�2l

l�1

Y

i=0

1 + a�1

i

1 + (2ai)�1

2�2lePl�1

i=0

a�1

i ,

as well as E(1�2Y"0)2 = 4Var(Y"0) = 1/(2al+1). Deduce that the expectationat stake is finite as soon as (4) holds.

For the supremum norm, one first checks that the series

X

l�1

2

lX

k=1

flk lk

is normally converging ⇧-almost surely. For this it is enough to verify, askP

k | lk|k1 2l/2, that, denoting by E the expectation under the prior dis-tribution,

Eh

X

l

2l/2 maxk

|flk|i

< 1.

Using Holder’s inequality, and next bounding the maximum by the sum, thisexpectation is bounded, for some R > 1, by

X

l

2l/2"

2

l�1

X

k=0

E|flk|R#

1

R

X

l

2l"

2

l�1

X

k=0

EP (I"(l,k))RE|1� 2Y"(l,k)0|R

#

1

R

.

The first expectation in the last display is bounded proceeding as for the L2

norm above, but using the formula for the R-th moment of a Beta variableinstead of the second. This yields

EP (I"(l,k))R =

l�1

Y

i=0

R�1

Y

r=0

↵i + r

2↵i + r= 2�lR

R�1

Y

r=0

l�1

Y

i=0

1 + r/↵i

1 + r/(2↵i)

2�lRR�1

Y

r=0

ePl�1

i=0

r/↵i 2�lReCR2

.

For the second expectation to evaluate, we use Lemma 5 with a = al and d = 0to obtain E|1� 2Y"(l,k)0|R (c

1

R/al)R/2. Combining the previous bounds oneobtains that the considered expectation is bounded by

X

l

2l(2l2�lRecR2

(c1

R/al)R/2)1/R .

X

l

2l/RecR

a1/2l

pR.

Taking R = Cpl, the last sum converges by assumption, which shows normal

convergence ⇧-almost surely. Deduce that the Haar-wavelet series of f ⇧-a.s.converges in L1, to an element say g 2 L1. As the wavelet series converges inL2 to f by the first part of the proof, deduce that

´(f � g)2 = 0, so that f = g

a.e.. That is, f belongs to L1 and coincides with the sum of its wavelet seriesalmost everywhere, ⇧-almost surely. Finally, the statement about the posteriordistribution follows from the fact that, under (4), the Polya tree posterior isabsolutely continuous with respect to the prior, see e.g. [9].

31

Page 32: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

4.6 Weak convergence and BvM phenomenon in M0(w)

Convergence in distribution of random variables Xn !d X in a metric space(S, d) can be metrised by metrising weak convergence of the induced laws L(Xn)to L(X) on S. Here we work with the bounded-Lipschitz metric �S : Let µ, ⌫be probability measures on (S, d) and define

�S(µ, ⌫) ⌘ supF :kFkBL1

ˆS

F (x)(dµ(x)� d⌫(x))

, (27)

kFkBL = supx2S

|F (x)|+ supx 6=y,x,y2S

|F (x)� F (y)|d(x, y)

.

Lemma 8 (Proposition 6 in [7]). Let ⇡VJ , J 2 N, be the projection operator ontothe finite-dimensional space spanned by the lk’s with scales up to l J . Letf ⇠ ⇧(·|X), Tn = Tn(X), let ⇧n denote the laws of

pn(f � Tn) conditionally

on X. Assume that the finite-dimensional distributions converge, that is,

�VJ

⇧n � ⇡�1

VJ,GP

0

� ⇡�1

VJ

!P0 0, as n ! 1, (28)

for all J 2 N, and that for some sequence z = (zl) " 1, zl/pl � 1,

E

supl

z�1

l maxk

|hf � Tn, lki|�

�X

= OP0

1pn

. (29)

Then, for any w such that wl/zl " 1 we have, as n ! 1,

�M0

(w)

(⇧n,GP0

) !P0 0.

Remark 2. The result still holds true if f ⇠ ⇧(·|X) is replaced by f ⇠ ⇧(· |X)for random measures ⇧(· |X) s.t.

�M0

(⇧(· |X),⇧(· |X)) !P0 0

as n ! 1. Likewise, the posterior can be replaced by the conditional posterior⇧(· |X,Dn) for any sequence of sets Dn such that ⇧(Dn |X) !P

0 1.

References

[1] A. Barron, M. J. Schervish, and L. Wasserman. The consistency of posteriordistributions in nonparametric problems. Ann. Statist., 27(2):536–561, 1999.

[2] J. O. Berger and A. Guglielmi. Bayesian and conditional frequentist testing ofa parametric model versus nonparametric alternatives. J. Amer. Statist. Assoc.,96(453):174–184, 2001.

[3] L. Birge. Sur un theoreme de minimax et son application aux tests. Probab.Math. Statist., 3(2):259–282, 1984.

[4] L. Birge. Robust tests for model selection. IMS Collect., 9:47–64, 2013.

[5] I. Castillo. On Bayesian supremum norm contraction rates. Ann. Statist.,42(5):2058–2091, 2014.

32

Page 33: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

[6] I. Castillo and R. Nickl. Nonparametric Bernstein–von Mises Theorems in Gaus-sian white noise. Ann. Statist., 41(4):1999–2028, 2013.

[7] I. Castillo and R. Nickl. On the Bernstein–von Mises phenomenon for nonpara-metric Bayes procedures. Ann. Statist., 42(5):1941–1969, 2014.

[8] I. Castillo and J. Rousseau. A Bernstein–von Mises theorem for smooth function-als in semiparametric models. Ann. Statist., 43(6):2353–2383, 2015.

[9] L. Draghici and R. V. Ramamoorthi. A note on the absolute continuity andsingularity of Polya tree priors and posteriors. Scand. J. Statist., 27(2):299–303,2000.

[10] J. Fabius. Asymptotic behavior of Bayes’ estimates. Ann. Math. Statist., 35:846–856, 1964.

[11] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann.Statist., 1:209–230, 1973.

[12] T. S. Ferguson. Prior distributions on spaces of probability measures. Ann.Statist., 2:615–629, 1974.

[13] D. A. Freedman. On the asymptotic behavior of Bayes’ estimates in the discretecase. Ann. Math. Statist., 34:1386–1403, 1963.

[14] D. A. Freedman. On the asymptotic behavior of Bayes estimates in the discretecase. II. Ann. Math. Statist., 36:454–456, 1965.

[15] S. Ghosal. The Dirichlet process, related priors and posterior asymptotics. InBayesian nonparametrics, Camb. Ser. Stat. Probab. Math., pages 35–79. Cam-bridge Univ. Press, Cambridge, 2010.

[16] S. Ghosal, J. K. Ghosh, and R. V. Ramamoorthi. Consistent semiparametricBayesian inference about a location parameter. J. Statist. Plann. Inference,77(2):181–193, 1999.

[17] S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posteriordistributions. Ann. Statist., 28(2):500–531, 2000.

[18] S. Ghosal and A. van der Vaart. Convergence rates of posterior distributions fornon-i.i.d. observations. Ann. Statist., 35(1):192–223, 2007.

[19] S. Ghosal and A. van der Vaart. Fundamentals of Nonparametric Bayesian In-ference. 2015. Forthcoming.

[20] E. Gine and R. Nickl. Uniform limit theorems for wavelet density estimators.Ann. Probab., 37(4):1605–1646, 2009.

[21] E. Gine and R. Nickl. Rates on contraction for posterior distributions in Lr-metrics, 1 r 1. Ann. Statist., 39(6):2883–2911, 2011.

[22] T. Hanson and W. O. Johnson. Modeling regression error with a mixture of Polyatrees. J. Amer. Statist. Assoc., 97(460):1020–1033, 2002.

[23] R. Z. Has0minskiı. A lower bound for risks of nonparametric density estimates inthe uniform metric. Teor. Veroyatnost. i Primenen., 23(4):824–828, 1978.

[24] N. Hjort and S. Walker. Quantile pyramids for Bayesian nonparametrics. Ann.Statist., 37(1):105–131, 2009.

33

Page 34: P´olya tree posterior distributions on densitiesGeorge Polya in Annales de l’Institut Henri Poincar´e. It was indeed shown in [34] that Polya trees are de Finetti measures of certain

[25] M. Ho↵mann, J. Rousseau, and J. Schmidt-Hieber. On adaptive posterior con-centration rates. Ann. Statist., 43(5):2259–2295, 2015.

[26] I. A. Ibragimov and R. Z. Has0minskiı. An estimate of the density of a distribution.Zap. Nauchn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI), 98:61–85, 161–162, 166, 1980. Studies in mathematical statistics, IV.

[27] A. Jara and T. E. Hanson. A class of mixtures of dependent tail-free processes.Biometrika, 98(3):553–566, 2011.

[28] J.-P. Kahane and J. Peyriere. Sur certaines martingales de Benoit Mandelbrot.Advances in Math., 22(2):131–145, 1976.

[29] C. H. Kraft. A class of distribution function processes which have derivatives. J.Appl. Probability, 1:385–388, 1964.

[30] M. Lavine. Some aspects of Polya tree distributions for statistical modelling.Ann. Statist., 20(3):1222–1235, 1992.

[31] M. Lavine. More aspects of Polya tree distributions for statistical modelling. Ann.Statist., 22(3):1161–1176, 1994.

[32] L. LeCam. Convergence of estimates under dimensionality restrictions. Ann.Statist., 1:38–53, 1973.

[33] A. Lo. Weak convergence for Dirichlet processes. Sankhya, 45(1):105–111, 1983.

[34] R. D. Mauldin, W. D. Sudderth, and S. C. Williams. Polya trees and randomdistributions. Ann. Statist., 20(3):1203–1221, 1992.

[35] M. Metivier. Sur la construction de mesures aleatoires presque surement absolu-ment continues par rapport a une mesure donnee. Z. Wahrscheinlichkeitshtheorieund Verw. Gebiete, 20:332–344, 1971.

[36] L. E. Nieto-Barajas and P. Muller. Rubbery Polya tree. Scand. J. Stat., 39(1):166–184, 2012.

[37] G. Polya. Sur quelques points de la theorie des probabilites. Ann. Inst. H.Poincare, 1(2):117–161, 1930.

[38] S. Resnick, G. Samorodnitsky, A. Gilbert, and W. Willinger. Wavelet analysis ofconservative cascades. Bernoulli, 9(1):97–135, 2003.

[39] V. Rivoirard and J. Rousseau. Posterior concentration rates for infinite dimen-sional exponential families. Bayesian Anal., 7(2):311–333, 2012.

[40] L. Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw.Gebiete, 4:10–26, 1965.

[41] X. Shen and L. Wasserman. Rates of convergence of posterior distributions. Ann.Statist., 29(3):687–714, 2001.

[42] A. W. van der Vaart. Asymptotic statistics. Cambridge Univ.Press, Cambridge,1998.

[43] S. Walker. New approaches to Bayesian consistency. Ann. Statist., 32(5):2028–2043, 2004.

[44] W. H. Wong and L. Ma. Optional Polya tree and Bayesian inference. Ann.Statist., 38(3):1433–1459, 2010.

34