
Indefinitely Oscillating Martingales

Jan Leike Marcus Hutter

August 14, 2014

Abstract

We construct a class of nonnegative martingale processes that oscillate indefinitely with high probability. For these processes, we state a uniform rate of the number of oscillations and show that this rate is asymptotically close to the theoretical upper bound. These bounds on probability and expectation of the number of upcrossings are compared to classical bounds from the martingale literature. We discuss two applications. First, our results imply that the limit of the minimum description length operator may not exist. Second, we give bounds on how often one can change one's belief in a given hypothesis when observing a stream of data.¹

Keywords. martingales, infinite oscillations, bounds, convergence rates, minimum description length, mind changes.

1 Introduction

Martingale processes model fair gambles where knowledge of the past or choice of betting strategy have no impact on future winnings. But their application is not restricted to gambles and stock markets. Here we exploit the connection between nonnegative martingales and probabilistic data streams, i.e., probability measures on infinite strings. For two probability measures P and Q on infinite strings, the quotient Q/P is a nonnegative P-martingale. Conversely, every nonnegative P-martingale is a multiple of Q/P P-almost everywhere for some probability measure Q.

One of the famous results of martingale theory is Doob's Upcrossing Inequality [Doo53]. The inequality states that in expectation, every nonnegative martingale has only finitely many oscillations (called upcrossings in the martingale literature). Moreover, the bound on the expected number of oscillations is inversely proportional to their magnitude. Closely related is Dubins' Inequality [Dub62], which asserts that the probability of having many oscillations decreases exponentially with their number. These bounds are given with respect to oscillations of fixed magnitude.

In Section 4 we construct a class of nonnegative martingale processes that have infinitely many oscillations of (by Doob necessarily) decreasing magnitude. These martingales satisfy uniform lower bounds on the probability and the expectation of the number of upcrossings. We prove corresponding upper bounds in Section 5, showing that these lower bounds are asymptotically tight. Moreover, the construction of the martingales is agnostic regarding the underlying probability measure, assuming only mild restrictions on it. We compare these results to the statements of Dubins' Inequality and Doob's Upcrossing Inequality and demonstrate that our process makes those inequalities asymptotically tight. If we drop the uniformity requirement, asymptotics arbitrarily close to Doob's and Dubins' bounds are achievable. We discuss two direct applications of these bounds.

¹ This is the extended technical report. The conference version can be found at [LH14].

The Minimum Description Length (MDL) principle [Ris78] and the closely related Minimum Message Length (MML) principle [WB68] recommend to select among a class of models the one that has the shortest code length for the data plus code length for the model. There are many variations, so the following statements are generic: for a variety of problem classes MDL's predictions have been shown to converge asymptotically (predictive convergence). For continuous independent and identically distributed data the MDL estimator usually converges to the true distribution [Gru07, Wal05] (inductive consistency). For arbitrary (non-i.i.d.) countable classes, the MDL estimator's predictions converge to those of the true distribution for single-step predictions [PH05] and ∞-step predictions [Hut09]. Inductive consistency implies predictive convergence, but not the other way around. In Section 6 we show that indeed, the MDL estimator for countable classes is inductively inconsistent. This can be a major obstacle for using MDL for prediction, since the model used for prediction has to be changed over and over again, incurring the corresponding computational cost.

Another application of martingales is in the theory of mind changes [LS05]. How likely is it that your belief in some hypothesis changes by at least α > 0 several times while observing some evidence? Davis recently showed [Dav13] using elementary mathematics that this probability decreases exponentially. In Section 7 we rephrase this problem in our setting: the stochastic process

P ( hypothesis | evidence up to time t )

is a martingale bounded between 0 and 1. The upper bound on the probability of many changes can thus be derived from Dubins' Inequality. This yields a simpler alternative proof for Davis' result. However, because we consider nonnegative but unbounded martingales, we get a weaker bound than Davis.

2 Strings, Measures, and Martingales

We presuppose basic measure and probability theory [Dur10, Chp. 1]. Let Σ be a finite set, called alphabet. We assume Σ contains at least two distinct elements. For every u ∈ Σ∗, the cylinder set

Γu := {uv | v ∈ Σω}

is the set of all infinite strings of which u is a prefix. Furthermore, fix the σ-algebras

Ft := σ({Γu | u ∈ Σt})   and   Fω := σ(⋃_{t=1}^∞ Ft).

(Ft)t∈N is a filtration: since Γu = ⋃_{a∈Σ} Γua, it follows that Ft ⊆ Ft+1 for every t ∈ N, and Ft ⊆ Fω for all t by the definition of Fω. An event is a measurable set E ∈ Fω. The event Ec := Σω \ E denotes the complement of E. See also the list of notation in Appendix A.1.
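To make the prefix structure concrete, the following Python sketch (our illustration, not part of the original text) represents a cylinder set Γu by its finite prefix u and checks the decomposition Γu = ⋃_{a∈Σ} Γua at the level of probability mass for a hypothetical Bernoulli(2/3) measure; the helper name bernoulli_cylinder is ours.

    # Sketch: cylinder sets represented by their finite prefixes, under the
    # assumption of a Bernoulli(2/3) measure on Sigma = {'0', '1'}.
    def bernoulli_cylinder(u, p_one=2/3):
        """Probability P(Gamma_u) of all infinite strings starting with prefix u."""
        prob = 1.0
        for symbol in u:
            prob *= p_one if symbol == '1' else 1 - p_one
        return prob

    u = '0110'
    # Gamma_u is the disjoint union of Gamma_{u0} and Gamma_{u1}, so its mass splits:
    assert abs(bernoulli_cylinder(u)
               - (bernoulli_cylinder(u + '0') + bernoulli_cylinder(u + '1'))) < 1e-12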


Definition 1 (Stochastic Process). (Xt)t∈N is called an (R-valued) stochastic process iff each Xt is an R-valued random variable.

Definition 2 (Martingale). Let P be a probability measure over (Σω,Fω). An R-valued stochastic process (Xt)t∈N is called a P-supermartingale (P-submartingale) iff

(a) each Xt is Ft-measurable, and

(b) E[Xt | Fs] ≤ Xs (E[Xt | Fs] ≥ Xs) almost surely for all s, t ∈ N with s < t.

A process that is both P -supermartingale and P -submartingale is called P -martingale.

We call a supermartingale (submartingale) process (Xt)t∈N nonnegative iff Xt ≥ 0 for all t ∈ N.

A stopping time is an (N ∪ {ω})-valued random variable T such that {v ∈ Σω | T(v) = t} ∈ Ft for all t ∈ N. Given a supermartingale (Xt)t∈N, the stopped process (X_{min{t,T}})t∈N is a supermartingale [Dur10, Thm. 5.2.6]. If (Xt)t∈N is bounded, the limit of the stopped process, XT, exists almost surely even if T = ω (Martingale Convergence Theorem [Dur10, Thm. 5.2.8]). We use the following variant on Doob's Optional Stopping Theorem for supermartingales.

Theorem 3 (Optional Stopping Theorem [Dur10, Thm. 5.7.6]). Let (Xt)t∈N be a nonnegative supermartingale and let T be a stopping time. The random variable XT is almost surely well defined and E[XT] ≤ E[X0].

For two probability measures P and Q on (Σω,Fω), the measure Q is called absolutely continuous with respect to P on cylinder sets iff Q(Γu) = 0 for all u ∈ Σ∗ with P(Γu) = 0. We exploit the following two theorems that state the connection between probability measures on infinite strings and martingales. For two probability measures P and Q the quotient Q/P is a nonnegative P-martingale if Q is absolutely continuous with respect to P on cylinder sets. Conversely, for every nonnegative P-martingale there is a probability measure Q on (Σω,Fω) such that the martingale is P-almost surely a multiple of Q/P and Q is absolutely continuous with respect to P on cylinder sets.

Theorem 4 (Measures → Martingales [Doo53, II§7 Ex. 3]). Let Q and P be two probability measures on (Σω,Fω) such that Q is absolutely continuous with respect to P on cylinder sets. Then the stochastic process (Xt)t∈N,

Xt(v) := Q(Γv1:t) / P(Γv1:t),

is a nonnegative P-martingale with E[Xt] = 1.

Theorem 5 (Martingales → Measures). Let P be a probability measure on (Σω,Fω) and let (Xt)t∈N be a nonnegative P-martingale with E[Xt] = 1. There is a probability measure Q on (Σω,Fω) that is absolutely continuous with respect to P on cylinder sets and for all v ∈ Σω and all t ∈ N with P(Γv1:t) > 0,

Xt(v) = Q(Γv1:t) / P(Γv1:t).


For completeness, we provide proofs for Theorem 4 and Theorem 5 in Appendix A.2.
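As an illustration of Theorems 4 and 5 (ours, with hypothetical helper names), the following Python sketch computes the likelihood ratio Xt = Q(Γv1:t)/P(Γv1:t) for two Bernoulli measures and checks the one-step martingale property E[Xt+1 | Ft] = Xt numerically.

    # Sketch: the ratio process Q/P for P = Bernoulli(2/3) and Q = Bernoulli(1/3).
    def cylinder_prob(u, p_one):
        prob = 1.0
        for s in u:
            prob *= p_one if s == '1' else 1 - p_one
        return prob

    def X(u, p_one=2/3, q_one=1/3):
        """Value of the nonnegative P-martingale Q/P on the cylinder of prefix u."""
        return cylinder_prob(u, q_one) / cylinder_prob(u, p_one)

    u = '0110'
    p0, p1 = 1/3, 2/3   # conditional probabilities of the next symbol under P
    # one-step martingale property: E[X_{t+1} | F_t] = X_t
    assert abs(p0 * X(u + '0') + p1 * X(u + '1') - X(u)) < 1e-12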

Remark 17 (Absolute continuity and absolute continuity on cylinder sets). A measure Q is called absolutely continuous with respect to P iff P(A) = 0 implies Q(A) = 0 for all measurable sets A ∈ Fω. Absolute continuity trivially implies absolute continuity on cylinder sets. However, the converse is not true: absolute continuity on cylinder sets is a strictly weaker condition than absolute continuity.

Let P be a Bernoulli(2/3) and Q be a Bernoulli(1/3) process. Formally, we fix Σ = {0, 1} and define for all u ∈ Σ∗,

P(Γu) := (2/3)^ones(u) (1/3)^zeros(u),   Q(Γu) := (1/3)^ones(u) (2/3)^zeros(u),

where ones(u) denotes the number of ones in u and zeros(u) denotes the number of zeros in u. Both measures P and Q are nonzero on all cylinder sets: Q(Γu) ≥ 3^{−|u|} > 0 and P(Γu) ≥ 3^{−|u|} > 0 for every u ∈ Σ∗. Therefore Q is absolutely continuous with respect to P on cylinder sets. However, Q is not absolutely continuous with respect to P: define

A := {v ∈ Σω | lim sup_{t→∞} (1/t) ones(v1:t) ≤ 1/2}.

The set A is Fω-measurable since A = ⋂_{n=1}^∞ ⋃_{u∈Un} Γu with Un := {u ∈ Σ∗ | |u| ≥ n and ones(u) ≤ |u|/2}, the set of all finite strings of length at least n that have at least as many zeros as ones. We have that P(A) = 0 and Q(A) = 1, hence Q is not absolutely continuous with respect to P.

While Theorem 4 trivially also holds if Q is absolutely continuous with respect to P, Theorem 5 does not imply that Q is absolutely continuous with respect to P. Consider the process X0(v) := 1,

Xt+1(v) := 2·Xt(v)   if vt+1 = 0,   and   Xt+1(v) := (1/2)·Xt(v)   if vt+1 = 1.

The process (Xt)t∈N is a nonnegative P -martingale since every Xt is Ft-measurable and for u = v1:t we have

E[Xt+1 | Ft](v) = P(Γu0 | Γu)·2Xt(v) + P(Γu1 | Γu)·(1/2)Xt(v) = (1/3)·2Xt(v) + (2/3)·(1/2)Xt(v) = Xt(v).

Moreover, for u = v1:t,

Q(Γu) = (1/3)^ones(u) (2/3)^zeros(u) = (2/3)^ones(u) (1/3)^zeros(u) · 2^{−ones(u)} · 2^{zeros(u)} = P(Γu) Xt(v).

Hence Xt(v) = Q(Γv1:t)/P(Γv1:t) P-almost surely. The measure Q is uniquely defined by its values on the cylinder sets, and as shown above, Q is not absolutely continuous with respect to P.
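The doubling process of this remark is easy to check numerically; the sketch below (ours) verifies Xt = 2^{zeros(u)−ones(u)} = Q(Γu)/P(Γu) on a sample prefix and illustrates that under P the process typically decays towards 0 even though E[Xt] = 1.

    import random

    def doubling_X(u):
        """The martingale of Remark 17: X doubles on a 0 and halves on a 1."""
        return 2.0 ** (u.count('0') - u.count('1'))

    def ratio(u):
        P = (2/3) ** u.count('1') * (1/3) ** u.count('0')
        Q = (1/3) ** u.count('1') * (2/3) ** u.count('0')
        return Q / P

    u = '0110010'
    assert abs(doubling_X(u) - ratio(u)) < 1e-12

    # Under P (ones appear with probability 2/3) the exponent zeros - ones drifts
    # towards -infinity, so X_t -> 0 on a typical sample path although E[X_t] = 1.
    rng = random.Random(0)
    path = ''.join('1' if rng.random() < 2/3 else '0' for _ in range(1000))
    print(doubling_X(path))   # typically an extremely small number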


3 Martingale Upcrossings

Fix c ∈ R and ε > 0, and let (Xt)t∈N be a martingale over the probability space (Σω,Fω, P). Let t1 < t2. We say the process (Xt)t∈N does an ε-upcrossing between t1 and t2 iff Xt1 ≤ c − ε and Xt2 ≥ c + ε. Similarly, we say (Xt)t∈N does an ε-downcrossing between t1 and t2 iff Xt1 ≥ c + ε and Xt2 ≤ c − ε. Except for the first upcrossing, consecutive upcrossings always involve intermediate downcrossings. Formally, we define the stopping times

T0(v) := 0,
T_{2k+1}(v) := inf{t > T_{2k}(v) | Xt(v) ≤ c − ε},   and
T_{2k+2}(v) := inf{t > T_{2k+1}(v) | Xt(v) ≥ c + ε}.

The T_{2k}(v) denote the indexes of upcrossings. We count the number of upcrossings with the random variable UXt(c − ε, c + ε), where

UXt(c − ε, c + ε)(v) := sup{k ≥ 0 | T_{2k}(v) ≤ t},

and UX(c − ε, c + ε) := sup_{t∈N} UXt(c − ε, c + ε) denotes the total number of upcrossings. We omit the superscript X if the martingale (Xt)t∈N is clear from context.

The following notation is used in the proofs. Given a monotone decreasing function f : N → [0, 1) and m, k ∈ N, we define the event E^{X,f}_{m,k} that there are at least k-many f(m)-upcrossings:

E^{X,f}_{m,k} := {v ∈ Σω | UX(1 − f(m), 1 + f(m))(v) ≥ k}.

For all m, k ∈ N we have E^{X,f}_{m,k} ⊇ E^{X,f}_{m,k+1} and E^{X,f}_{m,k} ⊆ E^{X,f}_{m+1,k}. Again, we omit X and f in the superscript if they are clear from context.
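For concreteness, the following Python sketch (ours) counts completed ε-upcrossings of a finite trajectory according to the stopping-time definition above.

    def count_upcrossings(path, c, eps):
        """Number of completed eps-upcrossings of `path` around level c:
        visits to a value <= c - eps followed later by a value >= c + eps."""
        upcrossings = 0
        below = False   # True once the path reached c - eps since the last upcrossing
        for x in path:
            if not below and x <= c - eps:
                below = True
            elif below and x >= c + eps:
                upcrossings += 1
                below = False
        return upcrossings

    print(count_upcrossings([1.0, 0.8, 1.3, 0.9, 0.7, 1.2], c=1.0, eps=0.2))  # prints 2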

4 Indefinitely Oscillating Martingales

In this section we construct a class of martingales that has a high probability of doing an infinite number of upcrossings. The magnitude of the upcrossings decreases at a rate of a given summable function f (a function f is called summable iff it has finite L1-norm, i.e., ∑_{i=1}^∞ f(i) < ∞), and the value of the martingale Xt oscillates back and forth between 1 − f(Mt) and 1 + f(Mt), where Mt denotes the number of upcrossings so far. The process has a monotone decreasing chance of escaping the oscillation. We need the following condition on the probability measure P.

Definition 18 (Perpetual Entropy). A probability measure P has perpetual entropy iff there is an ε > 0 such that for every u ∈ Σ∗ and v ∈ Σω with P(Γu) > 0 there is an a ∈ Σ and a t ∈ N with 1 − ε > P(Γuv1:ta | Γuv1:t) > ε.

This condition states that after seeing some string u ∈ Σ∗, there is always some future time point where there are two symbols that both have conditional probability greater than ε. In other words, observing data distributed according to P, we almost surely never run out of symbols with significant entropy. This is stronger than demanding that the observed string is nonconstant with high probability, because we get a single lower bound ε for all observed strings u.


Theorem 6 (An indefinitely oscillating martingale). Let 0 < δ < 1/2 and let f : N → [0, 1) be any monotone decreasing function such that ∑_{i=1}^∞ f(i) ≤ δ/2. For every probability measure P with perpetual entropy there is a nonnegative martingale (Xt)t∈N with E[Xt] = 1 and

P[∀m. U(1 − f(m), 1 + f(m)) ≥ m] ≥ 1 − δ.

Proof. By grouping symbols from Σ into two groups, we can without loss of generality assume that Σ = {0, 1}. Since P(Γu0 | Γu) + P(Γu1 | Γu) = 1, we can define a function a : Σ∗ → Σ that assigns to every string u ∈ Σ∗ a symbol au := a(u) such that pu := P(Γuau | Γu) ≤ 1/2. In Claim 7 we show that without loss of generality, we can group such that pu > ε infinitely often for some ε > 0.

In the following we define the stochastic process (Xt)t∈N. This process depends on the random variables Mt and γt, which are defined below. Let v ∈ Σω and t ∈ N be given and define u := v1:t. For t = 0, we set X0(v) := 1, and if pu = 0, we set Xt+1 = Xt. Otherwise we distinguish the following three cases.

(i) For Xt(v) ≥ 1:

Xt+1(v) := 1 − f(Mt(v))   if vt+1 ≠ au,
Xt+1(v) := Xt(v) + ((1 − pu)/pu)·(Xt(v) − (1 − f(Mt(v))))   if vt+1 = au.

(ii) For 1 > Xt(v) ≥ γt(v):

Xt+1(v) := Xt(v) − γt(v)   if vt+1 ≠ au,
Xt+1(v) := 1 + f(Mt(v))   if vt+1 = au.

(iii) For Xt(v) < γt(v) and Xt(v) < 1: let dt(v) := min{(pu/(1 − pu))·Xt(v), ((1 − pu)/pu)·γt(v) − 2f(Mt(v))};

Xt+1(v) := Xt(v) + dt(v)   if vt+1 ≠ au,
Xt+1(v) := Xt(v) − ((1 − pu)/pu)·dt(v)   if vt+1 = au.

The random variables Mt and γt are defined as

γt(v) := (pu/(1 − pu))·(1 + f(Mt(v)) − Xt(v)),
Mt(v) := 1 + arg max{m ∈ N | ∀k ≤ m. UXt(1 − f(k), 1 + f(k)) ≥ k},

i.e., Mt is 1 plus the number of upcrossings completed up to time t.

We give an intuition for the behavior of the process (Xt)t∈N. For all m, the following repeats. First Xt increases while reading au's until it reads one symbol that is not au and then jumps down to 1 − f(m). Subsequently, Xt decreases while not reading au's until it falls below γt or reads an au and then jumps up to 1 + f(m). If it falls below 1 and γt, then at every step, it can either jump up to 1 − f(m) or jump down to 0, whichever one is closest (the distance to the closest of the two is given by dt). See Figure 1 for a visualization.


Figure 1: An example evaluation of the martingale defined in the proof of Theorem 6. (The plot shows Xt against t, oscillating around 1 between the levels 1 − f(Mt) and 1 + f(Mt).)
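The construction above is straightforward to simulate. The following Python sketch (ours) specializes it to a measure where the designated symbol au has constant conditional probability pu ≤ 1/2 (as for a Bernoulli measure) and uses the summable schedule f(m) = δ/2^{m+1}; the tracking of completed upcrossings is simplified accordingly.

    import random

    def f(m, delta):
        return delta / 2 ** (m + 1)      # summable: the sum over m >= 1 equals delta / 2

    def simulate(steps=100_000, delta=0.4, p_u=1/3, seed=0):
        """Sketch of the process from the proof of Theorem 6 (cases (i)-(iii))."""
        rng = random.Random(seed)
        X, M, down = 1.0, 1, False       # M = 1 + number of completed upcrossings
        for _ in range(steps):
            fm = f(M, delta)
            a = rng.random() < p_u       # next symbol equals a_u with probability p_u
            if X >= 1:                                       # case (i)
                X = X + (1 - p_u) / p_u * (X - (1 - fm)) if a else 1 - fm
            else:
                gamma = p_u / (1 - p_u) * (1 + fm - X)
                if X >= gamma:                               # case (ii)
                    X = 1 + fm if a else X - gamma
                else:                                        # case (iii)
                    d = min(p_u / (1 - p_u) * X, 1 - fm - X)
                    X = X - (1 - p_u) / p_u * d if a else X + d
            if X <= 1 - fm:
                down = True
            if down and X >= 1 + fm:     # the M-th upcrossing is now complete
                M, down = M + 1, False
        return M - 1, X

    print(simulate())   # typically many completed upcrossings of shrinking magnitude

On most runs the process keeps oscillating within the shrinking band around 1; in the remaining runs (probability at most δ) it is instead absorbed at 0.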

For notational convenience, in the following we omit writing the argument v to the random variables Xt, γt, Mt, and dt.

Claim 1: (Xt)t∈N is a martingale. Each Xt+1 is Ft+1-measurable, since it uses only the first t + 1 symbols of v. Writing out cases (i), (ii), and (iii), we get

(i)   E[Xt+1 | Ft] = (1 − f(Mt))(1 − pu) + (Xt + ((1 − pu)/pu)(Xt − (1 − f(Mt)))) pu = Xt,
(ii)  E[Xt+1 | Ft] = (Xt − (pu/(1 − pu))((1 + f(Mt)) − Xt))(1 − pu) + (1 + f(Mt)) pu = Xt,
(iii) E[Xt+1 | Ft] = (Xt + dt)(1 − pu) + (Xt − ((1 − pu)/pu) dt) pu = Xt.

Claim 2: If Xt ≥ 1 − f(Mt) then Xt > γt. In this case

γt = (pu/(1 − pu))(1 + f(Mt) − Xt) ≤ 2(pu/(1 − pu)) f(Mt),

and thus with pu ≤ 1/2 and f(Mt) ≤ ∑_{k=1}^∞ f(k) ≤ δ/2 < 1/4 < 1/3,

Xt − γt ≥ 1 − f(Mt) − 2(pu/(1 − pu)) f(Mt) = 1 − ((1 + pu)/(1 − pu)) f(Mt) ≥ 1 − 3f(Mt) > 0.

Claim 3: If pu > 0, Xt < γt, and Xt < 1 then dt ≥ 0. We have (pu/(1 − pu))Xt ≥ 0 since pu > 0 and Xt ≥ 0. Moreover, ((1 − pu)/pu)γt − 2f(Mt) = 1 − f(Mt) − Xt > 0 by the contrapositive of Claim 2.

Claim 4: The following holds for cases (i), (ii), and (iii).

(a) In case (i): Xt+1 ≥ Xt or Xt+1 = 1 − f(Mt).

(b) In case (ii): Xt+1 ≤ Xt or Xt+1 = 1 + f(Mt).

(c) In case (iii): Xt < 1 − f(Mt) and Xt+1 ≤ 1 − f(Mt).

If pu = 0 then Xt+1 = Xt, so (a) and (b) hold trivially. Otherwise, for (a) we have (1 − pu)/pu > 0 and Xt ≥ 1 − f(Mt). For (b) we have γt > 0 since Xt < 1 + f(Mt). For (c), Xt < 1 − f(Mt) follows from the contrapositive of Claim 2. If pu > 0 then by Claim 3 we have dt ≥ 0 and hence Xt+1 ≤ Xt + dt ≤ Xt + (1 + f(Mt) − Xt) − 2f(Mt) = 1 − f(Mt).

Claim 5: Xt ≥ 0 and E[Xt] = 1. The latter follows from

E[Xt] = E[E[Xt | Ft−1]] = E[Xt−1] = . . . = E[X0] = 1.

Regarding the former, we use 0 ≤ f(Mt) < 1 to conclude

(i, ≠)   1 − f(Mt) ≥ 0,

(i, =)   ((1 − pu)/pu)(Xt − (1 − f(Mt))) ≥ 0 for Xt ≥ 1,

(ii, ≠)  Xt − γt ≥ 0 for Xt ≥ γt,

(ii, =)  1 + f(Mt) ≥ 0,

(iii, ≠) Xt + dt ≥ 0 since dt ≥ 0 by Claim 3, and

(iii, =) Xt − ((1 − pu)/pu) dt ≥ 0 since dt ≤ (pu/(1 − pu)) Xt.

Claim 6: Xt ≤ 1 − f(Mt) or Xt ≥ 1 + f(Mt) for all t ≥ T1. We use induction on t: the induction start holds with X_{T1} ≤ 1 − f(M_{T1}), and the induction step follows from Claim 4.

Claim 7: P({v ∈ Σω | pv1:t > ε for infinitely many t}) = 1 for some ε > 0. By assumption P has perpetual entropy; let ε be as in Definition 18. Define

A := {v ∈ Σω | P(Γv1:t) > 0 for all t}.

Its complement Ac = ⋃_{u∈Σ∗: P(Γu)=0} Γu is a countable union of null sets and therefore P(A) = 1. Let v ∈ A be some outcome, let t ∈ N be the current time step, and define u := v1:t. Because P has perpetual entropy and P(Γu) > 0 since v ∈ A, there exist u′ ∈ Σ∗, a ∈ Σ, and v′ ∈ Σω such that v = uu′av′ and 1 − ε > P(Γuu′a | Γuu′) > ε. If P(Γuu′a | Γuu′) ≤ 1/2 we can select auu′ := a; if P(Γuu′a | Γuu′) > 1/2 then, with abuse of notation, for the symbol group b := Σ \ {a} we have ε < P(Γuu′b | Γuu′) ≤ 1/2 and hence we can select auu′ := b. In either case puu′ > ε for a suitable grouping of symbols.

Claim 8: (Xt)t∈N converges almost surely to a random variable Xω ∈ {0, 1}. According to the Martingale Convergence Theorem [Dur10, Thm. 5.2.8], the process (Xt)t∈N converges almost surely to a random variable Xω. Assume that Xω attains some value xω other than 0 and 1. Pick an ε′ > 0 such that |xω| > 2ε′ and |1 − xω| > 2ε′. Since Xt → xω we have |xω − Xt| < ε′ for all but finitely many t, and hence there is a t0 ∈ N such that |Xt| > ε′ and |1 − Xt| > ε′ for all t ≥ t0. Recall that ε > 0 is fixed and depends only on P. Below we show for cases (i), (ii), and (iii) that |Xt+1 − Xt| > min{ε·ε′, ε′, 1/8} if pu > ε. By Claim 7 we almost surely have infinitely many t ≥ t0 with pu > ε, which is a contradiction to the fact that (Xt)t∈N converges almost surely.

(i) Assume Xt ≥ 1; then Xt > 1 + ε′ by assumption. Either Xt+1 = 1 − f(Mt) ≤ 1 < Xt − ε′, or Xt+1 = Xt + ((1 − pu)/pu)(Xt − 1 + f(Mt)) > Xt + (Xt − 1 + f(Mt)) ≥ Xt + (Xt − 1) > Xt + ε′, because pu ≤ 1/2 implies (1 − pu)/pu ≥ 1.

(ii) Assume γt ≤ Xt < 1; then ε′ < Xt < 1 − ε′. Either Xt+1 = 1 + f(Mt) ≥ 1 > Xt + ε′, or Xt+1 = Xt − γt and thus Xt − Xt+1 = γt = (pu/(1 − pu))(1 + f(Mt) − Xt) > ε(1 + f(Mt) − Xt) ≥ ε(1 − Xt) > εε′.

(iii) Assume Xt < γt and Xt < 1; then, since 0 ≤ Xt by Claim 5, ε′ < Xt < γt and Xt < 1 − ε′. Either dt = (pu/(1 − pu))Xt > εε′ and we are done, or dt = ((1 − pu)/pu)γt − 2f(Mt). If Xt ≥ 5/8 then dt > ((1 − pu)/pu)Xt − 2f(Mt) > Xt − 1/2 ≥ 1/8, since f(Mt) ≤ δ/2 < 1/4. If Xt < 5/8 then dt = 1 − f(Mt) − Xt > 3/4 − Xt > 1/8. Hence either Xt+1 − Xt = dt > min{εε′, 1/8} or Xt − Xt+1 = ((1 − pu)/pu) dt ≥ dt > min{εε′, 1/8}.


Claim 9: For all m ∈ N, if Em,m−1 ≠ ∅ then P(Em,m | Em,m−1) ≥ 1 − 2f(m). Let v ∈ Em,m−1 and let t0 ∈ N be a time step such that exactly m − 1 upcrossings have been completed up to time t0, i.e., Mt0(v) = m. The subsequent downcrossing is completed eventually with probability 1: we are in case (i) and in every step there is a chance of 1 − pu ≥ 1/2 of completing the downcrossing. Therefore we assume without loss of generality that the downcrossing has been completed, i.e., that t0 is such that Xt0(v) = 1 − f(m). We will bound the probability p := P(Em,m | Em,m−1) that Xt rises above 1 + f(m) after t0 to complete the m-th upcrossing.

Define the stopping time T : Σω → N ∪ {ω},

T(v) := inf{t ≥ t0 | Xt(v) ≥ 1 + f(m) ∨ Xt(v) = 0},

and define the stochastic process Yt = 1 + f(m) − X_{min{t0+t,T}}. Because (X_{min{t0+t,T}})t∈N is a martingale, (Yt)t∈N is a martingale. By definition, Xt always stops at 1 + f(m) before exceeding it, thus XT ≤ 1 + f(m), and hence (Yt)t∈N is nonnegative. The Optional Stopping Theorem yields E[Y_{T−t0} | Ft0] ≤ E[Y0 | Ft0] and thus E[XT | Ft0] ≥ E[Xt0 | Ft0] = 1 − f(m). We show that XT ∈ {0, 1 + f(m)} almost surely. If T is finite then this holds by definition of T. If T = ω then the random variable XT is defined as the limit limt→∞ Xt. By Claim 8 the limit XT ∈ {0, 1}, and according to Claim 6 we have Xt ≤ 1 − f(Mt) for all t ∈ N, so Xt cannot converge to 1. We conclude that

1 − f(m) ≤ E[XT | Ft0] = (1 + f(m)) · p + 0 · (1 − p),

hence P(Em,m | Em,m−1) = p ≥ 1 − f(m)(1 + p) ≥ 1 − 2f(m).

Claim 10: Em+1,m = Em,m and Em+1,m+1 ⊆ Em,m. By definition of Mt, the i-th upcrossing of the process (Xt)t∈N is between 1 − f(i) and 1 + f(i). The function f is monotone decreasing, and by Claim 6 the process (Xt)t∈N does not assume values between 1 − f(i) and 1 + f(i). Therefore the first m f(m + 1)-upcrossings are also f(m)-upcrossings, i.e., Em+1,m ⊆ Em,m. By definition of Em,k we have Em+1,m ⊇ Em,m and Em+1,m+1 ⊆ Em+1,m.

Claim 11: P(Em,m) ≥ 1 − ∑_{i=1}^m 2f(i). For m = 0 this holds trivially since P(E0,0) = 1. Using Claim 9 and Claim 10 we conclude inductively

P(Em,m) = P(Em,m ∩ Em,m−1) = P(Em,m | Em,m−1) P(Em,m−1)
        = P(Em,m | Em,m−1) P(Em−1,m−1)
        ≥ (1 − 2f(m)) (1 − ∑_{i=1}^{m−1} 2f(i)) ≥ 1 − ∑_{i=1}^m 2f(i).

From Claim 10 it follows that ⋂_{i=1}^m Ei,i = Em,m and therefore P(⋂_{i=1}^∞ Ei,i) = lim_{m→∞} P(Em,m) ≥ 1 − ∑_{i=1}^∞ 2f(i) ≥ 1 − δ.

Theorem 6 gives a uniform lower bound on the probability for many upcrossings: it states the probability of the event that for all m ∈ N, U(1 − f(m), 1 + f(m)) ≥ m holds. This is a lot stronger than the nonuniform bound P[U(1 − f(m), 1 + f(m)) ≥ m] ≥ 1 − δ for all m ∈ N: the quantifier is inside the probability statement.

As an immediate consequence of Theorem 6, we get the following uniform lower bound on the expected number of upcrossings.


Corollary 7 (Expected Upcrossings). Let 0 < δ < 1/2 and let f : N → [0, 1) be any monotone decreasing function such that ∑_{i=1}^∞ f(i) ≤ δ/2. For every probability measure P with perpetual entropy there is a nonnegative martingale (Xt)t∈N with E[Xt] = 1 and for all m ∈ N,

E[U(1 − f(m), 1 + f(m))] ≥ m(1 − δ).

Proof. From Theorem 6 and Markov's inequality: E[U(1 − f(m), 1 + f(m))] ≥ m · P[U(1 − f(m), 1 + f(m)) ≥ m] ≥ m(1 − δ).

By choosing a specific slowly decreasing but summable function f, we get the following concrete results.

Corollary 8 (Concrete lower bound). Let 0 < δ < 1/2. For every probability measure P with perpetual entropy there is a nonnegative martingale (Xt)t∈N with E[Xt] = 1 such that

P[∀ε > 0. U(1 − ε, 1 + ε) ∈ Ω(1/(ε (ln(1/ε))²))] ≥ 1 − δ   and
E[U(1 − ε, 1 + ε)] ∈ Ω(1/(ε (ln(1/ε))²)).

Moreover, for all ε < 0.015 we get E[U(1 − ε, 1 + ε)] > δ(1 − δ)/(ε (ln(1/ε))²) and

P[∀ε < 0.015. U(1 − ε, 1 + ε) > δ/(ε (ln(1/ε))²)] ≥ 1 − δ.

Proof. Define

g : (0, e^{−2}] → [0, ∞),   ε ↦ 2δ (1/(ε (ln ε)²) − e²/4).

We have g(e^{−2}) = 0, limε→0 g(ε) = ∞, and

dg/dε(ε) = 2δ (−1/(ε²(ln ε)²) + (−2)/(ε²(ln ε)³)) = −2δ(2 + ln ε)/(ε²(ln ε)³) < 0 on (0, e^{−2}).

Therefore the function g is strictly monotone decreasing and hence invertible. Choose f := g^{−1}. Using the substitution t = g(ε), dt = (dg/dε)(ε) dε,

∑_{t=1}^∞ f(t) ≤ ∫_0^∞ f(t) dt = ∫_{g^{−1}(0)}^{g^{−1}(∞)} f(g(ε)) (dg/dε)(ε) dε
= 2δ (∫_{e^{−2}}^0 −1/(ε(ln ε)²) dε + ∫_{e^{−2}}^0 −2/(ε(ln ε)³) dε)
= 2δ ([1/ln ε]_{e^{−2}}^0 + [1/(ln ε)²]_{e^{−2}}^0) = 2δ (1/2 − 1/4) = δ/2.

Now we apply Theorem 6 and Corollary 7 to m := g(ε) and get

P[U(1 − ε, 1 + ε) ≥ 2δ (1/(ε(ln ε)²) − e²/4)] ≥ 1 − δ,   and
E[U(1 − ε, 1 + ε)] ≥ 2δ(1 − δ) (1/(ε(ln ε)²) − e²/4).

For ε < 0.015, we have 1/(ε(ln ε)²) > e²/2, hence g(ε) > δ/(ε(ln ε)²).
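A quick numeric sanity check of this proof (ours, not part of the paper) evaluates f = g^{-1} by bisection, confirms that the partial sums of f stay below δ/2, and confirms the inequality 1/(ε(ln ε)²) > e²/2 used for ε < 0.015.

    import math

    delta = 0.2
    g = lambda eps: 2 * delta * (1 / (eps * math.log(eps) ** 2) - math.e ** 2 / 4)

    def f(t, lo=1e-300, hi=math.e ** -2):
        """Evaluate f = g^{-1} by bisection (g is strictly decreasing on (0, e^-2])."""
        for _ in range(200):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if g(mid) > t else (lo, mid)
        return (lo + hi) / 2

    partial_sum = sum(f(t) for t in range(1, 2000))
    print(partial_sum, "<", delta / 2)                        # partial sums stay below delta/2
    eps = 0.014
    print(1 / (eps * math.log(eps) ** 2) > math.e ** 2 / 2)   # True, as used above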

The concrete bounds given in Corollary 8 are not the asymptotically optimal ones: there are summable functions that decrease even more slowly. For example, we could multiply the function g with the factor √(ln(1/ε)) (which still is not optimal).


5 Martingale Upper Bounds

In this section we state upper bounds on the probability and expectation of many upcrossings (Dubins' Inequality and Doob's Upcrossing Inequality). We use the construction from the previous section to show that these bounds are asymptotically tight. Moreover, with the following theorem we show that the uniform lower bound on the probability of many upcrossings guaranteed in Theorem 6 is also asymptotically tight.

Every function f is either summable or not. If f is summable, then we can scale it with a constant factor such that its sum is smaller than δ/2, and then apply the construction of Theorem 6. If f is not summable, the following theorem implies that there is no uniform lower bound on the probability of having at least m-many f(m)-upcrossings.

Theorem 9 (Upper bound on upcrossing rate). Let f : N → [0, 1) be a monotone decreasing function such that ∑_{t=1}^∞ f(t) = ∞. For every probability measure P and for every nonnegative P-martingale (Xt)t∈N with E[Xt] = 1,

P[∀m. U(1 − f(m), 1 + f(m)) ≥ m] = 0.

Proof. Define the events Dm := ⋃_{i=1}^m E^c_{i,i}, so that D^c_m = ⋂_{i=1}^m Ei,i = {∀i ≤ m. U(1 − f(i), 1 + f(i)) ≥ i}. Then Dm ⊆ Dm+1. Assume there is a constant c > 0 such that c ≤ P(D^c_m) = P(⋂_{i=1}^m Ei,i) for all m. Let m ∈ N, v ∈ D^c_m, and pick t0 ∈ N such that the process X0(v), . . . , Xt0(v) has completed i-many f(i)-upcrossings for all i ≤ m and Xt0(v) ≤ 1 − f(m + 1). If Xt(v) ≥ 1 + f(m + 1) for some t ≥ t0, the (m+1)-st upcrossing for f(m + 1) is completed and thus v ∈ Em+1,m+1. Define the stopping time T : Σω → (N ∪ {ω}),

T(v) := inf{t ≥ t0 | Xt(v) ≥ 1 + f(m + 1)}.

According to the Optional Stopping Theorem applied to the process (Xt)t≥t0, the random variable XT is almost surely well-defined and E[XT | Ft0] ≤ E[Xt0 | Ft0] = Xt0. This yields 1 − f(m + 1) ≥ Xt0 ≥ E[XT | Ft0], and by taking the expectation E[ · | Xt0 ≤ 1 − f(m + 1)] on both sides,

1 − f(m + 1) ≥ E[XT | Xt0 ≤ 1 − f(m + 1)] ≥ (1 + f(m + 1)) P[XT ≥ 1 + f(m + 1) | Xt0 ≤ 1 − f(m + 1)]

by Markov's inequality. Therefore

P(Em+1,m+1 | D^c_m) = P[XT ≥ 1 + f(m + 1) | Xt0 ≤ 1 − f(m + 1)] · P[Xt0 ≤ 1 − f(m + 1) | D^c_m]
≤ P[XT ≥ 1 + f(m + 1) | Xt0 ≤ 1 − f(m + 1)]
≤ (1 − f(m + 1))/(1 + f(m + 1)) ≤ 1 − f(m + 1).

Together with c ≤ P(D^c_m) we get

P(Dm+1 \ Dm) = P(E^c_{m+1,m+1} ∩ D^c_m) = P(E^c_{m+1,m+1} | D^c_m) P(D^c_m) ≥ f(m + 1) c.

This is a contradiction because ∑_{i=1}^∞ f(i) = ∞:

1 ≥ P(Dm+1) ≥ P(⨄_{i=1}^m (Di+1 \ Di)) = ∑_{i=1}^m P(Di+1 \ Di) ≥ c ∑_{i=1}^m f(i + 1) → ∞ as m → ∞.

Therefore the assumption P(D^c_m) ≥ c for all m is false, and hence we get

P[∀m. U(1 − f(m), 1 + f(m)) ≥ m] = P(⋂_{i=1}^∞ Ei,i) = lim_{m→∞} P(D^c_m) = 0.

By choosing a specific decreasing non-summable function f for Theorem 9, we get that U(1 − ε, 1 + ε) ∉ Ω(1/(ε log(1/ε))) P-almost surely.

Corollary 10 (Concrete upper bound). Let P be a probability measure and let (Xt)t∈N be a nonnegative martingale with E[Xt] = 1. Then for all a, b > 0,

P[∀ε > 0. U(1 − ε, 1 + ε) ≥ a/(ε log(1/ε)) − b] = 0.

Proof. We proceed analogously to the proof of Corollary 8. Define

g : (0, c] → [g(c), ∞),   ε ↦ a/(ε ln(1/ε)) − b

with c < 1/e and g(c) ≥ 1. We have limε→0 g(ε) = ∞ and

dg/dε(ε) = −a/(ε² ln(1/ε)) + a/(ε² (ln(1/ε))²) = −a (ln(1/ε) − 1)/(ε² (ln(1/ε))²) < 0 on (0, c].

Therefore the function g is strictly monotone decreasing and hence invertible. Choose f := g^{−1}. Using the substitution t = g(ε), dt = (dg/dε)(ε) dε,

∑_{t=1}^∞ f(t) ≥ ∫_{g(c)}^∞ f(t) dt = ∫_c^{g^{−1}(∞)} f(g(ε)) (dg/dε)(ε) dε
= ∫_c^0 −a/(ε ln(1/ε)) dε + ∫_c^0 a/(ε (ln(1/ε))²) dε
= ∫_{−ln c}^∞ (a/u) du + [a/ln(1/ε)]_c^0
= [a ln u]_{−ln c}^∞ + 0 − a/ln(1/c) = ∞.

Now we apply Theorem 9 to m := g(ε).

Theorem 11 (Dubins' Inequality [Dub62, Thm. 13.1]). For every nonnegative P-martingale (Xt)t∈N, for every c > ε > 0, and for every k ∈ N,

P[U(c − ε, c + ε) ≥ k] ≤ ((c − ε)/(c + ε))^k · E[min{X0/(c − ε), 1}].

Dubins’ Inequality immediately yields the following bound on the probabilityof the number of upcrossings.

P[U(1 − f(m), 1 + f(m)) ≥ k] ≤ ((1 − f(m))/(1 + f(m)))^k.

The construction from Theorem 6 shows that this bound is asymptotically tight for m = k → ∞ and δ → 0: define the monotone decreasing function f : N → [0, 1),

f(t) := δ/(2m) if t ≤ m,   and   f(t) := 0 otherwise.   (1)


Then the martingale from Theorem 6 yields the lower bound

P[U(1 − δ/(2k), 1 + δ/(2k)) ≥ k] ≥ 1 − δ,

while Dubins' Inequality gives the upper bound

P[U(1 − δ/(2k), 1 + δ/(2k)) ≥ k] ≤ ((1 − δ/(2k))/(1 + δ/(2k)))^k = (1 − 2δ/(2k + δ))^k → exp(−δ) as k → ∞.

As δ approaches 0, the value of exp(−δ) approaches 1 − δ (but exceeds it since exp is convex). For δ = 0.2 and m = k = 3, the difference between the two bounds is already lower than 0.021.
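The numerical comparison mentioned above can be reproduced with a few lines of Python (ours):

    delta, k = 0.2, 3
    lower = 1 - delta                                             # Theorem 6
    upper = ((1 - delta / (2 * k)) / (1 + delta / (2 * k))) ** k  # Dubins' Inequality
    print(lower, upper, upper - lower)   # 0.8, about 0.8186, difference about 0.019 < 0.021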

The following theorem places an upper bound on the rate of expected upcrossings. In Appendix A.3 we discuss different versions of this inequality and prove this inequality tight.

Theorem 12 (Doob's Upcrossing Inequality [Xu12]). Let (Xt)t∈N be a submartingale. For every c ∈ R and ε > 0,

E[Ut(c − ε, c + ε)] ≤ (1/(2ε)) E[max{c − ε − Xt, 0}].

Asymptotically, Doob's Upcrossing Inequality states that with ε → 0,

E[U(1 − ε, 1 + ε)] ∈ O(1/ε).

Again, we can use the construction of Theorem 6 to show that these asymptotics are tight: let f be as in (1). Then for δ = 1/2, Corollary 7 yields a martingale fulfilling the lower bound

E[U(1 − 1/(4m), 1 + 1/(4m))] ≥ m/2,

and Doob's Upcrossing Inequality gives the upper bound

E[U(1 − 1/(4m), 1 + 1/(4m))] ≤ 2m,

which differs by a factor of 4. In Theorem 23 we show that Doob's Upcrossing Inequality can also be made exactly tight.

The lower bound for the expected number of upcrossings given in Corollary 7 is a little looser than the upper bound given in Doob's Upcrossing Inequality. Closing this gap remains an open problem. We know by Theorem 9 that given a non-summable function f, the uniform probability for many f(m)-upcrossings goes to 0. However, this does not necessarily imply that expectation also tends to 0; low probability might be compensated for by high value. So for expectation there might be a lower bound larger than Corollary 7, an upper bound smaller than Doob's Upcrossing Inequality, or both.

If we drop the requirement that the rate of upcrossings be uniform, Doob's Upcrossing Inequality is the best upper bound we can give: using the little-o notation, assume there is a smaller upper bound g(m) ∈ o(m) such that for every martingale process (Xt)t∈N,

E[U(1 − 1/m, 1 + 1/m)] ∈ o(g(m)). (2)


In the following we sketch how to construct a martingale that violates this bound. Define f(m) := g(m)/m; then f(m) → 0 as m → ∞, so there is an infinite sequence (mi)i∈N such that ∑_{i=0}^∞ f(mi) ≤ 1. We define the martingale process (Xt)t∈N such that it picks an i ∈ N with probability f(mi), and then becomes a martingale that makes Doob's Upcrossing Inequality tight for upcrossings between 1 − 1/mi and 1 + 1/mi: for every i, we apply the construction of Theorem 23. This would give the following lower bound on the expected number of upcrossings for each i:

∀i. E[U(1 − 1/mi, 1 + 1/mi)] ≥ mi f(mi) = g(mi).

Since there are infinitely many mi, we get a contradiction to (2). Using a similar argument, we can show that nonuniformly, Dubins' bound is also the best we can get.

6 Application to the MDL Principle

Let M be a countable set of probability measures on (Σω,Fω), called environment class. Let K : M → [0, ∞] be a function such that ∑_{Q∈M} 2^{−K(Q)} ≤ 1, called complexity function on M. Following notation in [Hut09], we define for u ∈ Σ∗ the minimal description length model as

MDLu := arg min_{Q∈M} {− log Q(Γu) + K(Q)}.

That is, − log Q(Γu) is the (arithmetic) code length of u given model Q, and K(Q) is a complexity penalty for Q, also called regularizer. Given data u ∈ Σ∗, MDLu is the measure Q ∈ M that minimizes the total code length of data and model.
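A direct transcription of the MDL operator into Python (ours; base-2 code lengths and the model-class representation are our assumptions) looks as follows:

    import math

    def mdl(u, models):
        """Select the model minimizing -log2 Q(Gamma_u) + K(Q).
        `models` is a list of (name, cylinder_probability, K) triples."""
        return min(models, key=lambda m: -math.log2(m[1](u)) + m[2])[0]

    # hypothetical two-element class M = {P, Q} with K(P) = K(Q) = 1,
    # as used later in the proof of Corollary 13
    P = lambda u: (2/3) ** u.count('1') * (1/3) ** u.count('0')
    Q = lambda u: (1/3) ** u.count('1') * (2/3) ** u.count('0')
    print(mdl('011', [('P', P, 1), ('Q', Q, 1)]))   # 'P', since P assigns '011' more mass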

The following corollary of Theorem 6 states that in some cases the limit limt→∞ MDLv1:t does not exist with high probability.

Corollary 13 (MDL may not converge). Let P be a probability measure on the measurable space (Σω,Fω) with perpetual entropy. For any 0 < δ < 1/2, there is a set of probability measures M containing P, a complexity function K : M → [0, 1], and a measurable set Z ∈ Fω with P(Z) ≥ 1 − δ such that for all v ∈ Z, the limit limt→∞ MDLv1:t does not exist.

Proof. Fix some positive monotone decreasing summable function f (e.g., the one given in Corollary 8). Let (Xt)t∈N be the P-martingale process from Theorem 6. By Theorem 5 there is a probability measure Q on (Σω,Fω) such that

Xt(v) = Q(Γv1:t) / P(Γv1:t)

P-almost surely. Choose M := {P, Q} with K(P) := K(Q) := 1. From the definition of MDL and Q it follows that

Xt(u) < 1 ⟺ Q(Γu) < P(Γu) ⟹ MDLu = P,   and
Xt(u) > 1 ⟺ Q(Γu) > P(Γu) ⟹ MDLu = Q.


For Z := ⋂_{m=1}^∞ E^{X,f}_{m,m}, Theorem 6 yields

P(Z) = P[∀m. U(1 − f(m), 1 + f(m)) ≥ m] ≥ 1 − δ.

For each v ∈ Z, the measure MDLv1:t alternates between P and Q indefinitely, and thus its limit does not exist.

Crucial to the proof of Corollary 13 is that not only does the process Q/P oscillate indefinitely, it oscillates around the constant exp(K(Q) − K(P)) = 1. This implies that the MDL estimator may keep changing indefinitely, and thus it is inductively inconsistent.

7 Bounds on Mind Changes

Suppose we are testing a hypothesis H ⊆ Σω on a stream of data v ∈ Σω. Let P(H | Γv1:t) denote our belief in H at time t ∈ N after seeing the evidence v1:t. By Bayes' rule,

P(H | Γv1:t) = P(H) P(Γv1:t | H) / P(Γv1:t) =: Xt(v).

Since Xt is a constant multiple of P( · | H)/P and P( · | H) is a probability measure on (Σω,Fω) that is absolutely continuous with respect to P on cylinder sets, the process (Xt)t∈N is a P-martingale with respect to the filtration (Ft)t∈N by Theorem 4. By definition, (Xt)t∈N is bounded between 0 and 1.

Let α > 0. We are interested in the question how likely it is to often change one's mind about H by at least α, i.e., what is the probability for Xt = P(H | Γv1:t) to decrease and subsequently increase m times by at least α. Formally, we define the stopping times T′0,ν(v) := 0,

T′_{2k+1,ν}(v) := inf{t > T′_{2k,ν}(v) | Xt(v) ≤ X_{T′_{2k,ν}(v)}(v) − να},
T′_{2k+2,ν}(v) := inf{t > T′_{2k+1,ν}(v) | Xt(v) ≥ X_{T′_{2k+1,ν}(v)}(v) + να},

and T′k := min{T′k,ν | ν ∈ {−1, +1}}. (In Davis' notation, X_{T′0,ν}, X_{T′1,ν}, . . . is an α-alternating W-sequence for ν = 1 and an α-alternating M-sequence for ν = −1 [Dav13, Def. 4].) For any t ∈ N, the random variable

AXt(α)(v) := sup{k ≥ 0 | T′k(v) ≤ t}

is defined as the number of α-alternations up to time t. Let AX(α) := sup_{t∈N} AXt(α) denote the total number of α-alternations.
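The following Python sketch (ours) counts completed α-alternations of a finite trajectory; the parameter down_first selects between Davis' W- and M-sequences, and the paper's A(α) corresponds to the more favorable of the two starting directions.

    def alternations(path, alpha, down_first=True):
        """Count moves of size >= alpha relative to the value at the previous
        completed move, with directions alternating (down, up, down, ...)."""
        ref, going_down, count = path[0], down_first, 0
        for x in path[1:]:
            if going_down and x <= ref - alpha:
                ref, going_down, count = x, False, count + 1
            elif not going_down and x >= ref + alpha:
                ref, going_down, count = x, True, count + 1
        return count

    print(alternations([0.5, 0.2, 0.55, 0.1, 0.6], alpha=0.3))   # 4 alternations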

Setting α = 2ε, the α-alternations differ from ε-upcrossings in three ways: first, for upcrossings, the process decreases below c − ε, then increases above c + ε, and then repeats. For alternations, the process may overshoot c − ε or c + ε and thus change the bar for the subsequent alternations, causing a 'drift' in the target bars over time. Second, for α-alternations the initial value of the martingale is relevant. Third, one upcrossing corresponds to two alternations, since one upcrossing always involves a preceding downcrossing. See Figure 2.

To apply our bounds for upcrossings on α-alternations, we use the following lemma by Davis. We reinterpret it as stating that every bounded martingale process (Xt)t∈N can be modified into a martingale (Yt)t∈N such that the probability for many α-alternations is not decreased and the number of alternations equals the number of upcrossings plus the number of downcrossings. A sketch of the proof can be found in Appendix A.4.

Figure 2: This example process has two upcrossings between c − α/2 and c + α/2 (completed at the time steps of the vertical orange bars) and four α-alternations (completed when crossing the horizontal blue bars). (The plot shows Xt against t with the levels c − α/2, c, and c + α/2.)

Lemma 14 (Upcrossings and alternations [Dav13, Lem. 9]). Let (Xt)t∈N be a martingale with 0 ≤ Xt ≤ 1. There exists a martingale (Yt)t∈N with 0 ≤ Yt ≤ 1 and a constant c ∈ (α/2, 1 − α/2) such that for all t ∈ N and for all k ∈ N,

P[AXt(α) ≥ 2k] ≤ P[AYt(α) ≥ 2k] = P[UYt(c − α/2, c + α/2) ≥ k].

Theorem 15 (Upper bound on alternations). For every martingale process (Xt)t∈N with 0 ≤ Xt ≤ 1,

P[A(α) ≥ 2k] ≤ ((1 − α)/(1 + α))^k.

Proof. We apply Lemma 14 to (Xt)t∈N and (1 − Xt)t∈N to get the processes (Yt)t∈N and (Zt)t∈N. Dubins' Inequality yields

P[AXt(α) ≥ 2k] ≤ P[UYt(c+ − α/2, c+ + α/2) ≥ k] ≤ ((c+ − α/2)/(c+ + α/2))^k =: g(c+)   and
P[A^{1−X}_t(α) ≥ 2k] ≤ P[UZt(c− − α/2, c− + α/2) ≥ k] ≤ ((c− − α/2)/(c− + α/2))^k = g(c−)

for some c+, c− ∈ (α/2, 1 − α/2). Because Lemma 14 is symmetric for (Xt)t∈N and (1 − Xt)t∈N, we have c+ = 1 − c−. Since P[AXt(α) ≥ 2k] = P[A^{1−X}_t(α) ≥ 2k] by the definition of AXt(α), we have that both are less than min{g(c+), g(c−)} = min{g(c+), g(1 − c+)}. This is maximized for c+ = c− = 1/2 because g is strictly monotone increasing for c > α/2. Therefore

P[AXt(α) ≥ 2k] ≤ ((1/2 − α/2)/(1/2 + α/2))^k = ((1 − α)/(1 + α))^k.

Since this bound is independent of t, it also holds for P[AX(α) ≥ 2k].


The bound of Theorem 15 is the square root of the bound derived by Davis [Dav13, Thm. 10 & Thm. 11]:

P[A(α) ≥ 2k] ≤ ((1 − α)/(1 + α))^{2k}. (3)

This bound is tight [Dav13, Cor. 13]. A similar bound for upcrossings was proved by Dubins [Dub72, Cor. 1].

Because 0 ≤ Xt ≤ 1, the process (1 − Xt)t∈N is also a nonnegative martingale, hence the same upper bounds apply to it. This explains why the result in Theorem 15 is worse than Davis' bound (3): Dubins' bound applies to all nonnegative martingales, while Davis' bound uses the fact that the process is bounded from below and above. For unbounded nonnegative martingales, downcrossings are 'free' in the sense that one can make a downcrossing almost surely successful (as done in the proof of Theorem 6). If we apply Dubins' bound to the process (1 − Xt)t∈N, we get the same probability bound for the downcrossings of (Xt)t∈N (which are upcrossings of (1 − Xt)t∈N). Multiplying both bounds yields Davis' bound (3); however, we still require a formal argument why the upcrossing and downcrossing bounds are independent.

The following corollary to Theorem 15 derives an upper bound on the expected number of α-alternations.

Theorem 16 (Upper bound on expected alternations). For every martingale (Xt)t∈N with 0 ≤ Xt ≤ 1, the expectation E[A(α)] ≤ 1/α.

Proof. By Theorem 15 we have P[A(α) ≥ 2k] ≤ ((1 − α)/(1 + α))^k, and thus

E[A(α)] = ∑_{k=1}^∞ P[A(α) ≥ k]
        = P[A(α) ≥ 1] + ∑_{k=1}^∞ (P[A(α) ≥ 2k] + P[A(α) ≥ 2k + 1])
        ≤ 1 + ∑_{k=1}^∞ 2 P[A(α) ≥ 2k]
        ≤ 1 + 2 ∑_{k=1}^∞ ((1 − α)/(1 + α))^k = 1 + 2 · (1 − α)/(2α) = 1/α.
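The geometric series in the last step can be checked numerically (our sketch):

    alpha = 0.25
    r = (1 - alpha) / (1 + alpha)
    print(1 + 2 * r / (1 - r), 1 / alpha)   # both 4.0: 1 + 2*sum_{k>=1} r^k = 1/alpha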

We now apply the technical results of this section to the martingale process Xt = P(H | Γv1:t), our belief in the hypothesis H as we observe data. The probability of changing our mind k times by at least α decreases exponentially with k (Theorem 15). Furthermore, the expected number of times we change our mind by at least α is bounded by 1/α (Theorem 16). In other words, changing one's mind by a large amount many times is unlikely.

Because in this section we consider martingales that are bounded between 0 and 1, the lower bounds from Section 4 do not apply here. While for the martingales constructed in Theorem 6, the number of 2α-alternations and the number of α-up- and downcrossings coincide, these processes are not bounded. However, we can give a similar construction that is bounded between 0 and 1 and makes Davis' bound asymptotically tight.


8 Conclusion

We constructed an indefinitely oscillating martingale process from a summable function f. Theorem 6 and Corollary 7 give uniform lower bounds on the probability and expectation of the number of upcrossings of decreasing magnitude. In Theorem 9 we proved the corresponding upper bound if the function f is not summable. In comparison, Doob's Upcrossing Inequality and Dubins' Inequality give upper bounds that are not uniform. In Section 5 we showed that for a certain summable function f, our martingales make these bounds asymptotically tight as well.

Our investigation of indefinitely oscillating martingales was motivated by two applications. First, in Corollary 13 we showed that the minimum description length operator may not exist in the limit: for any probability measure P we can construct a probability measure Q such that Q/P oscillates forever around the specific constant that causes limt→∞ MDLv1:t to not converge.

Second, we derived bounds for the probability of changing one's mind about a hypothesis H when observing a stream of data v ∈ Σω. The probability P(H | Γv1:t) is a martingale and in Theorem 15 we proved that the probability of changing the belief in H often by at least α decreases exponentially.

A question that remains open is whether there is a uniform upper bound on the expected number of upcrossings tighter than Doob's Upcrossing Inequality.

References

[Dav13] Ernest Davis. Bounding changes in probability over time: It is unlikely that you will change your mind very much very often. Technical report, 2013. https://cs.nyu.edu/davise/papers/dither.pdf.

[Doo53] Joseph L. Doob. Stochastic Processes. Wiley, New York, 1953.

[Dub62] Lester E. Dubins. Rises and upcrossings of nonnegative martingales. Illinois Journal of Mathematics, 6(2):226–241, 1962.

[Dub72] Lester E. Dubins. Some upcrossing inequalities for uniformly bounded martingales. Symposia Mathematica, IX:169–177, 1972.

[Dur10] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 4th edition, 2010.

[Gru07] Peter D. Grünwald. The Minimum Description Length Principle. The MIT Press, Cambridge, 2007.

[Hut09] Marcus Hutter. Discrete MDL predicts in total variation. In Advances in Neural Information Processing Systems 22 (NIPS'09), pages 817–825, Cambridge, MA, USA, 2009. Curran Associates.

[LH14] Jan Leike and Marcus Hutter. Indefinitely oscillating martingales. In Proc. 25th International Conf. on Algorithmic Learning Theory (ALT'14), pages 321–335. Springer, 2014.

[LS05] Wei Luo and Oliver Schulte. Mind change efficient learning. In Proc. 18th Annual Conference on Learning Theory (COLT'05), volume 3559 of LNAI, pages 398–412, Bertinoro, Italy, 2005. Springer.

[PH05] Jan Poland and Marcus Hutter. Asymptotics of discrete MDL for online prediction. IEEE Transactions on Information Theory, 51(11):3780–3795, November 2005.

[Ris78] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

[RW94] L. Chris G. Rogers and David Williams. Diffusions, Markov Processes, and Martingales: Volume 1, Foundations. Cambridge University Press, 2nd edition, 1994.

[Wal05] Christopher S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Springer, Berlin, 2005.

[WB68] Christopher S. Wallace and David M. Boulton. An information measure for classification. Computer Journal, 11(2):185–194, August 1968.

[Xu12] Weijun Xu. Martingale convergence theorems. Technical report, 2012. http://people.maths.ox.ac.uk/xu/Martingale_convergence.pdf.


A Appendix

A.1 Notation

• := denotes a definition.

• Ac := Σω \A denotes the complement of a measurable set A ⊆ Σω.

• ⊎ denotes disjoint union.

• For a set X, the power set of X is denoted by 2X .

• 1X is the characteristic function for a set X, i.e., 1X(x) = 1 if x ∈ X and 0 otherwise.

• ω is the smallest infinite ordinal.

• N is the set of natural numbers.

• R is the set of real numbers.

• For a, b ∈ R, [a, b] denotes the closed interval with end points a and b; (a, b] and [a, b) denote half-open intervals and (a, b) denotes an open interval.

• The set Σ denotes a finite alphabet. The set of all finite strings of length n is denoted Σn, the set of all finite strings is denoted Σ∗, and the set of all infinite strings is denoted Σω.

• For a string u ∈ Σ∗, |u| denotes the length of u.

• For v ∈ Σω, v1:k denotes the first k characters of v.

• f ∈ Ω(g) denotes g ∈ O(f), i.e., ∃k > 0 ∃x0 ∀x ≥ x0. g(x) · k ≤ f(x).

• f ∈ o(g) denotes lim_{x→∞} f(x)/g(x) = 0.

A.2 Measures and Martingales

In this section we prove Theorem 4 and Theorem 5, establishing the connection between measures on infinite strings and martingales.

Proof of Theorem 4. Xt is only undefined if P(Γv1:t) = 0. The set

{v ∈ Σω | ∃t. P(Γv1:t) = 0}

has P-measure 0 and hence (Xt)t∈N is well-defined almost everywhere. Xt is constant on Γu for all u ∈ Σt, and Ft is generated by a collection of finitely many disjoint sets:

Σω = ⨄_{u∈Σt} Γu.

(a) Therefore Xt is Ft-measurable.


(b) Γu = ⨄_{a∈Σ} Γua for all u ∈ Σt and v ∈ Γu, and therefore

E[Xt+1 | Ft](v) = (1/P(Γu)) ∑_{a∈Σ} Xt+1(ua) P(Γua) = (1/P(Γu)) ∑_{a∈Σ} (Q(Γua)/P(Γua)) P(Γua)
(∗)= (1/P(Γu)) ∑_{a∈Σ} Q(Γua) = Q(Γu)/P(Γu) = Xt(v).

At (∗) we used the fact that Q is absolutely continuous with respect to P on cylinder sets. (If Q were not absolutely continuous with respect to P on cylinder sets, there are cases where P(Γu) > 0, P(Γua) = 0, and Q(Γua) ≠ 0. Therefore Xt+1(ua) does not contribute to the expectation and thus Xt+1(ua)P(Γua) = 0 ≠ Q(Γua).)

P ≥ 0 and Q ≥ 0 by definition, thus Xt ≥ 0. Since P(Γε) = Q(Γε) = 1, we have E[X0] = 1.

The following lemma gives a convenient condition for the existence of a measure on (Σω,Fω). It is a special case of the Daniell-Kolmogorov Extension Theorem [RW94, Thm. 26.1].

Lemma 19 (Extending measures). Let q : Σ∗ → [0, 1] be a function such that q(ε) = 1 and ∑_{a∈Σ} q(ua) = q(u) for all u ∈ Σ∗. Then there exists a unique probability measure Q on (Σω,Fω) such that q(u) = Q(Γu) for all u ∈ Σ∗.

To prove this lemma, we need the following two ingredients.

Definition 20 (Semiring). A set R ⊆ 2Ω is called semiring over Ω iff

(a) ∅ ∈ R,

(b) for all A,B ∈ R, the set A ∩B ∈ R, and

(c) for all A, B ∈ R, there are pairwise disjoint sets C1, . . . , Cn ∈ R such that A \ B = ⨄_{i=1}^n Ci.

Theorem 21 (Caratheodory's Extension Theorem [Dur10, Thm. A.1.1]). Let R be a semiring over Ω and let µ : R → [0, 1] be a function such that

(a) µ(Ω) = 1 (normalization),

(b) µ(⨄_{i=1}^n Ai) = ∑_{i=1}^n µ(Ai) for pairwise disjoint sets A1, . . . , An ∈ R such that ⨄_{i=1}^n Ai ∈ R (finite additivity), and

(c) µ(⋃_{i≥0} Ai) ≤ ∑_{i≥0} µ(Ai) for any collection (Ai)i≥0 such that each Ai ∈ R and ⋃_{i≥0} Ai ∈ R (σ-subadditivity).

Then there is a unique extension µ̄ of µ that is a probability measure on (Ω, σ(R)) such that µ̄(A) = µ(A) for all A ∈ R.

Proof of Lemma 19. We show the existence of Q using Caratheodory's Extension Theorem. Define R := {Γu | u ∈ Σ∗} ∪ {∅}.

(a) ∅ ∈ R.

(b) For any Γu,Γv ∈ R, either


• u is a prefix of v and Γu ∩ Γv = Γv ∈ R, or

• v is a prefix of u and Γu ∩ Γv = Γu ∈ R, or

• Γu ∩ Γv = ∅ ∈ R.

(c) For any Γu,Γv ∈ R,

• Γu \ Γv = ⨄_{w∈Σ^{|v|−|u|} \ {x}} Γuw if v = ux, i.e., u is a prefix of v, and

• Γu \ Γv = ∅ if v is a prefix of u, and Γu \ Γv = Γu ∈ R otherwise.

Therefore R is a semiring. By definition of R, we have σ(R) = Fω. The function q : Σ∗ → [0, 1] naturally gives rise to a function µ : R → [0, 1] with µ(∅) := 0 and µ(Γu) := q(u) for all u ∈ Σ∗. We will now check the prerequisites of Caratheodory's Extension Theorem.

(a) (Normalization.) µ(Σω) = µ(Γε) = q(ε) = 1.

(b) (Finite additivity.) Let Γu1, . . . , Γuk ∈ R be pairwise disjoint sets such that Γw := ⨄_{i=1}^k Γui ∈ R. Let ℓ := max{|ui| | 1 ≤ i ≤ k}; then Γw = ⨄_{v∈Σ^{ℓ−|w|}} Γwv. By assumption, ∑_{a∈Σ} q(ua) = q(u), thus ∑_{a∈Σ} µ(Γua) = µ(Γu) and inductively we have

µ(Γui) = ∑_{s∈Σ^{ℓ−|ui|}} µ(Γuis), (4)

and

µ(Γw) = ∑_{v∈Σ^{ℓ−|w|}} µ(Γwv). (5)

For every string v ∈ Σ^{ℓ−|w|}, the concatenation wv ∈ Γw = ⨄_{i=1}^k Γui, so there is a unique i such that wv ∈ Γui. Hence there is a unique string s ∈ Σ^{ℓ−|ui|} such that wv = uis. Together with (4) and (5) this yields

µ(⨄_{i=1}^k Γui) = µ(Γw) = ∑_{v∈Σ^{ℓ−|w|}} µ(Γwv) = ∑_{i=1}^k ∑_{s∈Σ^{ℓ−|ui|}} µ(Γuis) = ∑_{i=1}^k µ(Γui).

(c) (σ-subadditivity.) We will show that each Γu is compact with respect to the topology O generated by R. σ-subadditivity then follows from (b) because every countable union is in fact a finite union.

We will show that the topology O is the product topology of the discrete topology on Σ. (This establishes that (Σω, O) is a Cantor space.) Every projection πk : Σω → Σ selecting the k-th symbol is continuous, since π_k^{−1}(a) = ⋃_{u∈Σ^{k−1}} Γua for every a ∈ Σ. Moreover, O is the coarsest topology with this property, since we can generate every open set Γu ∈ R in the base of the topology by

Γu = ⋂_{i=1}^{|u|} π_i^{−1}(ui).

The set Σ is finite and thus compact. By Tychonoff's Theorem, Σω is also compact. Therefore Γu is compact since it is homeomorphic to Σω via the canonical map βu : Σω → Γu, v ↦ uv.


From (a), (b), and (c), Caratheodory's Extension Theorem yields a unique probability measure Q on (Σω,Fω) such that Q(Γu) = µ(Γu) = q(u) for all u ∈ Σ∗.

Using Lemma 19, the proof of Theorem 5 is now straightforward.

Proof of Theorem 5. We define a function q : Σ∗ → R with

q(u) := X|u|(v) P(Γu)

for any v ∈ Γu. The choice of v is irrelevant because X|u| is constant on Γu since it is F|u|-measurable. In the following, we also write Xt(u) if |u| = t to simplify notation.

The function q is non-negative because Xt and P are both non-negative. Moreover, for any u ∈ Σt,

1 = E[Xt] = ∫_{Σω} Xt dP ≥ ∫_{Γu} Xt dP = P(Γu) Xt(u) = q(u).

Hence the range of q is a subset of [0, 1]. We have q(ε) = X0(ε) P(Γε) = E[X0] = 1 since P is a probability measure and F0 = {∅, Σω} is the trivial σ-algebra. Let u ∈ Σt. Then

∑_{a∈Σ} q(ua) = ∑_{a∈Σ} Xt+1(ua) P(Γua) = ∫_{Γu} Xt+1 dP = ∫_{Γu} E[Xt+1 | Ft] dP = ∫_{Γu} Xt dP = P(Γu) Xt(u) = q(u).

By Lemma 19, there is a probability measure Q on (Σω,Fω) such that q(u) = Q(Γu) for all u ∈ Σ∗. Therefore, for all v ∈ Σω and for all t ∈ N with P(Γv1:t) > 0,

Xt(v) = q(v1:t)/P(Γv1:t) = Q(Γv1:t)/P(Γv1:t).

Moreover, Q is absolutely continuous with respect to P on cylinder sets since P(Γu) = 0 implies

Q(Γu) = q(u) = X|u|(u) P(Γu) = 0.

A.3 Different Upcrossing inequalities and their tightness

There are different versions of the upcrossing inequality in circulation. Let a < b and let (Xt)t∈N be a martingale process. Doob [Doo53, VII§3 Thm. 3.3] states

E[Ut(a, b)] ≤ (1/(b − a)) E[max{Xt − a, 0}]. (6)

Durrett [Dur10, Thm. 5.2.7] gives a slightly stronger version:

E[Ut(a, b)] ≤ 1b−a

(E[maxXt − a, 0]− E[maxX0 − a, 0]

). (7)

We will prove tight the version stated in Theorem 12 [Xu12, Thm. 1.1]:

E[Ut(a, b)] ≤ 1b−aE[maxa−Xt, 0]. (8)



For nonnegative martingales we can estimate E[max{a − X_t, 0}] ≤ a to get a bound independent of t from the upcrossing inequality (8). To get a bound independent of t from (6) or (7), we look at the upcrossings of the martingale process (−X_t)_{t∈N}, which are the downcrossings of (X_t)_{t∈N}. The number of downcrossings differs from the number of upcrossings by at most 1, so we can conclude from (7) that

    E[U_t^X(a, b)] ≤ E[U_t^{−X}(−b, −a)] + 1 ≤ 1/(b−a) · (E[max{a − X_t, 0}] − E[max{a − X_0, 0}]) + 1.

The origin of the diversity in upcrossing inequalities stems from the details of their proofs. When we start betting every time the process (X_t)_{t∈N} falls below a and stop every time it rises above b, our gain at time t is at least (b − a)·U_t(a, b) plus some amount R that we gained or lost since we last started betting, in case the last upcrossing has not yet been completed. Because we are betting on a martingale, our expected gain is zero, hence (b − a)·E[U_t(a, b)] ≤ E[−R]. The right-hand sides of the inequalities (6), (7), and (8) arise from the way we estimate R from below. The inequality (8) estimates R by taking into account any possible losses while ignoring gains since we last started betting at a. Conversely, (6) estimates R by taking into account any possible gains while ignoring losses since we started betting at a. In (7) we additionally suppose that we start betting at time 0 and take into account any losses before X_t falls below a for the first time.
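To see the bounds side by side on a concrete example, the following Monte-Carlo sketch (an arbitrary nonnegative martingale of our choosing, not one of the processes constructed in this paper; upcrossings are counted in the standard way, by first reaching a value ≤ a and then a value ≥ b) estimates E[U_t(a, b)] together with the right-hand sides of (6) and (8).

import random

def count_upcrossings(path, a, b):
    """Completed upcrossings of [a, b]: reach a value <= a, then >= b."""
    count, below = 0, False
    for x in path:
        if not below and x <= a:
            below = True
        elif below and x >= b:
            count, below = count + 1, False
    return count

def sample_path(T, rng):
    """A simple nonnegative martingale: start at 1 and multiply by 0.5 or
    1.5 with equal probability (expected factor 1)."""
    x, path = 1.0, [1.0]
    for _ in range(T):
        x *= rng.choice((0.5, 1.5))
        path.append(x)
    return path

a, b, T, n = 0.5, 1.0, 100, 20_000
rng = random.Random(0)
ups = rhs6 = rhs8 = 0.0
for _ in range(n):
    path = sample_path(T, rng)
    ups += count_upcrossings(path, a, b)
    rhs6 += max(path[-1] - a, 0.0)   # integrand of (6)
    rhs8 += max(a - path[-1], 0.0)   # integrand of (8)

print("estimate of E[U_t(a,b)]      :", ups / n)
print("bound (6), E[(X_t-a)^+]/(b-a):", rhs6 / n / (b - a))
print("bound (8), E[(a-X_t)^+]/(b-a):", rhs8 / n / (b - a))

For this particular process both right-hand sides comfortably exceed the estimate; Lemma 22 below characterizes processes for which (8) is attained with equality.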

Lemma 22 (Tightness Criterion for (8)). Let a < b and let (X_t)_{t∈N} be a martingale such that

(a) Xt does not assume any values between a and b, and

(b) all upcrossings are completed at b and all downcrossings are completed at a:

    X_{T_{2k}} = b and X_{T_{2k+1}} = a for all k ∈ N.

Then the inequality (8) is tight, i.e.,

    E[U_t(a, b)] = 1/(b−a) · E[max{a − X_t, 0}].

Proof. This proof essentially follows the proof of Doob's Upcrossing Inequality given in [Xu12]. Define the process

    D_t(v) := ∑_{k=1}^∞ (X_{min{t, T_{2k}}}(v) − X_{min{t, T_{2k−1}}}(v)).

Since all but finitely many terms in the infinite sum are zero, D_t is well-defined. The process (D_t)_{t∈N} is a martingale:

    E[D_{t+1} | F_t] = ∑_{k=1}^∞ (E[X_{min{t+1, T_{2k}}} | F_t] − E[X_{min{t+1, T_{2k−1}}} | F_t]).

Fix some i ∈ N. Conditioning on F_t, we know whether T_i > t or T_i ≤ t, since T_i is a stopping time. In case T_i > t we have t + 1 ≤ T_i, implying X_{min{t+1, T_i}} = X_{t+1} and thus E[X_{min{t+1, T_i}} | F_t] = E[X_{t+1} | F_t] = X_t = X_{min{t, T_i}}, because (X_t)_{t∈N} is a martingale. In case T_i ≤ t we have X_{min{t+1, T_i}} = X_{T_i} and hence E[X_{min{t+1, T_i}} | F_t] = E[X_{T_i} | F_t] = X_{T_i} = X_{min{t, T_i}}. In both cases we get E[X_{min{t+1, T_i}} | F_t] = X_{min{t, T_i}}, and therefore E[D_{t+1} | F_t] = D_t.

Let t ∈ N be some time step and fix v ∈ Σω. Let U_t := U_t(a, b) denote the number of upcrossings that have been completed up to time t. We distinguish the following two cases.

(i) There is an incomplete upcrossing, T_{2U_t+1} ≤ t < T_{2U_t+2}.

(ii) There is no incomplete upcrossing, T_{2U_t} ≤ t < T_{2U_t+1}.

In case (i) we have X_t < b and therefore X_t ≤ a by assumption (a). With assumption (b) we get

    D_t = ∑_{k=1}^{U_t} (X_{T_{2k}} − X_{T_{2k−1}}) + X_t − X_{T_{2U_t+1}} = ∑_{k=1}^{U_t} (b − a) + X_t − a = (b − a)·U_t + X_t − a.    (9)

In case (ii) we have X_t > a. With assumption (b) we get

    D_t = ∑_{k=1}^{U_t} (X_{T_{2k}} − X_{T_{2k−1}}) = ∑_{k=1}^{U_t} (b − a) = (b − a)·U_t.    (10)

From (9) and (10) it follows that

    D_t = (b − a)·U_t + min{X_t − a, 0}.

We have D_0 = 0, and since (D_t)_{t∈N} is a martingale it follows that E[D_t] = 0. Hence

    (b − a)·E[U_t] = E[−min{X_t − a, 0}] = E[max{a − X_t, 0}].
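As a concrete illustration of the criterion (this is only a toy example of ours, not the construction used in Theorem 23 below): consider a Markov martingale on the four values {0, a, b, c} with 0 < a < b < c that starts at a, jumps from a to b with probability a/b and to the absorbing value 0 otherwise, and jumps from b back to a with probability (c − b)/(c − a) and to the absorbing value c otherwise. One checks directly that both moves have conditional expectation equal to the current value, that no value strictly between a and b is ever taken, and that upcrossings and downcrossings complete exactly at b and a. The Python sketch below estimates both sides of (8) for this process and should find them equal up to Monte-Carlo error.

import random

A, B, C = 1.0, 2.0, 4.0          # levels with 0 < A < B < C
P_UP = A / B                     # from A: jump to B w.p. A/B, else to 0
P_DOWN = (C - B) / (C - A)       # from B: jump to A w.p. P_DOWN, else to C

def step(x, rng):
    """One step of the toy Markov martingale; 0 and C are absorbing."""
    if x == A:
        return B if rng.random() < P_UP else 0.0
    if x == B:
        return A if rng.random() < P_DOWN else C
    return x

def upcrossings(path, a, b):
    """Completed upcrossings of [a, b]: reach a value <= a, then >= b."""
    count, below = 0, False
    for x in path:
        if not below and x <= a:
            below = True
        elif below and x >= b:
            count, below = count + 1, False
    return count

T, N = 50, 50_000
rng = random.Random(0)
lhs = rhs = 0.0
for _ in range(N):
    x, path = A, [A]
    for _ in range(T):
        x = step(x, rng)
        path.append(x)
    lhs += upcrossings(path, A, B)
    rhs += max(A - path[-1], 0.0) / (B - A)

print("E[U_t(A, B)]                 ~", lhs / N)
print("E[max{A - X_t, 0}] / (B - A) ~", rhs / N)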

Theorem 23 (Tightness of Doob's Upcrossing Inequality). Let P be a probability measure with perpetual entropy. For all b > a > 0 there is a nonnegative martingale (X_t)_{t∈N} with X_0 = a that makes Doob's Upcrossing Inequality tight for all t > 0:

    0 < E[U_t^X(a, b)] = 1/(b−a) · E[max{a − X_t, 0}].

We added the requirement E[U_t^X(a, b)] > 0 because otherwise the constant process X_t = a would trivially make the inequality tight.

Proof. Fix c := (a + b)/2 and set f(t) := (b − a)/(b + a) = (b − a)/(2c); then f(t) < 1 because a > 0. Define X_0 := X_1 := a/c. The function f is not summable, but we nonetheless apply the same construction as in Theorem 6: for t > 1 let X_t be defined as in the proof of Theorem 6. We prove that the scaled process Y_t := c · X_t makes the inequality (8) tight. Since (X_t)_{t∈N} does upcrossings between 1 − f(M_t) = a/c and 1 + f(M_t) = b/c, the scaled process (Y_t)_{t∈N} does upcrossings between a and b.

By Claim 1, (X_t)_{t∈N} is a martingale process, and by Claim 5, X_t ≥ 0; hence this also applies to the scaled process (Y_t)_{t∈N}. We check the criterion given in Lemma 22.



(a) This holds for X_t for t = 0 and t = 1 according to the definition of (X_t)_{t∈N}. For t > 1 it follows from Claim 6, since T_1 = 1 because X_1 = a/c.

(b) From Claim 4, this is fulfilled in cases (i) and (ii). In case (iii) we have X_{t+1} ≤ 1 − f(M_t), so the process cannot do an (up-)crossing.

It remains to show that E[U_t^Y(a, b)] > 0. By Claim 2, X_0 > γ_0, so for all v ∈ Σω with v_0 = a_ε we have X_1 = 1 + f(M_t); therefore U_1^X(a/c, b/c)(v) ≥ 1. Since P(Γ_{a_ε}) > 0 by assumption, this yields E[U_1^X(a/c, b/c)] > 0 and hence E[U_t^Y(a, b)] > 0 for all t > 1.

The process from Theorem 23 also gives a tightness result as t → ∞. A weaker lower bound E[U^X(a, b)] ≥ (a+b)/(8(b−a)) − 1/2 can be derived directly from Corollary 7 using δ := 1/2,

    f(i) := (b−a)/(b+a) if i ≤ (b+a)/(4(b−a)), and f(i) := 0 otherwise,

and scaling the process with (b + a)/2.

Corollary 24 (Asymptotic tightness of Doob's Upcrossing Inequality). Let P be a probability measure with perpetual entropy. For all b > a > 0 there is a nonnegative martingale (X_t)_{t∈N} with X_0 = a such that

    E[U^X(a, b)] = a/(b−a).

Proof. Consider the process (Y_t)_{t∈N} from the proof of Theorem 23. Since (X_t)_{t∈N} is a nonnegative martingale, the Martingale Convergence Theorem [Dur10, Thm. 5.2.8] implies that (X_t)_{t∈N} converges almost surely to a limit X_ω ≥ 0. This limit can only be 0 or 1 by Claim 8. Since f(t) = (b−a)/(2c) > 0 for all t, (X_t)_{t∈N} does not converge to 1 by Claim 6 (T_1 = 1 by construction). Thus X_ω = 0 almost surely, but this generally does not imply lim_{t→∞} E[X_t] = 0 [Dur10, Ex. 5.2.3]. However, max{a − Y_t, 0} = max{a − cX_t, 0} is bounded and therefore uniformly integrable. By [Dur10, Thm. 5.5.2] (a generalization of the dominated convergence theorem),

    lim_{t→∞} E[max{a − Y_t, 0}] = E[max{a − cX_ω, 0}] = a.    (11)

By Dubins' Inequality, P[U_t^Y(a, b) ≥ i] ≤ a^i b^{−i}, hence

    E[U_t^Y(a, b) · 1{U_t^Y(a, b) ≥ k}] = k·P[U_t^Y(a, b) ≥ k] + ∑_{i=k+1}^∞ P[U_t^Y(a, b) ≥ i] ≤ k·a^k b^{−k} + a^{k+1} b^{−k−1} · b/(b−a) → 0 as k → ∞, uniformly in t,

hence the family (U_t^Y(a, b))_{t∈N} is uniformly integrable, and by the same theorem [Dur10, Thm. 5.5.2] and (11),

    (b − a)·E[U^Y(a, b)] = lim_{t→∞} (b − a)·E[U_t^Y(a, b)] = lim_{t→∞} E[max{a − Y_t, 0}] = a.
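For comparison, the toy process from the sketch after Lemma 22 already approaches the same limit when its absorbing level c grows (again only an illustration of ours, not the process constructed above): the first upcrossing requires one jump from a to b (probability a/b), and every further upcrossing additionally requires a return from b to a (probability p = (c−b)/(c−a)), so P[U ≥ k] = (a/b)^k p^{k−1} and E[U] = (a/b)/(1 − p·a/b), which tends to a/(b−a) as c → ∞.

# Closed-form expected number of upcrossings for the toy martingale on
# {0, a, b, c}: from a jump to b w.p. a/b (else absorb at 0); from b jump
# back to a w.p. p = (c - b)/(c - a) (else absorb at c).
a, b = 1.0, 2.0
print("Doob bound a/(b-a) =", a / (b - a))
for c in (4.0, 10.0, 100.0, 10_000.0):
    p = (c - b) / (c - a)
    expected_upcrossings = (a / b) / (1 - p * a / b)
    print(f"c = {c:>8}: E[U(a, b)] = {expected_upcrossings:.6f}")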

The process from Corollary 24 can also be used to show that Dubins' Inequality is tight. For a specific underlying probability measure, a proof of this is sketched by Dubins [Dub62, Thm. 12.1]. We prove a version that is agnostic with respect to the probability measure P.



Corollary 25 (Tightness of Dubins' Inequality). Let P be a probability measure with perpetual entropy. For all b > a > 0 there is a nonnegative martingale (X_t)_{t∈N} with X_0 = a that makes Dubins' Inequality tight:

    P[U^X(a, b) ≥ k] = a^k / b^k for all k ∈ N.

Proof. We apply Dubins' Inequality to the process (X_t)_{t∈N} from Corollary 24:

    E[U^X(a, b)] = ∑_{k=1}^∞ P[U^X(a, b) ≥ k] ≤ ∑_{k=1}^∞ a^k b^{−k} = a/(b − a) = E[U^X(a, b)],

so the involved inequalities must in fact be equalities.

A.4 Davis’ Lemma

We do not reproduce Davis' proof in detail. It needs to be adapted to the martingale setting, which is quite cumbersome to do. Below we give an outline of the proof.

Proof sketch for Lemma 14. This proof relies on the observation that the probability that X_t rises (falls) by at least α does not decrease as α decreases. Formally, we argued in the proof of Theorem 9 that P(X_T ≥ x + α | X_t = x) ≤ x/(x + α) using the Optional Stopping Theorem. Since (X_t)_{t∈N} is bounded by 1 from above, the same argument can be carried out for the process (1 − X_t)_{t∈N}, giving an analogous bound P(X_T ≤ x − α | X_t = x) ≤ (1 − x)/(1 − x + α) when (X_t)_{t∈N} decreases. These bounds are tight.

The idea of the proof is to define a martingale process (Z_t)_{t∈N}; the process defined by Y_t := X_t + Z_t is then a martingale. We need to show that (Y_t)_{t∈N} has the desired properties: 0 ≤ Y_t ≤ 1, and the probability that (X_t)_{t∈N} has at least 2k alternations of magnitude 2α does not exceed the probability that (Y_t)_{t∈N} has at least k upcrossings of magnitude α.

There are two sources of misalignment between 2α-alternations and α-upcrossings; we consider them in turn.

First, drift: if the martingale (X_t)_{t∈N} overshoots the target and becomes larger than X_{T′_{2k+1}} + α or smaller than X_{T′_{2k}} − α, it changes the target of subsequent alternations. Without loss of generality, consider the first case. Suppose we are in time step t, have observed u ∈ Σ^t, and 2k + 1 alternations have been completed, i.e., T′_{2k+1} ≤ t < T′_{2k+2}. By observing a symbol a ∈ Σ, we would have X_{t+1}(ua) ≥ X_{T′_{2k+1}}(u) + α with a possible overshoot γ := X_{t+1}(ua) − (X_{T′_{2k+1}}(u) + α). To compensate, we set Z_{t+1}(ua) := Z_t(u) − γ and Z_{t+1}(ub) ≥ Z_t(u) appropriately for b ∈ Σ \ {a} such that Z_t fulfills the martingale condition (b) of Definition 2. Removing the overshoots from the martingale makes upcrossings and alternations coincide, i.e., A_t^Y(α) = 2U_t^Y(c − α/2, c + α/2) for a suitable constant c, which we discuss below. According to the aforementioned observation, the new martingale is at least as likely to complete the alternation as the old one, since we have reduced the distance that needs to be traveled.

Second, the initial value Y_0. Let c ∈ (α/2, 1 − α/2) and define Z_0 := c + α/2 − X_0. The constant c denotes the center of the alternations, i.e., Y_t alternates between c − α/2 and c + α/2, since Y_0 = X_0 + Z_0 = c + α/2. What value should we assign to c? Since we only care about cases where the number of alternations is even, c = 1/2 maximizes the probability of successful upcrossings [Dav13, Lem. 7]. This intuitively makes sense: there is an equal number of up- and downcrossings, and the probability of each of them being successful depends on the process's distance from 0 or 1, respectively.

At this point we have a martingale process (Y_t)_{t∈N} that is bounded between 0 and 1, and upcrossings and alternations coincide: A_t^Y(α) = 2U_t^Y((1−α)/2, (1+α)/2). It remains to show that the probability of at least k alternations has not decreased compared to the process (X_t)_{t∈N}. By construction, this is already the case for single down- and upcrossings. However, there could be cases where the drift that we removed from the process would cause us to move to a region where successful alternations are more likely. But since we centered the process optimally, this is not possible.

There is one other technical problem that we glossed over: we have to make sure that the process (Y_t)_{t∈N} neither falls below 0 nor exceeds 1, so we stop the process at these points. Moreover, if the process Y_t reaches 1 + Z_t or Z_t (i.e., X_t hits 1 or 0) but its value is in (0, 1), then instead of stopping it we switch to a random walk until we 'get back on track'. ♦
