A Project Report on

ESTIMATION OF CHANGE POINT OF DISTRIBUTION IN A BIT-STREAM USING UNIVERSAL COMPRESSION

Department of Electrical Engineering

Indian Institute of Technology Kanpur

April 2013

Shouvik Ganguly (Y9558)

Project guide: Dr. R.K. Bansal, Professor, Department of Electrical Engineering, IIT Kanpur

ACKNOWLEDGEMENTS

First and foremost, I am grateful to Dr. RK Bansal for his able guidance and support throughout my BTech project. Every time I have been stuck anywhere in the work, he has provided insights, observations, suggestions and references which have invariably proved valuable.

I am thankful to the committee members - Dr. Adrish Bannerjee, Dr. Aditya K Jagannatham, Dr. Ketan Rajawat, Dr. RK Bansal, and Dr. Amit Mitra, for providing an attentive audience to my presentation and for providing valuable criticism.

In particular, I thank Dr. Adrish Bannerjee Sir for suggesting the problem of multiple changes, out of which a section of this report was born. I thank Dr. Aditya K Jagannatham Sir for providing valuable suggestions on how to improve the legibility of simulations.

In working through the second part of my BTech project, I benefited immensely from several discussions with Mr. Siddharth Jain. He provided valuable theoretical insights and helped me a lot when I was stuck.

I am indebted to my parents and my sister for supporting me through every stage in life. Without my family’s support, I wouldn’t be where I am today.

I thank all my friends and wing-mates at IIT Kanpur, who helped fill the emptiness of living so far away from my family.

Lastly, I want to thank whoever contributed directly or indirectly to this project.

CONTENTS

Acknowledgements
Abstract
Introduction
Johnson-Sejdinovic-Cruise-Ganesh-Piechocki Algorithm: A Simulation Study
A Study of Redundancy Rates
A New Algorithm to Estimate Change Point
Simulation Studies
Ziv’s Empirical Entropy
Convergence of the Estimator: A Justification
Simulation Studies with LZ77
Another Version of the Algorithm
Simulation Studies for the Modified Version
Detection of Change in Text
Detection of Multiple Changes
Future Scope
References

ABSTRACT

The Project focuses on estimation of the change point in a fixed sample framework through Universal Compression, without any knowledge of source statistics, with only the stipulation that each underlying distribution (before and after change) is stationary and ergodic. In the first part of the Project, we perform simulation studies on a String Matching Algorithm to estimate the change point. In the second part, we examine the redundancy rates of some well-known compression algorithms and propose an algorithm to estimate the change point. The underlying principle of the algorithm is to approximate the log likelihood of a sequence through universal compression. We perform simulation studies on this algorithm for change from i.i.d. to i.i.d., i.i.d. to Markov, Markov to i.i.d. and Markov to Markov. Next we provide some theoretical justification for the consistency of this algorithm using the concept of Empirical Entropy defined in [15]. We also perform some more simulations using a different universal algorithm. We then state a different version of the algorithm which can be more useful in certain situations. In particular, the new version can detect multiple change-points, even when the number of change-points is not known. We perform simulation studies on this new version and use it to detect a change in the author of a portion of prose. Finally, we perform some simulations for the multiple change-point case.

INTRODUCTION

The problem of detecting and/or estimating sudden changes in data has vast applications across different fields, ranging from quality control, remote sensing, detection of fraud in accounting and stock markets, to medical applications.

When one tries to detect a change point in an incoming stream of data, it is the sequential change detection problem, studied first by Neyman and Pearson [1], and followed up by renowned statisticians like Lorden [2], Wald [3] and so on. When all the data is already available and one analyses it to determine the change point, if any, it is the fixed sample problem. When, in addition, it is already known that the data does have a change point somewhere, the problem reduces to one of estimation of the change point.

The change point problem can be classified in another way as parametric or non-parametric. When both the distributions before and after the change are known explicitly, it is a parametric change point detection/estimation problem. The simplest and most widely known parametric test for change point is the likelihood ratio test, developed by Neyman and Pearson [1]. When the distributions are not known explicitly, their forms cannot be used for detection/estimation, so we have to take recourse to a non-parametric approach.

The problem we address is that of change point estimation in the fixed sample and non-parametric framework.

Given a random process $X_1^N \equiv \{X_m\}_{m=1}^{N}$ which takes values from a finite set $\chi$, let $p_0$ be the distribution underlying the first $n$ letters and $p_1$ be the distribution underlying the subsequent letters.

The fixed sample change estimation problem is to find an estimator $\hat{n}$ for the change point $n$, or equivalently, to find an estimator $\hat{\gamma}$ for the relative change point $\gamma = \frac{n}{N}$.

JOHNSON-SEJDINOVIC-CRUISE-GANESH-PIECHOCKI ALGORITHM

The objective of this part was to perform some simulation studies on the JSCGP Algorithm for change point estimation in the fixed sample framework. This algorithm was developed in a 2011 paper by Oliver Johnson, Dino Sejdinovic, James Cruise, Ayalvadi Ganesh and Robert Piechocki [4].

Some Definitions

Definition 1: For a sequence $x$ of length $n$, we define the match length at position $i$ of the sequence as

$$L_i^n(x) = \min\{L : x_k^{k+L-1} \neq x_i^{i+L-1} \text{ for all } k \neq i,\ 1 \le k \le n\}$$

Physically, it represents the length of the shortest substring starting at position $i$ that does not occur anywhere else in the sequence.

Definition 2: $S_i^n$ is the set of the positions of the match at $i$:

$$S_i^n = \{j : x_i^{i+L_i^n-2} = x_j^{j+L_i^n-2},\ 1 \le j \le n,\ j \neq i\}$$

$T_i^n = \mathrm{rand}(S_i^n)$, i.e., $T_i^n$ is chosen randomly from amongst the elements of $S_i^n$.

The Algorithm: We form a directed graph by linking each $i$ to the corresponding $T_i^n$. For each $j$ such that $1 \le j \le n$, we define the crossings functions $C_{LR}(j)$ and $C_{RL}(j)$ as

$$C_{LR}(j) = \#\{k : k < j \le T_k^n\} \quad \text{and} \quad C_{RL}(j) = \#\{k : T_k^n < j \le k\}$$

which represent, respectively, the number of left-to-right and right-to-left crossings at each $j$.

Definition 3: The normalised crossing processes are defined as

$$\Psi_{LR}(j) = \frac{C_{LR}(j)}{n-j} - \frac{j}{n}, \quad \text{and} \quad \Psi_{RL}(j) = \frac{C_{RL}(j)}{j} - \frac{n-j}{n}$$

Final step: We take the estimator of $\gamma$ as

$$\hat{\gamma} = \frac{1}{n} \arg\min_j \left(\max(\Psi_{LR}(j), \Psi_{RL}(j))\right)$$
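The definitions above can be turned into a small, unoptimised sketch. This is an illustration under stated assumptions, not the implementation used for the simulations: the function names are hypothetical, the random choice of $T_i^n$ is taken to be uniform, substrings that would run past the end of the sequence are treated as mismatches, and $j$ is restricted to $1 \le j \le n-1$ so that neither normalisation divides by zero.

```python
import random


def match_length(x, i):
    """Shortest L such that x[i:i+L] starts at no other position k != i (0-based).

    Assumed boundary convention: a substring that would run past the end of x
    counts as a mismatch.
    """
    n = len(x)
    L = 1
    while i + L <= n:
        target = x[i:i + L]
        if not any(x[k:k + L] == target for k in range(n - L + 1) if k != i):
            return L
        L += 1
    return L


def match_targets(x, rng):
    """For each i, draw T_i uniformly from S_i (positions sharing L_i - 1 symbols)."""
    n = len(x)
    targets = []
    for i in range(n):
        m = match_length(x, i) - 1              # length of the repeated prefix
        prefix = x[i:i + m]
        starts = range(min(n, n - m + 1))       # valid 0-based start positions
        candidates = [j for j in starts if j != i and x[j:j + m] == prefix]
        targets.append(rng.choice(candidates))
    return targets


def crossings(targets):
    """Return (C_LR, C_RL) indexed by 1-based j; index 0 is unused."""
    n = len(targets)
    c_lr = [0] * (n + 1)
    c_rl = [0] * (n + 1)
    for k0, t0 in enumerate(targets):
        k, t = k0 + 1, t0 + 1                   # 1-based positions as in the text
        if k < t:
            for j in range(k + 1, t + 1):       # k < j <= T_k
                c_lr[j] += 1
        else:
            for j in range(t + 1, k + 1):       # T_k < j <= k
                c_rl[j] += 1
    return c_lr, c_rl


def estimate_gamma(x, seed=0):
    """Relative change-point estimate: argmin over j of max(Psi_LR, Psi_RL), over n."""
    rng = random.Random(seed)
    n = len(x)
    c_lr, c_rl = crossings(match_targets(x, rng))

    def score(j):
        psi_lr = c_lr[j] / (n - j) - j / n
        psi_rl = c_rl[j] / j - (n - j) / n
        return max(psi_lr, psi_rl)

    return min(range(1, n), key=score) / n


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical input: Bernoulli(0.1) bits followed by Bernoulli(0.5) bits.
    bits = [int(rng.random() < 0.1) for _ in range(150)]
    bits += [int(rng.random() < 0.5) for _ in range(150)]
    print(estimate_gamma(bits))
```

The brute-force match search costs roughly quadratic time in the sequence length, which is acceptable for a few hundred symbols but would need a suffix-tree or suffix-array based search for much longer inputs.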

We performed simulation studies on this algorithm for a sample size of 2000 and change point at 800, 1000, 1100, 1200 and 1400. We ran 100 iterations of the algorithm to determine the average deviation of the estimate from the true value of $\gamma$. The changes were taken from i.i.d. to i.i.d.

A STUDY OF REDUNDANCY RATES

Let $\psi(x)$ represent the compressed version of a sequence $x$ using some universal code and let $|\psi(x)|$ represent the length of $\psi(x)$. Then we know that

$$\frac{|\psi(x)|}{n} \to -\frac{\log(p(x))}{n}$$

in probability, as the length of the sequence, $n$, goes to infinity.

However, $|\psi(x)| + \log(p(x))$ also grows with $n$, and this quantity is defined as the redundancy ($R$) of the code in question. $R/n$ is defined as the redundancy rate, which decays to zero for large $n$.

Definition: For a random process $\pi$ and a universal code $\psi$, the maximum pointwise redundancy is defined as

$$R_{max} = \max_{x_1^n \in A^n} \left(|\psi(x_1^n)| + \log[\pi(x_1^n)]\right)$$

A lot of literature is available on redundancy rates of various codes, and most of it is stated in terms of limits on the rate of decay of the maximum redundancy rate of a code.

We study some of these in this section.

Notation: For a process $\{X_i\}_{i=-\infty}^{\infty}$, $F_a^b$ represents the sigma-field associated with the collection of codewords $\{X_a, X_{a+1}, \dots, X_b\}$.

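Definition: φ-mixing and ψ-mixing

For a process $\{X_i\}_{i=-\infty}^{\infty}$, the φ- and ψ-mixing coefficients are defined as

$$\phi(n) = \sup_{A \in F_{-\infty}^{0},\, B \in F_{n}^{\infty}} \{|P(B|A) - P(B)|\}$$

$$\psi(n) = \sup_{A \in F_{-\infty}^{0},\, B \in F_{n}^{\infty}} \left\{\left|\frac{P(B \cap A)}{P(B)P(A)} - 1\right|\right\}$$

A process is called, respectively, φ-mixing or ψ-mixing when $\phi(n) \to 0$ or $\psi(n) \to 0$. ψ-mixing is the stronger condition.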

Definition: Finite Information Order Property (Yang and Kieffer, [5])

A source $\{X_i\}_{i=1}^{\infty}$ is said to obey the finite information order property if there exists a constant $c$ such that for any finite complete prefix set $M$, the resulting sequence $\{X^{(j)}\}_{j=1}^{\infty}$ of random codewords satisfies

$$\lim_{K \to \infty} \frac{1}{K} \log\left(\frac{P(X^{(1)} X^{(2)} \dots X^{(K)})}{\prod_{j=1}^{K} P(X^{(j)})}\right) \le c$$

Redundancy of Fixed Database Lempel-Ziv (FDLZ) code

Theorem 1 (Yang and Kieffer, [5]): Let $\{X_i\}_{i=1}^{\infty}$ be a stationary source with entropy rate $H$. Suppose $\{X_i\}_{i=1}^{\infty}$ is φ-mixing with summable mixing coefficients, and obeys the finite information order property. Also, let

$$\inf\left\{\frac{P(x)}{P(x_1^{|x|-1})} : P(x) > 0\right\} = \beta > 0$$

Then for sufficiently large $n$,

$$R_{max} \le 2H \frac{\log\log n}{\log n} + O\left(\frac{\log\log\log n}{\log n}\right)$$

Theorem 2 (Yang and Kieffer, [5]): Let $\{X_i\}_{i=1}^{\infty}$ be a stationary source with entropy rate $H$. Suppose $\{X_i\}_{i=1}^{\infty}$ is φ-mixing with summable mixing coefficients. Also, let

$$\inf\left\{\frac{P(x)}{P(x_1^{|x|-1})} : P(x) > 0\right\} = \beta > 0$$

and suppose that

$$\sup\left\{\frac{P(G \cap F)}{P(G)} : F \in F_1^m,\ P(F) > 0,\ G \in F_{m+h}^{\infty},\ P(G) > 0,\ m \ge 1\right\}$$

is finite for some $h > 0$. Then for sufficiently large $n$,

$$R_{max} \ge H \frac{\log\log n}{\log n} + O\left(\frac{\log\log\log n}{\log n}\right)$$

Redundancy of LZ78 code

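Theorem 3 (Yang, 2005 [6]): For a d-mixing source, the pointwise redundancy rate is bounded as

$$\max_{x_1^n \in \chi^n} R(x_1^n) \le \left[\frac{M Q_1}{M-1} + \log\left(\frac{2Md}{(M-1)e}\right)\right] \cdot \frac{\log M}{(1-\varepsilon_n)\log n} + O\left(\frac{\log n}{n}\right)$$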

where $\varepsilon_n \to 0$ and $Q_1$ is the 1st order Markov approximation to $x$, given by

$$Q_1(x_1, x_2, \dots, x_k) = p(x_1) \prod_{j=2}^{k} p(x_j | x_{j-1})$$

A NEW ALGORITHM TO ESTIMATE CHANGE POINT

Motivation: We know that the negative of the log likelihood is of the order of the compressed length for large sequence size. So it is natural to ask the question: can we modify the likelihood ratio test to an entirely non-parametric setting? The natural answer is that we can approximate likelihoods by compressed lengths, for large sequence size.

Hence if $p_0$ is the distribution for the first part and $p_1$ is that for the second part, maximizing $\log(p_0(\text{first part})\, p_1(\text{second part}))$, i.e., minimizing $-\log(p_0(\text{first part})) - \log(p_1(\text{second part}))$, is approximately equivalent to minimizing $|\psi(\text{first part})| + |\psi(\text{second part})|$.

Hence it is quite natural to look for the change point near where the above quantity is minimized.

Some Results:

Let a sequence x take values from a set A.

The type of the sequence $x_1^n$ is defined as the empirical distribution $P_{x_1^n}$, where

$$P_{x_1^n}(a) = \frac{|\{i : x_i = a\}|}{n}, \quad a \in A$$

A distribution $P$ on $A$ is called an n-type if it is the type of some $x_1^n \in A^n$. The set of all $x_1^n \in A^n$ of type $P$ is called the type class of the n-type $P$ and is denoted by $\tau_P^n$.

Theorem 4 (Csiszar and Korner [9]): For every n-type $P$ and every distribution $Q$ on $A$,

$$Q^n(x_1^n) = 2^{-n(H(P) + D(P\|Q))} \quad \text{if } x_1^n \in \tau_P^n$$
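As a numerical sanity check of Theorem 4, the sketch below computes both sides of the identity for one short binary string. The string, the distribution `q` and all function names are illustrative assumptions; entropies and divergences are taken in bits, matching the base-2 exponent in the theorem.

```python
import math
from collections import Counter


def type_of(x):
    """Empirical distribution (type) of the sequence x."""
    n = len(x)
    return {a: c / n for a, c in Counter(x).items()}


def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """Divergence D(P||Q) in bits; q must be positive wherever p is."""
    return sum(v * math.log2(v / q[a]) for a, v in p.items() if v > 0)


def type_identity_sides(x, q):
    """Return (Q^n(x), 2^{-n(H(P)+D(P||Q))}) where P is the type of x."""
    p = type_of(x)
    lhs = math.prod(q[a] for a in x)
    rhs = 2.0 ** (-len(x) * (entropy(p) + kl_divergence(p, q)))
    return lhs, rhs


if __name__ == "__main__":
    # Hypothetical example: seven 0s and three 1s under Q = (0.6, 0.4).
    print(type_identity_sides("0010100010", {"0": 0.6, "1": 0.4}))
```

The two values agree up to floating-point error because $n(H(P) + D(P\|Q)) = -\sum_a n P(a) \log Q(a)$, which depends on the sequence only through its type.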

Theorem 5 (Csiszar [10]): Let $\pi$ and $\tau$ be distributions on $A^{\infty}$, and let $\pi_1$ and $\tau_1$ be their first order marginals respectively.

For testing the null hypothesis $\pi_1$, the test with critical region $\{x_1^n : D(\pi(x_1^n)\|\pi_1) \ge \delta_n\}$, where $\delta_n \to 0$ such that $\delta_n \cdot \frac{n}{\log(n)} \to \infty$, has type I error probability $\to 0$, and for each $\tau_1 \neq \pi_1$, the type II error probability goes to zero at exponential rate $D(\pi_1\|\tau_1)$.

Theorem 6 (Gopalan and Bansal [11]): Let $P_n$ denote the type of a random sample of size $n$ drawn from $P$. Then

$$\Pr(D(P_n\|P) \ge \delta) \le \binom{n+|A|-1}{|A|-1} 2^{-n\delta}, \quad \forall\, \delta \ge 0$$
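To get a feel for how quickly the bound in Theorem 6 decays, the following sketch simply evaluates its right-hand side. The function name and the sample values of $n$ and $\delta$ are illustrative assumptions; $\delta$ is measured in bits.

```python
import math


def type_deviation_bound(n, alphabet_size, delta):
    """Right-hand side of the bound: C(n+|A|-1, |A|-1) * 2^(-n*delta)."""
    count = math.comb(n + alphabet_size - 1, alphabet_size - 1)
    return count * 2.0 ** (-n * delta)


if __name__ == "__main__":
    # Binary alphabet, delta = 0.05 bits (hypothetical values).
    for n in (100, 1000, 10000):
        print(n, type_deviation_bound(n, alphabet_size=2, delta=0.05))
```

The binomial factor grows only polynomially in $n$ while $2^{-n\delta}$ decays exponentially, so the bound can exceed 1 (and is then vacuous) for small $n$ but falls off rapidly once $n\delta$ is large.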

Our Algorithm:

Let $x_1^N$ be a given sequence in which we have to estimate the change point.

Step 1: For every $m$ such that $1 \le m \le N-1$, we break the sequence into two subsequences $x_1^m$ and $x_{m+1}^N$ respectively.

Step 2: We compress each subsequence separately using a universal code and, for each $m$, compute the sum of the compressed lengths of the subsequences.

Step 3: We find the $m$ which minimizes the sum computed in Step 2. This is our estimate of the change point.
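The three steps can be sketched as follows. This is a minimal illustration rather than the code behind the reported simulations: the function names are hypothetical, and the compressed length is an idealised LZ78 cost model (an assumption) in which the $j$-th phrase costs $\lceil \log_2 j \rceil$ pointer bits plus $\lceil \log_2 |A| \rceil$ bits for its new symbol.

```python
def lz78_phrase_count(seq):
    """Number of phrases in the incremental (LZ78) parsing of seq."""
    seen = set()
    current = ()
    count = 0
    for symbol in seq:
        current += (symbol,)
        if current not in seen:      # new phrase: store it and start afresh
            seen.add(current)
            count += 1
            current = ()
    # A trailing, already-seen phrase still has to be transmitted.
    return count + (1 if current else 0)


def lz78_code_length(seq, alphabet_size=2):
    """Idealised LZ78 length in bits under the assumed per-phrase cost model."""
    symbol_bits = max(1, (alphabet_size - 1).bit_length())
    c = lz78_phrase_count(seq)
    # (j - 1).bit_length() equals ceil(log2(j)) for every integer j >= 1.
    return sum((j - 1).bit_length() + symbol_bits for j in range(1, c + 1))


def min_sum_change_points(seq, code_length=lz78_code_length):
    """All m in 1..N-1 minimising |psi(x_1^m)| + |psi(x_{m+1}^N)|."""
    n = len(seq)
    sums = [code_length(seq[:m]) + code_length(seq[m:]) for m in range(1, n)]
    best = min(sums)
    return [m for m, s in enumerate(sums, start=1) if s == best]


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Hypothetical input: Bernoulli(0.05) bits followed by Bernoulli(0.5) bits.
    bits = [int(rng.random() < 0.05) for _ in range(400)]
    bits += [int(rng.random() < 0.5) for _ in range(400)]
    minimisers = min_sum_change_points(bits)
    print(minimisers[0], minimisers[-1])
```

The function returns every minimising $m$ rather than a single value, since ties are common with integer code lengths. Recompressing both halves for each $m$ costs on the order of $N^2$ operations, so this direct form is only practical for short sequences.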

SIMULATION STUDIES

We ran simulations of this algorithm on various samples with changes from i.i.d. to i.i.d. processes, i.i.d. to Markov and Markov to i.i.d.

The results are shown below. Since the algorithm can return multiple minimizing values of $m$, we take the highest, lowest and median of all such values and show all these results.

Case I: i.i.d. to i.i.d.

Case II: i.i.d. to Markov

Case III: Markov to i.i.d.

A note on simulations: In order to reduce runtime and enable a sample size of 80000 to be analysed, we used the approach suggested in Deekshith and Bansal [14], pp 28–29, Method 3.

ZIV’S EMPIRICAL ENTROPY

In [15], Lempel and Ziv have defined the “empirical entropy” for any sequence. Given $w \in A^l$, an arbitrary sequence of length $l$ over $A$, and an input string $x_1^n \in A^n$, we define

$$\delta(x_{i+1}^{i+l}, w) = \begin{cases} 1, & w = x_{i+1}^{i+l} \\ 0, & \text{otherwise} \end{cases} \qquad 0 \le i \le n-l$$

Hence

$$P(x_1^n, w) = \frac{1}{n-l+1} \sum_{i=0}^{n-l} \delta(x_{i+1}^{i+l}, w)$$

represents the relative frequency of occurrence of $w$ in $x_1^n$. $P(x_1^n, w)$ can also be interpreted as a probability measure of $w \in A^l$.

We can define a corresponding “normalised entropy” as

$$H_l(x_1^n) = -\frac{1}{l \log \alpha} \sum_{w \in A^l} P(x_1^n, w) \log P(x_1^n, w)$$

where $\alpha$ is the number of symbols (2 for bits).

Now, Lempel and Ziv [15] define Empirical Entropy as

$$H(x) = \lim_{l \to \infty} \left[\limsup_{n \to \infty} H_l(x_1^n)\right]$$
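For a finite string and a fixed block length $l$, the normalised entropy $H_l(x_1^n)$ can be computed directly from sliding-window counts. The sketch below is illustrative (the function name is an assumption) and does not attempt the double limit that defines $H(x)$.

```python
import math
from collections import Counter


def normalised_block_entropy(x, l, alphabet_size=2):
    """H_l(x_1^n): entropy of the empirical distribution of length-l windows,
    divided by l*log(alpha) so that the result lies in [0, 1]."""
    windows = len(x) - l + 1
    if windows <= 0:
        raise ValueError("block length l must not exceed len(x)")
    # Relative frequency of each length-l word over the n - l + 1 windows.
    counts = Counter(tuple(x[i:i + l]) for i in range(windows))
    entropy = -sum((c / windows) * math.log(c / windows) for c in counts.values())
    return entropy / (l * math.log(alphabet_size))


if __name__ == "__main__":
    print(normalised_block_entropy("01010101", 1))   # balanced single symbols
    print(normalised_block_entropy("01010101", 2))   # only two distinct 2-blocks
```

On the alternating string the value drops as $l$ goes from 1 to 2, because the single-symbol statistics look balanced while the 2-blocks reveal the deterministic structure.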

Theorem 7 (Ellis [16]): The empirical entropy is an affine function of the empirical distribution. If $P$ and $Q$ are empirical distributions, then for every $0 < \lambda < 1$,

$$H[\lambda P + (1-\lambda) Q] = \lambda H(P) + (1-\lambda) H(Q)$$

CONVERGENCE OF THE ESTIMATOR: A JUSTIFICATION

We consider the case where both the distributions before and after change are i.i.d.

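Let $x_1^N$ be a sequence in which the first $n$ symbols (i.e. $x_1^n$) are i.i.d. from distribution $p_0$, while the $(n+1)$th to $N$th symbols are i.i.d. from distribution $p_1$. Let $\frac{n}{N} = \gamma$. Let $|\psi(x_a^b)|$ denote the compressed length of the sequence $x_a^b$ using LZ78. Let $\hat{\gamma}$ be the estimate of $\gamma$ based on our algorithm.

Let us first assume that $\hat{\gamma}$ is greater than $\gamma$. Then, normalising the compressed lengths by $N$,

$$\frac{1}{N}\left(|\psi(x_1^{\hat{\gamma} N})| + |\psi(x_{\hat{\gamma} N+1}^{N})|\right) \sim \hat{\gamma} H\left(\frac{\gamma}{\hat{\gamma}} p_0 + \left(1 - \frac{\gamma}{\hat{\gamma}}\right) p_1\right) + (1-\hat{\gamma}) H(p_1) \quad \text{(Theorem 7)}$$

By concavity of the function $H(\cdot)$, the above expression is at least

$$\hat{\gamma}\left(\frac{\gamma}{\hat{\gamma}} H(p_0) + \left(1 - \frac{\gamma}{\hat{\gamma}}\right) H(p_1)\right) + (1-\hat{\gamma}) H(p_1) = \gamma H(p_0) + (1-\gamma) H(p_1) \quad (1)$$

It can be easily shown that the same lower bound holds when $\hat{\gamma} < \gamma$.

As $n \to \infty$, the lower bound converges to $\gamma H(p_0) + (1-\gamma) H(p_1)$. Also, the lower bound is attained at $\hat{\gamma} = \gamma$, which is the point we approach when we minimize the sum of the compression ratios with respect to $\hat{\gamma}$. Therein lies the justification for minimizing the sum of the compression ratios.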

SIMULATION STUDIES WITH LZ77

As in an earlier section, we ran simulations of this algorithm on various samples with changes from i.i.d. to i.i.d. processes, i.i.d. to Markov, Markov to i.i.d. and Markov to Markov, with LZ77 as the universal code.

Since the algorithm can return multiple minimizing values of $m$, we take the lowest deviation, highest index and median of all such values and show all these results, as before.

Case I: i.i.d. to i.i.d.

Case II: i.i.d. to Markov

Case III: Markov to i.i.d.

Case IV: Markov to Markov

ANOTHER VERSION OF THE ALGORITHM

From the ideas used in the justification for the convergence of the earlier estimator, we have derived another form of the algorithm. The basic idea is the same: estimating the change point using a universal code.

Suppose we compute the compressed length $L(m) = |\psi(x_1^m)|$ for each $m$ from 1 to $N$, and plot it versus $m$.

For $m < n$, $L(m) \sim H(p_0)\, m$

For $m > n$, $L(m) \sim H(p_0)\, n + (m-n) H(p_1)$

Hence at $m = n$, there is a change in slope of $L(m)$. Thus $n$ can be estimated by estimating the point where this slope change takes place. This method can also be used to detect multiple changes in the data. This aspect is discussed in a later section.

In the following section, we have simulated the performance of this method for change from i.i.d. to i.i.d., i.i.d. to Markov and vice versa, and Markov to Markov (single change-point case), for both LZ78 and LZ77. The threshold for the change in slope is kept at 0.09, i.e. if the change in slope between consecutive points is greater than or equal to 0.09, a change-point is declared.
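A minimal sketch of the slope-change idea is given below. Several details are assumptions made for illustration and are not taken from the report: the prefix lengths come from an idealised LZ78 cost model computed in a single pass, slopes are estimated between grid points spaced `step` symbols apart, and the function names are hypothetical. Only the 0.09 threshold comes from the text.

```python
def lz78_prefix_lengths(seq, alphabet_size=2):
    """L(m) for m = 0..N: bits emitted by an idealised LZ78 coder after m symbols.

    Assumed cost model: the j-th phrase costs ceil(log2 j) pointer bits plus
    ceil(log2 |A|) symbol bits; a phrase is charged when it is completed.
    """
    symbol_bits = max(1, (alphabet_size - 1).bit_length())
    phrases = set()
    current = ()
    bits = 0
    lengths = [0]
    for symbol in seq:
        current += (symbol,)
        if current not in phrases:
            phrases.add(current)
            bits += (len(phrases) - 1).bit_length() + symbol_bits
            current = ()
        lengths.append(bits)
    return lengths


def slope_change_points(lengths, step, threshold=0.09):
    """Grid points m where the slope of L(m) changes by at least `threshold`.

    `lengths[m]` is L(m); slopes are measured over consecutive windows of
    `step` symbols (the grid spacing is an assumption of this sketch).
    """
    grid = list(range(0, len(lengths), step))
    slopes = [(lengths[b] - lengths[a]) / step for a, b in zip(grid, grid[1:])]
    return [grid[i + 1] for i in range(len(slopes) - 1)
            if abs(slopes[i + 1] - slopes[i]) >= threshold]


if __name__ == "__main__":
    # Synthetic, noise-free L(m): slope 0.5 up to m = 1000, slope 0.9 afterwards.
    ideal = [0.5 * m if m <= 1000 else 500 + 0.9 * (m - 1000) for m in range(2001)]
    print(slope_change_points(ideal, step=100))
```

Because every slope change above the threshold is reported, the same routine returns several points when the data contain several changes. With real compressed lengths the slope estimates are noisy, so the grid spacing and threshold have to be tuned, and neighbouring grid points may both be flagged around a single change.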

SIMULATION STUDIES FOR THE MODIFIED VERSION

Case I: i.i.d. to i.i.d. with LZ78

Case II: i.i.d. to Markov with LZ78

Case III: Markov to i.i.d. with LZ78

Case IV: Markov to Markov with LZ78

Case V: i.i.d. to i.i.d. with LZ77

Case VI: i.i.d. to Markov with LZ77

Case VII: Markov to i.i.d. with LZ77

Case VIII: Markov to Markov with LZ77

DETECTION OF CHANGE IN TEXT

In this section, we have applied both the minimum-sum and slope-change versions of the algorithm (with both LZ77 and LZ78) to detect the change-point between two extracts of text, one taken from Agatha Christie [17] and the other from Ernest Hemingway [18]. The sample size is kept at 80000. The results are as follows:

From simulations, it is seen that LZ78 performs slightly better in the minimum-sum version, whereas LZ77 performs slightly better in the slope-change version. This can be explained as follows. Since LZ78 has a smaller redundancy, it approximates the distribution somewhat better for finite sample size. On the other hand, since LZ77 has a sliding window, its database is more recent; hence it learns the new distribution more quickly, giving better results in the slope-change scheme.

DETECTION OF MULTIPLE CHANGES

In this section, we have used the LZ77 and LZ78 algorithms to estimate multiple change points in a data sequence (with the number of change-points unknown). We have taken 2 change points in the data, and used the slope-change framework. The changes are all from i.i.d. to i.i.d. The results shown below represent the Euclidean distance of the estimated change point vector $(\hat{\gamma}_1, \hat{\gamma}_2)$ from the actual change point vector $(\gamma_1, \gamma_2)$, in all those cases where the presence of exactly two change points was predicted by the algorithm. In addition, another set of graphs represents how many times the prediction of the number of change-points itself was wrong.

FUTURE SCOPE

Shamir, Merhav and others ([19], [20], [21]) have studied the redundancy rates of arithmetic codes under scenarios with change point. However, no similar study has been conducted, to our knowledge, with LZ codes. If some results on redundancy rates of LZ77 and LZ78 are available, one can find the rate at which the above estimators converge to the actual values.

REFERENCES

1. Jerzy Neyman, Egon Pearson (1933) "On the Problem of the Most Efficient Tests of Statistical Hypotheses", Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231: 289–337

2. G. Lorden, “Procedures for Reacting to a Change in Distribution”, Annals of Mathematical Statistics, 1971, Vol 42, No 6, 1897-1908

3. Abraham Wald, “Sequential Tests of Statistical Hypotheses”, Annals of Mathematical Statistics, 1945, Vol 16, No 2, 117-186

4. Oliver Johnson, Dino Sejdinovic, James Cruise, Ayalvadi Ganesh and Robert Piechocki, “Non-Parametric Change-point Detection using String Matching Algorithms”

5. En-hui Yang and John C. Kieffer, “On the Redundancy of the Fixed-Database Lempel–Ziv Algorithm for φ-Mixing Sources”, IEEE Transactions on Information Theory, Vol 43, No 4, July 1997

6. En-hui Yang, Lihua Song, Gil I. Shamir, and John C. Kieffer, “On the Pointwise Redundancy of the LZ78 Algorithm”, ISIT 2005, 4 – 9 September 2005, pp 495 – 499

7. Imre Csiszar and Paul C. Shields, “Redundancy Rates for Renewal and Other Processes”, IEEE Transactions on Information Theory, Vol 42, No 6, November 1996

8. A.J. Wyner, “String Matching Theorems and Application to Data Compression and Statistics”, PhD dissertation, Stanford University, Stanford, CA, 1993

9. Imre Csiszar and Janos Korner, “Coding Theorems for Discrete Memoryless Systems”, Mathematical Institute of the Hungarian Academy of Sciences

10. I. Csiszar, “The Method of Types”, IEEE Transactions on Information Theory, Vol 44, pp 2505 – 2523, 1998

11. AK Gopalan and RK Bansal, “On Error Rate in Hypothesis Testing based on Universal Compression Algorithms”, IEEE Information Theory Workshop, 2006

12. Thomas M. Cover and Joy A. Thomas, “Elements of Information Theory”, 2nd Edition, Wiley 2006

13. A.D. Wyner and A.J. Wyner, “Improved Redundancy of a Version of the Lempel-Ziv Algorithm”, IEEE Transactions on Information Theory, Vol 41, No 3, May 1995

14. Deekshith Rao Juvvadi, “Sequential Change Detection Using Universal Estimators of Entropy and Divergence”, MTech Thesis, Department of Electrical Engineering, IIT Kanpur

15. J. Ziv and A. Lempel, “Compression of Individual Sequences via Variable-Rate Coding”, IEEE Transactions on Information Theory, Vol. IT-24, No. 5, September 1978

16. Richard S. Ellis, “Entropy, Large Deviations, and Statistical Mechanics”, Springer 1985

17. Agatha Christie, “The ABC Murders”, Collins Crime Club, January 1936

18. Ernest Hemingway, “The Old Man and the Sea”, September 1952

19. Gil I. Shamir and Daniel J. Costello, Jr, “Universal Lossless Compression of Piecewise Stationary Slowly Varying Sources”, in Proc. Data Compression Conference, 2001

20. Gil I. Shamir and Neri Merhav, “Low-Complexity Sequential Lossless Coding for Piecewise-Stationary Memoryless Sources”, IEEE Transactions on Information Theory, Vol. 45, No. 5, July 1999

21. N. Merhav, “On the minimum description length principle for sources with piecewise constant parameters,” IEEE Transactions on Information Theory, Vol. 39, pp. 1962–1967, November 1993
