TRANSCRIPT
8/14/2019 Entropy Statistics Tests Goodness of Fit Wald Lagrange Likelihood Ratio
Cairo University
Institute of Statistical Studies & Research
Department of Mathematical Statistics
Tests Based on Sampling Entropy
By
Mohamed Soliman Abdallah
Supervised By
Prof. Samir Kamel Ashour, Professor of Mathematical Statistics, Department of Mathematical Statistics, I.S.S.R., Cairo University
Dr. Esam Aly Amin, Lecturer of Mathematical Statistics, Department of Mathematical Statistics, I.S.S.R., Cairo University
A Thesis Submitted to the Department of Mathematical Statistics
In Partial Fulfillment of the Requirements for the Degree of
MASTER OF SCIENCE IN
Mathematical Statistics
2009
Acknowledgement
All Gratitude due to ALLAH
I would like to extend my appreciation to my advisor Prof. Samir
Kamel Ashour for his support, guidance, and patience during the
completion of this study and my graduate studies; I feel very lucky
to have worked with such a person. Special thanks to Dr. Esam Ali
Amin for his encouragement to complete this thesis.
I would like to express my gratitude to my parents, my brother,
my sister, Mr. Ebrahim and Mr. Tag Eldean, and everyone who helped me
throughout this work.
I cannot forget to thank Wikipedia, which made the research
process easier and faster than before.
Summary
Goodness of fit tests play a vital role in scientific and academic work, so many
goodness of fit tests have been suggested by researchers over the previous decades,
each with its own philosophy behind its derivation. In particular, this study deals
in some detail with tests based on sampling Shannon's (1948) entropy.
There are mainly two schools that employ the sampling entropy for goodness of
fit. On the one hand there is the non-parametric estimation approach, which contains
two routes: first, Vasicek's (1975) approach, which tests the sample's distribution by
considering the ratio between the observed entropy and the expected entropy; second,
the approach of Arizono and Ohta (1989), which tests the sample's distribution by
considering the difference between the observed entropy and the expected entropy.
On the other hand there is the parametric approach of Stengos and Wu (2004, 2007),
who proposed four flexible tests derived by the Lagrange multiplier test based on the
principle of maximum entropy proposed by Jaynes (1957).
In this study we concentrate on the parametric approach by deriving the
Stengos and Wu (2004, 2007) tests by the Wald and likelihood ratio tests instead of
the Lagrange multiplier test; in addition we carry out simulated comparisons among
the entropy estimators mentioned in this essay, as well as numerical comparisons
among all tests based on Shannon's (1948) entropy; finally we propose some academic
points that can be pursued in future research.
Table of Contents
Pages
Chapter (1) Introduction 1
Chapter (2) Definitions and Notation
2.1 Estimation Theory 4
2.2 Methods of Estimation 8
2.2.1 Method of Moments 8
2.2.2 Method of Maximum Likelihood 9
2.2.3 Method of Least Squares 10
2.3 Hypotheses Testing 12
2.3.1 Introduction to Hypotheses Testing 12
2.3.2 Tests Based on Likelihood Function 13
2.4 Measures of Information 14
2.4.1 Shannon Entropy and Related Measures 19
2.4.2 Kullback-Leibler Divergence (Relative Entropy) 22
Chapter (3) Goodness of Fit Based on Maximum Entropy 25
3.1 Parameters Estimation Based on Maximum Entropy 25
3.2 Entropy Estimation Using Sampling m-Spacing 35
3.2.1 Entropy Estimation Using Vasicek's Estimator 35
3.2.2 Entropy Estimation Using Correa's Estimator 41
3.2.3 Entropy Estimation Using Wieczorkowski et al.'s Estimators 42
3.3 Goodness of Fit Based on Maximum Entropy 45
3.3.1 Goodness of Fit Based on Likelihood Tests 46
3.3.2 Goodness of Fit Based on Sampling m-Spacing 58
Chapter (4) Monte Carlo Simulation of the Entropy Tests 67
4.1 Simulated Results for the Performance of Entropy Estimators 67
4.2 Power Comparisons Among Tests Based on Sampling Entropy 69
4.2.1 Simulated Results for Testing Normality 69
4.2.2 Simulated Results for Testing Uniformity 73
4.2.3 Simulated Results for Testing Exponentiality 74
4.3 Extensions 76
Appendices 77
Appendix (1): Tables 78
Appendix (2): Programs 128
References 160
Chapter (I)
Introduction
Testing the distribution of a sample has long been an interesting issue in
the literature; it is considered a key topic in statistical modeling and is
potentially useful for developing statistical methodology. In particular, this essay
is concerned with tests based on sampling Shannon's (1948) entropy.
Entropy as a statistical concept was formulated by Shannon (1948) to measure
the uncertainty, or the size of the information, in a sample, such that increasing
entropy denotes less information and more uncertainty. In addition, Kullback and
Leibler (KL) (1951) proposed an indicator to measure the divergence between two
samples by comparing the amount of information that can be obtained from each
sample, so that high values of KL refer to a wide gap between the two samples and
vice versa.
Jaynes (1957) utilized the Shannon entropy and proposed a flexible tool for
estimating the probability of events using prior information, for instance the
mean or the variance of the events; Singh and Rajagopal (1986) then extended this
tool for estimating the parameters of frequency distributions.
Vasicek (1975) discovered a new route for testing normality based on
sampling entropy. The main problem he faced was how to estimate the sampling
entropy; he chose to estimate the entropy function using m-spacings. After that,
sampling entropy became a well-known topic reported in a variety of fields, so
that there are four different entropy estimators based on m-spacings besides
Vasicek's (1975) estimator; this study will concentrate on Correa's (1995) estimator
and the two estimators of Wieczorkowski and Grzegorzewski (1999).
Dudewicz and Van der Meulen (1981) extended Vasicek's (1975) approach
for testing uniformity, while Gokhale (1983) applied Vasicek's (1975) idea to many
distributions. Taufer (2002) proposed a new idea for testing exponentiality via
the two transformations proposed by Seshadri and Csorgo (1969): his idea applies
one of the two transformations to the data and then applies Dudewicz and
Van der Meulen's (1981) test, so that if the transformed data are truly uniform one
can conclude that the original data follow an exponential distribution, and vice versa.
Arizono and Ohta (1989) proposed an interesting idea for testing
normality by utilizing KL to test the divergence between the observed sample and the
expected sample under the null hypothesis; moreover, Ebrahimi and Habibullah
(1991) applied the KL approach to the exponential distribution, and Mao (2002) applied
the same approach to multivariate distributions.
Stengos and Wu (2004, 2007) proposed another way of testing
normality based on entropy via the Lagrange multiplier test: they used Jaynes' (1957)
idea to derive four flexible normality tests. This idea can be regarded as the
parametric approach because it does not need to estimate the entropy function.
The main purposes of this study can be summarized as follows:
1. Review both the parametric and the non-parametric goodness of fit tests based
on sampling entropy.
2. Instead of deriving only the Lagrange multiplier test for testing normality, as
Stengos and Wu (2004, 2007) did, derive the Wald and likelihood ratio tests for
testing normality as well, and make a comparison among the three tests.
3. Investigate numerically the performance of the entropy estimators based on
m-spacings.
4. Carry out a numerical comparison among the non-parametric tests on the one
hand and the parametric tests on the other.
The components of this study are organized in the following form:
1. The second chapter is divided into four parts: the first part discusses some
topics in estimation theory, the second part shows in brief some methods of
estimation, the third part concentrates on hypothesis testing and some common
tests that will be used in this study, and finally the fourth part deals with a
review of Shannon's (1948) entropy and some related measures of information.
2. The third chapter is essentially divided into three parts: the first part focuses
on estimation based on the principle of maximum entropy, the second part is
concerned with some estimators of entropy based on m-spacings, and the third
part explains both the parametric and the non-parametric tests based on sampling
entropy.
3. Finally, the fourth chapter contains comments on the numerical results
concluded from the Monte Carlo simulation, as well as suggestions for future
academic research.
Chapter (II)
Definitions and Notation
This chapter is concerned with some important definitions and notation that
will be used in this study. The first section deals with some definitions related to
estimation theory, the second is concerned with different approaches to
estimation, the third section is devoted to some topics in hypothesis testing,
and finally the fourth section discusses Shannon's entropy and some measures of
information.
2.1 Estimation Theory
In most statistical studies the parameters of the population are unknown and
must be estimated from the sample, because it is impossible or just too much trouble
(in terms of time or expense) to look at the entire population; therefore estimation
theory has a vital role in statistical inference and is divided into point estimation
and interval estimation.
Definition (2.1.1): A point estimate is a number obtained from computations on the
observed values of the random sample that serves as an approximation to a
parameter of the population.
It is important to point out the difference between an estimate and the
corresponding estimator: an estimate is a particular value calculated from a
specified sample of observations, whereas an estimator is a random variable whose
realized value is the point estimate. Of course, there are many estimators
corresponding to each population parameter $\theta$, so one would probably obtain
many different estimates for $\theta$; thus it is required to discuss the criteria
that make one estimator preferable to another.
Definition (2.1.2): Suppose $\hat\theta$ is a statistic computed from the observed
random sample and considered as a point estimator for $\theta$. We call
$\hat\theta$ an unbiased estimator for $\theta$ iff:
$$E(\hat\theta) = \theta$$
If the previous condition holds only as the sample size grows large, we call
$\hat\theta$ an asymptotically unbiased estimator for $\theta$.
Definition (2.1.3): Suppose $\hat\theta_1$, $\hat\theta_2$ are two estimators for
$\theta$ and:
$$\frac{MSE(\hat\theta_1)}{MSE(\hat\theta_2)} = \frac{E(\hat\theta_1-\theta)^2}{E(\hat\theta_2-\theta)^2} < 1$$
then $\hat\theta_1$ is more efficient than $\hat\theta_2$, where $MSE$ refers to the
mean square error of the estimator. If the previous condition holds only as the
sample size grows large, then $\hat\theta_1$ is asymptotically more efficient than
$\hat\theta_2$.
Definition (2.1.4): The statistic $\hat\theta$ is a consistent estimator for
$\theta$ iff:
$$\lim_{n\to\infty} P(|\hat\theta-\theta| < \epsilon) = 1, \qquad \text{where } \epsilon > 0$$
It is obvious that consistency is an asymptotic property, sometimes called
convergence in probability. If $\hat\theta$ is an unbiased estimator for $\theta$
and its MSE tends to zero as the sample size grows large, then $\hat\theta$ is a
consistent estimator for $\theta$.
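Consistency as defined above can be illustrated by a short simulation (a sketch of my own, not one of the thesis's programs): the sample mean of Uniform(0, 1) draws is unbiased with MSE tending to zero, so the probability that it lies within a fixed $\epsilon$ of the population mean 0.5 should approach one as $n$ grows.

```python
import random

# Sketch: estimate P(|mean - 0.5| < eps) for the sample mean of
# Uniform(0, 1) data at two sample sizes; the fraction should rise
# toward 1 as n grows, illustrating convergence in probability.
random.seed(1)

def coverage(n, eps=0.05, reps=2000):
    """Fraction of replications where the sample mean is within eps of 0.5."""
    hits = 0
    for _ in range(reps):
        xs = [random.random() for _ in range(n)]
        if abs(sum(xs) / n - 0.5) < eps:
            hits += 1
    return hits / reps

small, large = coverage(10), coverage(1000)
print(small, large)  # the second fraction is markedly closer to 1
```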
Definition (2.1.5): An estimator $\hat\theta$ is said to be a sufficient statistic
iff it utilizes all the information in the sample relevant to the estimation of
$\theta$; that is, all the knowledge about $\theta$ that can be gained from the
whole sample can just as well be gained from $\hat\theta$ alone. In mathematical
form, $\hat\theta$ is a sufficient statistic iff the conditional probability
distribution of the random sample given $\hat\theta$ is independent of $\theta$.
Definition (2.1.6): The probability density function (p.d.f.) $f(x,\theta)$ is
considered a member of the exponential family if $f(x,\theta)$ can be rewritten in
the following form:
$$f(x,\theta) = a(\theta)\, b(x)\exp\{c(\theta)\, g(x)\}$$
where $a(\theta)$ and $c(\theta)$ are two functions of $\theta$, and $b(x)$ and
$g(x)$ are two functions of $x$. Furthermore, the exponential family can be extended
to more than one parameter as follows:
$$f(x,\theta) = a(\theta)\, b(x)\exp\Big\{\sum_{j=1}^{J} c_j(\theta)\, g_j(x)\Big\}$$
One of the advantages of $f(x,\theta)$ belonging to the exponential family is that
$\sum_i g_1(x_i), \sum_i g_2(x_i), \dots, \sum_i g_J(x_i)$ can be considered joint
sufficient statistics for $\theta$.
Cramér and Rao proposed an inequality giving the lower bound for the
variance of an unbiased estimator. Assume $\hat\theta$ is an unbiased estimator for
$\theta$; then:
$$V(\hat\theta) \ge \frac{1}{nE\left[\left(\frac{d}{d\theta}\ln f(x;\theta)\right)^2\right]} \qquad (2.1.1)$$
If the two sides coincide, then $\hat\theta$ is the best estimator for $\theta$
among unbiased estimators. From inequality (2.1.1), we notice the following points:
1. Inequality (2.1.1) can take another equivalent form:
$$V(\hat\theta) \ge \frac{1}{nE\left[\left(\frac{d}{d\theta}\ln f(x;\theta)\right)^2\right]} = \frac{1}{-nE\left[\frac{d^2}{d\theta^2}\ln f(x;\theta)\right]}$$
2. The denominator of (2.1.1) is called the Fisher information $I(\theta)$, which is
an index of the size of the information in the sample concerning $\theta$. Obviously,
more information leads to more accuracy, meaning less variability. If $\theta$ is a
vector of parameters of order $J \times 1$, the Fisher information becomes the
information matrix of order $J \times J$, which can be expressed as:
$$I(\theta)_{ij} = \begin{cases} -nE\left[\dfrac{d^2}{d\theta_i^2}\ln f(x,\theta)\right] & \text{if } i = j \\[6pt] -nE\left[\dfrac{d^2}{d\theta_i\, d\theta_j}\ln f(x,\theta)\right] & \text{if } i \ne j \end{cases}$$
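The Cramér-Rao bound above can be checked numerically (a sketch of my own, using a normal model as the assumed example): for $X \sim N(\mu, \sigma^2)$ with $\sigma$ known, the per-observation Fisher information for $\mu$ is $1/\sigma^2$, so the bound for $n$ observations is $\sigma^2/n$, which the sample mean attains.

```python
import random
import statistics

# Sketch: for N(mu, sigma^2) with sigma known, I(mu) = n / sigma^2, so
# the Cramer-Rao lower bound is sigma^2 / n.  The sample mean is
# unbiased and attains this bound; we verify by Monte Carlo.
random.seed(2)
mu, sigma, n, reps = 3.0, 2.0, 25, 4000

crlb = sigma ** 2 / n  # 1 / (n * I_per_obs(mu)) = sigma^2 / n
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]
var_of_mean = statistics.variance(means)  # should sit near the bound
print(crlb, round(var_of_mean, 3))
```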
Definition (2.1.7): A confidence interval is an interval determined by two numbers
obtained from computations on the observed values that is expected to contain the
parameter $\theta$ in its interior.
Let $X_1, X_2, \dots, X_n$ be a random sample from $f(x;\theta)$, and assume
$L = t_1(x_1, x_2, \dots, x_n)$ and $U = t_2(x_1, x_2, \dots, x_n)$ satisfy
$L < \theta < U$. A common construction uses a pivotal quantity
$Q(X_1, X_2, \dots, X_n;\theta)$ whose distribution is free of $\theta$:
1. The statement
$$P(a < Q(X_1, X_2, \dots, X_n;\theta) < b) = 1 - \alpha$$
is converted to:
$$P(g_1(X_1, X_2, \dots, X_n) < \theta < g_2(X_1, X_2, \dots, X_n)) = 1 - \alpha$$
where $(a, b, \alpha)$ can be regarded as values free of $\theta$.
2. Obtain two values $(a, b)$ in the domain of the pivotal quantity, with $a < b$,
which minimize the length of the interval:
$$\text{Length} = g_2(X_1, X_2, \dots, X_n; b) - g_1(X_1, X_2, \dots, X_n; a)$$
subject to
$$\int_a^b h(Q(X_1, X_2, \dots, X_n;\theta))\, dQ = 1 - \alpha$$
where $h(Q(X_1, X_2, \dots, X_n;\theta))$ is the sampling distribution of
$Q(X_1, X_2, \dots, X_n;\theta)$.
Furthermore, Guenther (1969) concluded that two-sided confidence intervals
based on symmetric distributions can be considered the shortest confidence intervals
because of the symmetry of the distribution, whereas a confidence interval based on
an asymmetric distribution cannot be considered the shortest; he therefore
recommended using the table proposed by Tate and Klett (1959) for the shortest
confidence interval based on the chi-square distribution for different sample sizes
and various levels of significance.
2.2 Methods of Estimation
Having established some criteria for judging the performance of estimators,
it is now required to discuss in brief the methods of estimating the population's
parameters. Many methods for constructing estimators have been proposed in the
statistical literature; this section is concerned with three methods of estimation.
2.2.1 Method of Moments
It is difficult to trace back who introduced the method of moments (MOM),
but Johann Bernoulli (1667-1748) was the first to use the method in his work (see
Gelder (1997)). This method is based on solving simultaneously a system of $J$
equations that match the observed sample moments with the corresponding population
moments, where $J$ refers to the number of estimated parameters. Typically,
different types of observed sample moments can be used, as follows:
1. The moments about zero (raw moments):
$$m_j = \frac{\sum_{i=1}^{n} x_i^j}{n} = E(x^j)$$
2. The central moments:
$$\frac{\sum_{i=1}^{n} (x_i-\bar{x})^j}{n} = E(x-\bar{x})^j$$
3. The standard moments:
$$\frac{\sum_{i=1}^{n}\left(\dfrac{x_i-\bar{x}}{\sigma}\right)^j}{n} = E\left(\frac{x-\bar{x}}{\sigma}\right)^j$$
where $\bar{x}$ and $\sigma$ refer to the mean and the standard deviation of the
probability density function (p.d.f.) respectively. The method of moments in general
provides estimators that are biased but consistent for large sample sizes, and not
efficient; they are often used because they lead to very simple computations, and
they may serve as first approximations or initial values for other methods that
require iteration. The method is not unique: instead of using the raw moments, we
can use the central moments and thereby obtain different estimators; unfortunately,
in some cases MOM is not applicable at all.
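A minimal sketch of the moment-matching step (the exponential example and names are my own, not the thesis's): for an Exponential(rate) sample, the first population raw moment is $E(X) = 1/\text{rate}$, so equating it to the sample mean $m_1$ gives the MOM estimator $\widehat{\text{rate}} = 1/m_1$.

```python
import random
import statistics

# Sketch: method-of-moments estimation of the rate of an exponential
# distribution by matching the first raw moment E(X) = 1/rate to the
# first sample moment m_1 (the sample mean).
random.seed(3)
true_rate = 2.0
sample = [random.expovariate(true_rate) for _ in range(20000)]

sample_mean = statistics.fmean(sample)  # first sample raw moment m_1
rate_hat = 1.0 / sample_mean            # solve m_1 = 1/rate for rate
print(round(rate_hat, 2))               # close to the true rate 2.0
```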
2.2.2 Method of Maximum Likelihood
It is difficult to trace who discovered this tool, but Bernoulli in 1700 was the
first to report on it (see Gelder (1997)). The idea is that the specified sample
should be given a high probability of being drawn, so it is required to search for
the parameters that maximize the likelihood function of the specified sample. The
likelihood function is the joint density function of the completely random sample,
taking the following form:
$$L(x_1, \dots, x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta)$$
The method of maximum likelihood estimates $\theta$ by searching for the
value $\hat\theta$ that maximizes $L(x_1, \dots, x_n;\theta)$; hence $\hat\theta$
is called the maximum likelihood estimator (MLE). Indeed, $\hat\theta$ is obtained
in many cases by solving the following equation:
$$\frac{dL(x_1, \dots, x_n;\theta)}{d\theta} = 0$$
In addition, the maximum likelihood method can be used to estimate $J$
unknown parameters by solving simultaneously the following $J$ homogeneous
equations:
$$\frac{dL(x_1 \dots x_n;\theta)}{d\theta_j} = 0, \qquad j = 1 \dots J \qquad (2.2.1)$$
Indeed, $\hat\theta$ cannot be obtained from (2.2.1) if the following conditions
(often called regularity conditions) are not valid:
1. The first and second derivatives of the likelihood function must be defined.
2. The range of the $X$'s does not depend on the unknown parameters.
3. The Fisher information corresponding to each parameter is greater than zero.
Typically, solving (2.2.1) is not easy; thus one can use a monotonic
transformation that makes the calculation easier:
$$\frac{d\ln L(x_1 \dots x_n;\theta)}{d\theta_j} = \sum_{i=1}^{n} \frac{d\ln f(x_i;\theta)}{d\theta_j}$$
In general, MLE estimates are asymptotically unbiased and consistent
estimators of the parameters. They have a powerful property called invariance:
if $\hat\theta$ is the MLE for $\theta$, then $g(\hat\theta)$ is the MLE for
$g(\theta)$. Furthermore, MLE estimates are asymptotically normally distributed, so
confidence intervals derived from MLE estimates can be considered the shortest
confidence intervals when the sample size is large. If there is an efficient
estimator for $\theta$ that achieves the Cramér-Rao lower bound, it must be the
MLE. If $\theta$ is a location parameter then $\hat\theta - \theta$ is a pivotal
quantity; also, if $\theta$ is a scale parameter then $\hat\theta/\theta$ is a
pivotal quantity.
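The maximization described above can be sketched concretely (my own example, not the thesis's): for an Exponential(rate) sample the log-likelihood is $n\ln(\text{rate}) - \text{rate}\sum x_i$, and setting its derivative to zero gives the closed-form MLE $\widehat{\text{rate}} = 1/\bar{x}$; a coarse grid search over the log-likelihood should land at the same value.

```python
import math
import random
import statistics

# Sketch: MLE for the exponential rate, comparing the closed-form
# solution 1/mean(x) against a brute-force maximization of the
# log-likelihood n*ln(rate) - rate*sum(x) over a grid.
random.seed(4)
sample = [random.expovariate(1.5) for _ in range(5000)]
n, s = len(sample), sum(sample)

def loglik(rate):
    """Exponential log-likelihood: n*ln(rate) - rate * sum(x)."""
    return n * math.log(rate) - rate * s

closed_form = 1.0 / statistics.fmean(sample)
grid = [i / 1000 for i in range(100, 3000)]  # candidate rates 0.1 .. 2.999
grid_mle = max(grid, key=loglik)             # grid maximizer
print(round(closed_form, 2), round(grid_mle, 2))  # the two agree
```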
2.2.3 Method of Ordinary Least Squares
The method of least squares, or ordinary least squares (OLS), plays a vital
role in statistical research, particularly regression analysis, and is historically
much older than the method of moments and the method of maximum likelihood; it is
interesting to note that it was proposed by Gauss (see Gelder (1997)). Typically,
OLS is used to estimate the relation between two variables, known as the
independent and dependent variables. Least squares problems fall into two
categories, linear and non-linear models. The linear least squares problem has a
closed-form solution, whereas the non-linear problem in general does not; this
study will focus on one independent variable. Suppose there is a theoretical
relation between $Y$ and $X$ that can be expressed as:
$$Y_i = B_0 + B_1 X_i + U_i, \qquad i = 1 \dots n$$
where:
$Y_i$: the dependent or response random variable;
$X_i$: the independent fixed variable;
$U_i$: a random variable representing the residual (error) of the model.
For estimating $B_0$ and $B_1$, one might suggest obtaining the estimators that
minimize $\sum_{i=1}^{n} U_i$, but since the residuals are either positive or
negative, their sum may be small even for poor estimators. To avoid this problem
one could resort to minimizing $\sum_{i=1}^{n} |U_i|$, but sums of absolute values
are not convenient to work with mathematically. To overcome this difficulty, OLS
states that $B_0$ and $B_1$ should minimize $\sum_{i=1}^{n} U_i^2$; taking the
partial derivatives with respect to $B_0$ and $B_1$ respectively:
$$\frac{d}{dB_0}\sum_{i=1}^{n} U_i^2 = -2\sum_{i=1}^{n}(y_i - B_0 - B_1 x_i) \quad \text{and} \quad \frac{d}{dB_1}\sum_{i=1}^{n} U_i^2 = -2\sum_{i=1}^{n} x_i (y_i - B_0 - B_1 x_i)$$
Setting these to zero gives the OLS normal equations:
$$\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \qquad (2.2.2)$$
Solving (2.2.2) simultaneously yields $b_1$ and $b_0$:
$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}$$
In general, estimators based on OLS are identical to the MLE if the
normality assumption holds (see Mood et al. (1974)). Further, OLS estimators have
a powerful property in comparison with other estimators linear in the dependent
variable, known as the Gauss-Markov theorem: if the errors have expectation zero
conditional on the independent variables, are uncorrelated, and have equal
variances, i.e. $Var(Y_i \mid X_i) = Var(U_i \mid X_i) = \sigma^2$, then the OLS
estimators will be unbiased estimators for $B_0$ and $B_1$ and are more precise
(less variable) than any other unbiased estimators belonging to the class of linear
functions of the response variable. In other words, among all linear unbiased
estimators the OLS estimators have the smallest dispersion in repeated samples at
fixed explanatory values; this property is well known as the best linear unbiased
estimator (BLUE) property.
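The closed-form solution above can be sketched directly (the simulated data and coefficient values are my own assumptions): generate $y = 1 + 3x$ plus noise and recover the coefficients with the formulas for $b_1$ and $b_0$.

```python
import random
import statistics

# Sketch: closed-form simple OLS,
#   b1 = (sum x_i y_i - n*xbar*ybar) / (sum x_i^2 - n*xbar^2)
#   b0 = ybar - b1*xbar,
# applied to data generated from y = 1 + 3x + N(0, 0.1) noise.
random.seed(5)
n = 500
xs = [i / n for i in range(n)]
ys = [1.0 + 3.0 * x + random.gauss(0, 0.1) for x in xs]

xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
     (sum(x * x for x in xs) - n * xbar * xbar)
b0 = ybar - b1 * xbar
print(round(b0, 1), round(b1, 1))  # near the true intercept 1 and slope 3
```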
2.3 Hypotheses Testing
A statistical hypothesis test is a method of making a statistical decision using
sample data; it is considered a key technique of statistical inference. The aim of
hypothesis testing is to use the information in the sample to guide us in accepting
or rejecting a doubtful hypothesis, called the null hypothesis $H_0$, against the
alternative hypothesis $H_1$.
2.3.1 Introduction to Hypotheses Testing
In fact, there are two types of hypotheses in the academic literature,
classified as:
1. Parametric hypotheses, which are concerned with one or more constraints imposed
upon the parameters of a certain distribution.
2. Non-parametric hypotheses, which are statements about the form of the cumulative
distribution function or probability function of the distribution from which the
sample is drawn.
Definition (2.2.1): The critical region $C(X_1, X_2, \dots, X_n)$ is a subset of
the sample space (where the sample space consists of all possible samples that can
be drawn from the population at a fixed sample size) for which $H_0$ is rejected.
Indeed, $C(X_1, X_2, \dots, X_n)$ plays a significant role in accepting or
rejecting the null hypothesis $H_0$.
Definition (2.2.2): A test statistic $T(X_1, X_2, \dots, X_n)$ is a rule or
procedure for deciding whether or not to reject the null hypothesis based on its
sampling distribution, so that the decision is to reject the null hypothesis iff:
$$T(X_1, X_2, \dots, X_n) \in C(X_1, X_2, \dots, X_n)$$
Typically, hypotheses can be classified as follows:
1. Simple hypothesis: the statistical hypothesis specifies the probability
distribution completely.
2. Composite hypothesis: the statistical hypothesis does not specify the
probability distribution completely.
It is noted that $H_0$ should be taken as a simple hypothesis to enable us to
derive the sampling distribution of the test statistic. Since accepting or
rejecting $H_0$ is based on the sample data instead of the whole population, the
decision can be affected by two kinds of errors:
1. Type I error $\alpha$: this error is committed when we reject $H_0$ although it
is correct; $\alpha$ is also called the level of significance:
$$P(T(X_1, X_2, \dots, X_n) \in C(X_1, X_2, \dots, X_n) \mid H_0 \text{ is correct}) = \alpha$$
2. Type II error $\beta$: this error is committed when we accept $H_0$ although
$H_1$ is correct; the complement of $\beta$ is called the power of the test,
$1-\beta$:
$$P(T(X_1, X_2, \dots, X_n) \notin C(X_1, X_2, \dots, X_n) \mid H_1 \text{ is correct}) = \beta$$
2.3.2 Tests Based on Likelihood Function
Mainly, we need a test statistic that keeps the two errors of the decision as
small as possible; unfortunately, with a fixed sample size, if one of the errors is
minimized the other is maximized, so there is a negative relation between the two
errors. To overcome this problem we can fix the more serious error, the type I
error, and search for the test that has the minimum type II error, i.e. the most
powerful test. Indeed, there are various approaches for deriving most powerful
tests (see Engle (1984)); this study is concerned with the likelihood ratio test
(LR), the Wald test (WT), and the Lagrange multiplier (score) test (LM).
1. Likelihood Ratio Test (LR)
This test was proposed by Marriott in 1990 (see Han 2002). The test operates
by obtaining the ratio between two likelihood functions, one evaluated under the
restricted parameter space and the other under the unrestricted parameter space.
Suppose we have $L(x_1, x_2, \dots, x_n;\theta)$ and it is required to test:
$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \ne \theta_0$$
Hence the likelihood ratio test will be:
$$LRT = \frac{\max_{H_0} L(x_1 \dots x_n;\theta)}{\max L(x_1 \dots x_n;\theta)} = \frac{L(x_1 \dots x_n;\theta_0)}{L(x_1 \dots x_n;\hat\theta)}$$
where $\theta$ is a $J \times 1$ vector of the tested parameters and $\hat\theta$
refers to the MLE of $\theta$, so that LRT lies between zero and one; therefore
large values of LRT are evidence of agreement between $\theta_0$ and the MLE,
which enables us to accept $H_0$, while small values of LRT guide us to reject
$H_0$. Because the sampling distribution of LRT is not well known, it is
recommended to use the following formula (see Engle (1984)):
$$LR = 2\big(\ln L(x_1 \dots x_n;\hat\theta) - \ln L(x_1 \dots x_n;\theta_0)\big)$$
It is proved that LR has a chi-square distribution with $J$ degrees of freedom for
large sample sizes, so one can reject $H_0$ if $LR \ge \chi^2_{\alpha, J}$.
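A worked illustration of the LR statistic above (my own sketch; the normal model with known unit variance is an assumed example, not the thesis's): for a $N(\mu, 1)$ sample the MLE is $\bar{x}$, and $LR = 2(\ln L(\bar{x}) - \ln L(\mu_0))$ reduces to $n(\bar{x}-\mu_0)^2$, compared against the chi-square(1) critical value 3.841 at the 5% level.

```python
import random

# Sketch: LR test of H0: mu = mu0 for N(mu, 1) with known variance.
# Up to a constant, ln L(mu) = -0.5 * sum((x - mu)^2), and the LR
# statistic 2*(lnL(xbar) - lnL(mu0)) equals n*(xbar - mu0)^2.
random.seed(6)

def lr_stat(sample, mu0=0.0):
    n = len(sample)
    xbar = sum(sample) / n
    loglik = lambda mu: -0.5 * sum((x - mu) ** 2 for x in sample)
    return 2.0 * (loglik(xbar) - loglik(mu0))

under_null = [random.gauss(0.0, 1.0) for _ in range(100)]
under_alt  = [random.gauss(0.7, 1.0) for _ in range(100)]
# Reject H0 when LR exceeds the chi-square(1) 5% critical value 3.841.
print(lr_stat(under_null) > 3.841, lr_stat(under_alt) > 3.841)
```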
2. Wald Test (WT)
Another test of the agreement between the null and alternative
hypotheses was proposed by Wald in 1943 (see Han 2002); the idea is based on
testing whether the distance between $\hat\theta$ and $\theta_0$ is significantly
large or not via the following formula:
$$WT_J = (\hat\theta - \theta_0)^t\, I(\hat\theta)\, (\hat\theta - \theta_0)$$
Since the MLE always has an asymptotic normal distribution, $WT_J$ works by
standardizing $\hat\theta$ so that it is approximately standard normally
distributed, then taking the square to obtain a limiting chi-square distribution
with $J$ degrees of freedom. Suppose it is required to test $\theta_h$ as a subset
of $\theta$, where:
$$\theta_h = \{\theta_1, \theta_2, \dots, \theta_h\}, \qquad h < J$$
First, partition $\theta$ as:
$$\theta = \{\theta_h, \theta_{J-h}\}$$
Second, partition the variance-covariance matrix of $\hat\theta$ as:
$$V(\hat\theta) = I^{-1}(\theta) = \begin{pmatrix} I^{-1}(\theta_h) & Cov(\theta_h, \theta_{J-h}) \\ Cov(\theta_{J-h}, \theta_h) & I^{-1}(\theta_{J-h}) \end{pmatrix}$$
where $Cov$ refers to the covariance matrix between the two vectors; hence $WT_h$
will be:
$$WT_h = (\hat\theta_h - \theta_{h0})^t\, \big[I^{-1}(\hat\theta_h)\big]^{-1} (\hat\theta_h - \theta_{h0})$$
Hence $WT_h$ has a limiting chi-square distribution with $h$ degrees of freedom.
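In the same assumed normal model used above (my own sketch, not the thesis's example), the Wald statistic is easy to compute: for $N(\mu, 1)$ with known variance, $\hat\mu = \bar{x}$ and $I(\mu) = n$, so $WT = (\bar{x} - \mu_0)\, n\, (\bar{x} - \mu_0) = n(\bar{x}-\mu_0)^2$, which coincides with the LR statistic in this simple case.

```python
import random

# Sketch: Wald test of H0: mu = mu0 for N(mu, 1) with known variance.
# The MLE is xbar and the Fisher information for mu from n observations
# is I = n, so WT = (xbar - mu0) * n * (xbar - mu0).
random.seed(7)

def wald_stat(sample, mu0=0.0):
    n = len(sample)
    xbar = sum(sample) / n
    info = n  # Fisher information for mu when sigma = 1
    return (xbar - mu0) * info * (xbar - mu0)

shifted = [random.gauss(0.8, 1.0) for _ in range(64)]
print(wald_stat(shifted) > 3.841)  # compare with chi-square(1) at 5%
```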
3. Lagrange Multiplier Test
In statistical inference there is a well-known test related to the Lagrange
multiplier (LM) for testing hypotheses concerning the parameters of the
distribution. Aitcheson and Silvey (1958) proposed the Lagrange multiplier test
(score test), which derives from restricted maximum likelihood estimation using a
Lagrange multiplier; therefore it is first required to explain briefly the method
of Lagrange multipliers, and then to discuss the score test.
In mathematical optimization, the method of Lagrange multipliers provides a
strategy for finding the maximum or minimum of an objective function subject to
constraints; this method is due to Joseph Louis Lagrange. Suppose it is required to
obtain the extreme values of the objective function:
$$f(x_1, x_2, \dots, x_n)$$
subject to
$$g_i(x_1, x_2, \dots, x_n) = 0, \qquad i = 1 \dots m$$
where $f(x_1, x_2, \dots, x_n)$ and $g_i(x_1, x_2, \dots, x_n)$ are differentiable
functions and $m < n$. This method requires forming the Lagrangian equation, then
differentiating the Lagrangian with respect to the $x$'s and the $\lambda$'s, where
the $\lambda$'s refer to the Lagrange multipliers; that yields $n+m$ equations.
Finally, solving the $n+m$ homogeneous equations in $n+m$ unknowns yields the
values of the $x$'s that represent the extreme values of $f(x_1, x_2, \dots, x_n)$
while at the same time satisfying the $m$ conditions; for more details see Thomas
(2005).
To recognize whether the solution represents a maximum, a minimum, or a saddle
point, it is required to calculate the determinant of the $n \times n$ Hessian
matrix, which has the following form:
$$Hess(x)_{ij} = \begin{cases} \dfrac{d^2 f(x_{1o}, x_{2o}, \dots, x_{no})}{dx_i^2} & \text{for } i = j \\[6pt] \dfrac{d^2 f(x_{1o}, x_{2o}, \dots, x_{no})}{dx_i\, dx_j} & \text{for } i \ne j \end{cases}$$
where $(x_{1o}, x_{2o}, \dots, x_{no})$ are the values of the $x$'s which satisfy
the $m$ conditions. Thus, via the determinant of the Hessian matrix one can
recognize whether the solution represents a maximum, a minimum, or a saddle point,
as follows:
1. $(x_{1o}, x_{2o}, \dots, x_{no})$ are classified as minimum values of
$f(x_1, x_2, \dots, x_n)$ iff $Hess > 0$.
2. $(x_{1o}, x_{2o}, \dots, x_{no})$ are classified as maximum values of
$f(x_1, x_2, \dots, x_n)$ iff $Hess < 0$.
3. $(x_{1o}, x_{2o}, \dots, x_{no})$ are classified as saddle points of
$f(x_1, x_2, \dots, x_n)$ iff $Hess = 0$.
In the calculus of variations there is a fundamental equation based on the
Lagrangian, known as the Euler-Lagrange equation. The Euler-Lagrange equation is
useful for solving an optimization problem in which the objective function is given
as a functional (a function of a function) and one seeks the function that
maximizes or minimizes it. To see this point, suppose it is required to seek the
$f(x)$ that maximizes the following functional:
$$F(f(x), f'(x), x)$$
where $f'(x)$ denotes the first derivative of $f(x)$ with respect to $x$. The
solution, without proof, will be according to Riley et al. (2006):
$$\frac{dF(f(x), f'(x), x)}{df(x)} = \frac{d}{dx}\left(\frac{dF(f(x), f'(x), x)}{df'(x)}\right)$$
If the functional does not contain $f'(x)$, the Euler-Lagrange equation becomes:
$$\frac{dF(f(x), f'(x), x)}{df(x)} = 0$$
An excellent example that makes the idea clearer is the principle of maximum
entropy method, which will be explained later.
The idea of LM is that $\ln L(x_1, \dots, x_n;\theta)$ is maximized subject to
the null hypothesis $\theta = \theta_0$; hence the Lagrangian function can be
expressed as:
$$Lagr(\theta, \lambda) = \ln L(x_1, \dots, x_n;\theta) - \lambda(\theta - \theta_0)$$
Differentiating $Lagr(\theta, \lambda)$ with respect to $\theta$ and $\lambda$ and
setting to zero yields:
$$\frac{dLagr(\theta, \lambda)}{d\theta} = \frac{d\ln L(x_1, \dots, x_n;\theta)}{d\theta} - \lambda = 0 \quad \text{and} \quad \frac{dLagr(\theta, \lambda)}{d\lambda} = -(\theta - \theta_0) = 0 \qquad (2.3.1)$$
One can solve (2.3.1) simultaneously by obtaining the derivative of
$\ln L(x_1, \dots, x_n;\theta)$ with respect to $\theta$, then substituting
$\theta = \theta_0$ into the derivative, which yields:
$$\frac{dLagr(\theta, \lambda)}{d\theta}\bigg|_{\theta=\theta_0} = \frac{d\ln L(x_1, \dots, x_n;\theta_0)}{d\theta} = \lambda \qquad (2.3.2)$$
Typically (2.3.2) is known as the score function $S(\theta_0)$. Since $\theta$ is
often unknown, it will be estimated by the MLE; hence a small value of
$S(\theta_0)$ indicates that $\theta_0$ is close to the MLE, and we accept the null
hypothesis; otherwise we reject $\theta_0$. Thus the score test measures the
distance between the tested value $\theta_0$ and the MLE by testing whether
$S(\theta_0)$ is significantly different from zero or not. Notice that under $H_0$
the mean and the variance of $S(\theta_0)$ are zero and the Fisher information
$I(\theta)$ respectively; thus LM can be written as:
$$LM = \frac{(S(\theta_0))^2}{I(\theta_0)}$$
Mainly, LM has a chi-square distribution with one degree of freedom for large
samples under the null hypothesis; for more details see Judge et al. (1982).
Suppose we have $J$ parameters and it is required to test them simultaneously; then
the LM test has the following form:
$$LM = (S(\theta_0))^t\, I^{-1}(\theta_0)\, S(\theta_0) \qquad (2.3.3)$$
where $S(\theta_o)$ refers to the score function of the vector $\theta_o$ and $I(\theta_o)^{-1}$ refers to the inverse of the information matrix of order $J \times J$, taking the following forms respectively:
$$S(\theta) = \left[\frac{dL(x_1,\dots,x_n;\theta)}{d\theta_j}\right]_{J\times 1}$$
and
$$I(\theta)_{J\times J} = \begin{cases} -E\left(\dfrac{d^2\ln L(x_1,\dots,x_n;\theta)}{d\theta_i^2}\right) & \text{for } i = j \\[2mm] -E\left(\dfrac{d^2\ln L(x_1,\dots,x_n;\theta)}{d\theta_i\,d\theta_j}\right) & \text{for } i \neq j \end{cases}$$
It is proven that (2.3.3) has a chi-square distribution with $J$ degrees of freedom (see Engle (1984)); further, an interesting relationship between the three tests can be represented geometrically when $\theta$ is one-dimensional, as follows:
Figure (1): The likelihood tests in one dimension
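To make the one-parameter score test concrete, here is a small numerical sketch (an illustration added here, not taken from the thesis) of the LM statistic for testing the mean of a Poisson sample, using the standard Poisson results $S(\theta_0) = \sum x_i/\theta_0 - n$ and $I(\theta_0) = n/\theta_0$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.poisson(lam=2.0, size=200)   # sample actually drawn with theta = 2
theta0 = 2.0                         # null value to test

# Score and total Fisher information for the Poisson mean (standard results)
score = x.sum() / theta0 - x.size    # S(theta0) = sum(x_i)/theta0 - n
info = x.size / theta0               # I(theta0) = n/theta0
LM = score**2 / info

# Under H0 the statistic is approximately chi-square with 1 degree of freedom
p_value = stats.chi2.sf(LM, df=1)
print(LM, p_value)
```

Since the sample is generated under the null, the statistic should typically be small relative to the chi-square(1) critical value.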
2.4 Measures of Information
A great variety of information measures have been proposed in the literature recently (see Esteban and Morales (1995)). Since Shannon (1948) was the first to write about measuring a sample's information, and made a huge contribution to the development of information theory, this section is concerned with Shannon's entropy.
2.4.1. Shannon Entropy and Related Measures
The origin of the entropy concept goes back to Ludwig Boltzmann (1877); it is a Greek notion meaning transformation. It was given a probabilistic interpretation in information theory by Shannon (1948), who considered entropy as an index of the uncertainty associated with a random variable, expressed in nats, where
nat (sometimes nit or nepit) is a unit of information or entropy, based on natural
logarithms.
Definition (2.4.1): Let there be $n$ events with probabilities $p_1, p_2, \dots, p_n$ adding up to 1. Shannon (1948) stated that the entropy corresponding to these events takes the following formula:
$$H(X) = -\sum_{i=1}^{n} p(x_i)\ln p(x_i) \qquad(2.4.1)$$
He claimed that via (2.4.1) one can transform the information in the sample from an invisible form into a numerical physical form, so that comparisons can easily be made and understood; further, it can be regarded as the variance for qualitative data.
Assume $n_1, n_2, \dots, n_k$ are the numbers of times each category occurs in an experiment of length $n$, where:
$$\sum_{i=1}^{k} n_i = n \quad\text{and}\quad p_i = \frac{n_i}{n}$$
Shannon (1948) mentioned that the number of all possible combinations that partition $n$ into $k$ categories of sizes $n_i$ can be an indicator of the accuracy of any decision associated with this sample (see Golan (1996) and Mack (1988)); one can present the number of all possible combinations as:
$$W = C^{n}_{n_1, n_2, \dots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} \qquad(2.4.2)$$
It is obvious that (2.4.2) is always greater than or equal to one and less than or equal to $\frac{n!}{((n/k)!)^k}$. If (2.4.2) equals one, this indicates that the sample has one category, which refers to maximum accuracy and minimum uncertainty. For more simplicity Shannon (1948) preferred to deal with the logarithm of $W$ as follows:
$$\ln(W) = \ln n! - \sum_{i=1}^{k}\ln n_i!$$
Using Stirling's approximation, which states:
$$\ln x! \approx x\ln x - x \quad\text{as } x \to \infty$$
$\ln(W)$ becomes:
$$\ln(W) \approx n\ln n - n - \sum_{i=1}^{k} n_i\ln n_i + \sum_{i=1}^{k} n_i$$
$$= n\ln n - \sum_{i=1}^{k} n_i\ln n_i$$
$$= n\ln n - \sum_{i=1}^{k} n p_i\ln(n p_i)$$
$$= n\ln n - \sum_{i=1}^{k} n p_i(\ln n + \ln p_i)$$
$$= n\ln n - \ln n\sum_{i=1}^{k} n p_i - n\sum_{i=1}^{k} p_i\ln p_i = -n\sum_{i=1}^{k} p_i\ln p_i$$
Therefore one can conclude:
$$\frac{1}{n}\ln(W) \approx -\sum_{i=1}^{k} p_i\ln p_i = H(p)$$
Typically Shannon's (1948) entropy can be regarded as a measure of the average accuracy associated with decisions about the sample. Equation (2.4.1), according to Shannon (1948), satisfies the following properties:
1. The quantity $H(X)$ reaches a minimum, equal to zero, when one of the events is a certainty, assuming $0\ln(0) = 0$, and $H(X)$ reaches its maximum when all the probabilities are equal; hence $H(X)$ can be regarded as a concave function. For instance, suppose an experiment has two outcomes; then the entropy curve is:
Figure (2): The curve of H(p) in one dimension
2. If some events have zero probability, they can just as well be left out of the entropy when we evaluate the uncertainty.
3. Entropy must be symmetric: it does not depend on the order of the probabilities.
For a continuous distribution (2.4.1) takes the following formula:
$$H(X) = -\int_{-\infty}^{\infty} f(x,\theta)\ln f(x,\theta)\,dx$$
One can easily notice that entropy for continuous variables satisfies Shannon's (1948) properties, but it can take negative values.
Definition (2.3.2): Joint entropy is a measure concerned with the uncertainty of two variables, taking the following formula:
$$H(X,Y) = -\sum_{i=1}^{n} p(x_i, y_i)\ln p(x_i, y_i)$$
It is obvious that:
$$H(X,Y) \le H(X) + H(Y)$$
According to Shannon (1948), the uncertainty of a joint event is less than or equal to the sum of the individual uncertainties, with equality only if the events are independent.
Definition (2.3.3): Mutual information measures the information that X and Y share, taking the following formula:
$$M(X,Y) = \sum_{i=1}^{n} p(x_i, y_i)\ln\frac{p(x_i, y_i)}{p(x_i)p(y_i)}$$
It is obvious that $M(X,Y) = 0$ if the two variables are independent.
Definition (2.3.4): Conditional entropy $H(X/Y)$ is a measure of what Y does not say about X, meaning how much information is in X but not in Y; it takes the following formula:
$$H(X/Y) = H(X,Y) - H(Y)$$
If the two variables are independent then the conditional entropy $H(X/Y)$ equals $H(X)$.
Remark: Definitions (2.3.2) to (2.3.4) can be extended to continuous variables by just replacing the summation symbol with the integration symbol. One can see that the measures of information are related as follows:
Venn diagram: relation between the information measures
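The relations above can be verified numerically. The following sketch (an illustration with a made-up joint distribution, not from the thesis) checks $H(X,Y) \le H(X) + H(Y)$, $H(X/Y) = H(X,Y) - H(Y)$, and the mutual information identity $M(X,Y) = H(X) + H(Y) - H(X,Y)$:

```python
import numpy as np

def H(p):
    """Entropy in nats of a probability array (zero entries ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A hypothetical joint pmf p(x, y) for two dependent binary variables
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)    # marginal distributions

H_xy = H(pxy)                    # joint entropy H(X,Y)
H_x, H_y = H(px), H(py)
H_x_given_y = H_xy - H_y         # conditional entropy H(X/Y) = H(X,Y) - H(Y)
M_xy = H_x + H_y - H_xy          # mutual information shared by X and Y
print(H_xy, H_x_given_y, M_xy)
```

Because the two variables are dependent here, the mutual information comes out strictly positive.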
2.4.2. Kullback-Leibler Divergence (Relative Entropy)
Definition (2.3.5): Kullback and Leibler (1951) introduced relative entropy, or information divergence, which measures the distance between two distributions of a
random variable. This information measure is also known as KL-entropy, taking the following formula:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\ln\frac{p(x_i)}{q(y_i)} \qquad(2.4.3)$$
where $p(x_i)$ and $q(y_i) > 0$. Typically (2.4.3) can be regarded as the relative entropy for using Y instead of X. Actually there is a relation between $KL(X/Y)$ and $H(X)$ as follows:
$$KL(X/Y) = \sum_{i=1}^{n} p(x_i)\ln p(x_i) - \sum_{i=1}^{n} p(x_i)\ln q(y_i) = -H(X) - \sum_{i=1}^{n} p(x_i)\ln q(y_i) \qquad(2.4.4)$$
Thus (2.4.4) can be considered a good tool for discrimination between two distributions (see Gohale (1983)). Indeed $KL(X/Y)$ has the following famous properties:
1. $KL(X/Y)$ is not symmetric:
$$KL(X/Y) \neq KL(Y/X)$$
2. $KL(X/Y)$ is a non-negative measure and it equals zero iff X and Y are identical:
$$KL(X/Y) \ge 0 \qquad(2.4.5)$$
According to Lue (2007), (2.3.5) can be studied using the following identity:
$$x\ln\left(\frac{x}{y}\right) \ge x - y \quad\text{for } x, y > 0 \qquad(2.4.6)$$
Hence, one can bound (2.4.3) according to (2.4.6) as:
$$\sum_{i=1}^{n} p(x_i)\ln\frac{p(x_i)}{q(y_i)} \ge \sum_{i=1}^{n} p(x_i) - \sum_{i=1}^{n} q(y_i) \quad\text{for } p(x_i), q(y_i) > 0$$
$$= 1 - \sum_{i=1}^{n} q(y_i) \ge 0$$
Thus one can conclude that $KL(X/Y) \ge 0$. Indeed KL can be applied when the variables are continuous by replacing the summation symbol with integration notation; furthermore all the properties remain valid. Therefore it is recommended in the literature to use $KL(X/Y)$ instead of $H(X)$ for continuous distributions (see Dukkipati (2006)).
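These two properties are easy to observe numerically. The following sketch (an illustration added here; `scipy.stats.entropy` with two arguments computes the Kullback-Leibler divergence in nats) checks non-negativity, the zero value for identical distributions, and the asymmetry:

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.5, 0.25])

# scipy.stats.entropy(p, q) computes KL(p/q) = sum p_i ln(p_i / q_i)
kl_pq = entropy(p, q)
kl_qp = entropy(q, p)
kl_pp = entropy(p, p)    # distance of a distribution from itself
print(kl_pq, kl_qp, kl_pp)
```

The two directed divergences differ, illustrating property 1, while the self-divergence is zero, illustrating property 2.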
Chapter (III)
Goodness of Fit Based on Maximum Entropy
Statistical distributions play a vital role in scientific research, since recognizing the probability distribution of the sample under study is the key in many situations. There are many goodness of fit tests proposed in the literature to test the hypothesis that the drawn sample has a specific distribution. This chapter is organized as follows: on one hand it discusses parameter estimation based on maximum entropy and estimation of the entropy function; on the other hand it uses these for fitting the distribution of the sample.
3.1 Parameter Estimation Based on Maximum Entropy
According to Jaynes (1957) the principle of maximum entropy (POME) is a relatively new estimation method and can be regarded as a flexible and powerful tool for estimating the probability distribution. Using the maximum entropy method one should pick the probability distribution of the specified sample which satisfies certain moments, represented in one constraint or more (typically the mean, variance, skewness, etc.), and at the same time maximizes the sampling entropy.
In the discrete case, estimating the probability distribution representing the sample by POME requires:
1. Define the entropy for the available data.
2. Define the given or prior information as some independent constraints.
3. Maximize the entropy function subject to the independent constraints.
In mathematical form, it is required to maximize:
$$H(X) = -\sum_{i=1}^{n} p(x_i)\ln p(x_i)$$
subject to the consistency constraints:
$$\sum_{i=1}^{n} p(x_i) = 1 \quad\text{and}\quad \sum_{i=1}^{n} g_j(x_i)p(x_i) = c_j, \quad j = 1,\dots,J$$
where the $c_j$ are constant numbers. Define the following Lagrangian function:
$$Lagr(p(x_i),\lambda) = -\sum_{i=1}^{n} p(x_i)\ln p(x_i) - (\lambda_o - 1)\left(\sum_{i=1}^{n} p(x_i) - 1\right) - \sum_{j=1}^{J}\lambda_j\left(\sum_{i=1}^{n} g_j(x_i)p(x_i) - c_j\right)$$
where $\lambda$ denotes the vector of Lagrange multipliers $(\lambda_o, \lambda_1, \dots, \lambda_J)$. Using differentiation we have:
$$\frac{dLagr(p(x_i),\lambda)}{dp(x_i)} = -\ln p(x_i) - \lambda_o - \sum_{j=1}^{J}\lambda_j g_j(x_i) = 0$$
Hence the mass function of maximum entropy will be:
$$p(x_i,\lambda) = \exp\left(-\lambda_o - \sum_{j=1}^{J}\lambda_j g_j(x_i)\right) \qquad(3.1.1)$$
It is easy to check that the general solution (3.1.1) gives the maximum entropy. To make the idea more obvious, according to Paul (2003), suppose a restaurant has three meals {C, D, E} priced {$1, $2, $3} respectively, and we have the information that the customer spends on average $1.5 per meal. The probability that the customer will demand each meal is computed via POME as follows:
1. Define the entropy of the sample:
$$H(x) = -\sum_{i=1}^{3} p(x_i)\ln p(x_i)$$
where $x_i$ represents the price of meal $i$.
2. Define the given or prior information as independent constraints:
$$\sum_{i=1}^{3} p(x_i) = 1 \quad\text{and}\quad \sum_{i=1}^{3} x_i p(x_i) = 1.5$$
3. Maximize the entropy function subject to the two independent constraints, using the Lagrangian function as follows:
$$Lagr(p(x_i),\lambda) = -\sum_{i=1}^{3} p(x_i)\ln p(x_i) - (\lambda_o - 1)\left(\sum_{i=1}^{3} p(x_i) - 1\right) - \lambda_1\left(\sum_{i=1}^{3} x_i p(x_i) - 1.5\right) \qquad(3.1.2)$$
Differentiating (3.1.2) with respect to $p(x_i)$ and equating to zero yields:
$$p(x_i,\lambda) = \exp(-\lambda_o - \lambda_1 x_i) \qquad(3.1.3)$$
To estimate the probabilities of the meals, substitute (3.1.3) into the two independent constraints:
$$\exp(-\lambda_o - \lambda_1) + \exp(-\lambda_o - 2\lambda_1) + \exp(-\lambda_o - 3\lambda_1) = 1$$
and
$$\exp(-\lambda_o - \lambda_1) + 2\exp(-\lambda_o - 2\lambda_1) + 3\exp(-\lambda_o - 3\lambda_1) = 1.5$$
Solving the previous system simultaneously gives:
$$\lambda_o = -0.35, \quad \lambda_1 = 0.834 \qquad(3.1.4)$$
Substituting (3.1.4) into (3.1.3) yields:
$$p(x_1) = 0.615, \quad p(x_2) = 0.268, \quad p(x_3) = 0.116$$
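The two-equation system can also be solved numerically. The following sketch (my own check with `scipy.optimize.fsolve`, not the hand computation of the thesis) reproduces the meal probabilities:

```python
import numpy as np
from scipy.optimize import fsolve

prices = np.array([1.0, 2.0, 3.0])

def constraints(lams):
    """Normalization and mean constraints for p(x_i) = exp(-lam0 - lam1*x_i)."""
    lam0, lam1 = lams
    p = np.exp(-lam0 - lam1 * prices)
    return [p.sum() - 1.0, (prices * p).sum() - 1.5]

lam0, lam1 = fsolve(constraints, x0=[0.0, 0.0])
probs = np.exp(-lam0 - lam1 * prices)
print(lam0, lam1, probs)
```

The solver recovers probabilities close to (0.616, 0.268, 0.116), matching the closed-form result above.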
Similarly, for a continuous distribution it is required to obtain the $f(x,\theta)$ that maximizes the following entropy (objective) function:
$$H(X) = -\int_{-\infty}^{\infty} f(x,\theta)\ln f(x,\theta)\,dx$$
subject to
$$\int_{-\infty}^{\infty} f(x,\theta)\,dx = 1 \quad\text{and}\quad \int_{-\infty}^{\infty} g_j(x) f(x,\theta)\,dx = m_j, \quad j = 1,\dots,J \qquad(3.1.5)$$
where $f(x,\theta)$ satisfies the regularity conditions. To optimize the entropy function subject to the conditions in (3.1.5), the Lagrangian function will be:
$$Lagr(x,\lambda,\theta) = -\int_{-\infty}^{\infty} f(x,\theta)\ln f(x,\theta)\,dx - (\lambda_o - 1)\left(\int_{-\infty}^{\infty} f(x,\theta)\,dx - 1\right) - \sum_{j=1}^{J}\lambda_j\left(\int_{-\infty}^{\infty} g_j(x) f(x,\theta)\,dx - m_j\right)$$
$$= \int_{-\infty}^{\infty}\left\{-f(x,\theta)\ln f(x,\theta) - (\lambda_o - 1) f(x,\theta) - \sum_{j=1}^{J}\lambda_j g_j(x) f(x,\theta)\right\}dx + (\lambda_o - 1) + \sum_{j=1}^{J}\lambda_j m_j \qquad(3.1.6)$$
One can see that (3.1.6) is a functional; therefore differentiating (3.1.6) with respect to $f(x,\theta)$ using the Euler-Lagrange equation yields, according to Lue (2007):
$$-\ln f(x,\theta) - \lambda_o - \sum_{j=1}^{J}\lambda_j g_j(x) = 0$$
Hence the maximum entropy density will be:
$$f(x,\theta,\lambda) = \exp\left(-\lambda_0 - \sum_{j=1}^{J}\lambda_j g_j(x)\right) \qquad(3.1.7)$$
where $\lambda_0$ is called the normalizing term and is related to the other Lagrange multipliers by the following formula:
$$\lambda_0 = \ln\left(\int_{-\infty}^{\infty}\exp\left(-\sum_{j=1}^{J}\lambda_j g_j(x)\right)dx\right) \qquad(3.1.8)$$
Also, (3.1.7) is valid for any type of moment form. To make the idea more obvious (see Radriguez (1984)), suppose it is required to search for the p.d.f. that represents the sample given that its variance equals two; hence the Lagrangian equation will be:
$$Lagr(x,\theta,\lambda) = -\int_{-\infty}^{\infty} f(x,\theta)\ln f(x,\theta)\,dx - (\lambda_o - 1)\left(\int_{-\infty}^{\infty} f(x,\theta)\,dx - 1\right) - \lambda_1\left(\int_{-\infty}^{\infty}(x-\theta_1)^2 f(x,\theta)\,dx - 2\right)$$
where $\theta$ is the vector of the p.d.f. parameters and $\theta_1$ refers to the mean of the sample. Using (3.1.7) the maximum entropy density has the following formula:
$$f(x,\theta,\lambda) = \exp(-\lambda_0 - \lambda_1(x-\theta_1)^2) \qquad(3.1.9)$$
where:
$$\lambda_0 = \ln\left(\int_{-\infty}^{\infty}\exp(-\lambda_1(x-\theta_1)^2)\,dx\right) = \ln\left(\sqrt{\frac{\pi}{\lambda_1}}\int_{-\infty}^{\infty}\sqrt{\frac{\lambda_1}{\pi}}\exp(-\lambda_1(x-\theta_1)^2)\,dx\right) = \ln\sqrt{\frac{\pi}{\lambda_1}} + \ln\left(\int_{-\infty}^{\infty}\sqrt{\frac{\lambda_1}{\pi}}\exp(-\lambda_1(x-\theta_1)^2)\,dx\right) \qquad(3.1.10)$$
The second term of (3.1.10) is the integral of a normal density, so it vanishes and:
$$\lambda_0 = \ln\sqrt{\frac{\pi}{\lambda_1}}$$
Substituting $\lambda_0$ in (3.1.9) gives:
$$f(x,\theta,\lambda) = \sqrt{\frac{\lambda_1}{\pi}}\exp(-\lambda_1(x-\theta_1)^2) \qquad(3.1.11)$$
Actually (3.1.11) belongs to the normal distribution; hence the normal distribution has maximum entropy among all distributions subject to a fixed variance.
Singh and Rajagopal (1986) proposed a new approach for estimating the parameters of a probability density via the principle of maximum entropy (POME); in addition Singh et al. (1986) applied POME to various continuous distributions. Their idea generally consists of three steps, summarized as:
1. Transforming the probability density function into a function of the Lagrange multipliers instead of a function of the parameters of the distribution.
2. Estimating the Lagrange multipliers.
3. Recognizing the relation between the Lagrange multipliers and the parameters of the distribution.
Note that transforming the probability density function into a function of the Lagrange multipliers can be carried out by inserting the probability density function's raw moments into (3.1.5).
Estimating the Lagrange multipliers $\lambda_j$ can be done in two ways. First, one can insert the maximum entropy density into (3.1.5), which yields $J+1$ nonlinear equations in $J+1$ unknowns, and then solve numerically to reach the suitable solution (see Zellner et al. (1988) and Wu (2003)). The second way is transforming the constrained optimization problem into an unconstrained optimization problem using the dual approach (see Golan et al. (1996)); this idea can be summarized as:
a) Substitute (3.1.7) into the objective function, so it becomes:
$$H_d(\lambda) = \int_{-\infty}^{\infty} f(x,\lambda)\left\{\lambda_o + \sum_{j=1}^{J}\lambda_j g_j(x)\right\}dx = \lambda_0 + \sum_{j=1}^{J}\lambda_j\int_{-\infty}^{\infty} g_j(x) f(x,\lambda)\,dx$$
Using (3.1.5) this yields:
$$H_d(\lambda) = \lambda_0 + \sum_{j=1}^{J}\lambda_j m_j \qquad(3.1.12)$$
b) The objective function (3.1.12) now relies only on the Lagrange multipliers; since it has an inverse relation with the constrained objective function, maximizing the entropy requires minimizing (3.1.12). Obtaining the derivative with respect to $\lambda_j$ to satisfy the first-order condition:
$$\frac{dH_d}{d\lambda_j} = \frac{d\lambda_0}{d\lambda_j} + m_j = \frac{d}{d\lambda_j}\ln\left(\int_{-\infty}^{\infty}\exp\left(-\sum_{j=1}^{J}\lambda_j g_j(x)\right)dx\right) + m_j$$
$$= -\frac{\int_{-\infty}^{\infty} g_j(x)\exp(-\sum_{j=1}^{J}\lambda_j g_j(x))\,dx}{\int_{-\infty}^{\infty}\exp(-\sum_{j=1}^{J}\lambda_j g_j(x))\,dx} + m_j$$
Using (3.1.8) we have:
$$\frac{dH_d}{d\lambda_j} = -\int_{-\infty}^{\infty} g_j(x)\exp\left(-\lambda_0 - \sum_{j=1}^{J}\lambda_j g_j(x)\right)dx + m_j = -\int_{-\infty}^{\infty} g_j(x) f(x,\lambda)\,dx + m_j \qquad(3.1.13)$$
To ensure that the estimated $\hat\lambda$'s are the minimum values of the dual entropy, one should evaluate the second derivative of (3.1.13) as follows:
$$\frac{d^2 H_d}{d\lambda_j d\lambda_i} = \frac{d}{d\lambda_i}\left(-\int_{-\infty}^{\infty} g_j(x) f(x,\lambda)\,dx + m_j\right) = -\frac{d}{d\lambda_i}\int_{-\infty}^{\infty} g_j(x)\exp\left(-\lambda_0 - \sum_{j=1}^{J}\lambda_j g_j(x)\right)dx$$
$$= \int_{-\infty}^{\infty} g_j(x) g_i(x) f(x,\lambda)\,dx - \int_{-\infty}^{\infty} g_j(x) f(x,\lambda)\,dx\int_{-\infty}^{\infty} g_i(x) f(x,\lambda)\,dx$$
$$= E(g_i(x) g_j(x)) - E(g_i(x))E(g_j(x)), \quad 1 \le i, j \le J \qquad(3.1.14)$$
The second derivative (3.1.14) is a square matrix of order $J$ (known as the Hessian matrix); it is a variance-covariance matrix, which is everywhere positive definite, thus the $\hat\lambda$'s can be regarded as the minimum values of the dual entropy.
The most serious step during estimation by POME is recognizing the relation between the estimated Lagrange multipliers and the parameters of the distribution. Generally it is required to compute (3.1.8), insert it into the maximum entropy density (3.1.7), and finally make a comparison between (3.1.7) and the original probability density. To make this idea simpler, let $X_1, X_2, \dots, X_n$ be a random sample of size $n$ generated from a normal distribution with $(\mu, \sigma^2)$. It is clear that the entropy function corresponding to the normal distribution will be:
$$H(x) = -\int_{-\infty}^{\infty} f(x,\theta)\ln f(x,\theta)\,dx = -\int_{-\infty}^{\infty} f(x,\theta)\ln\left(\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{.5(x-\mu)^2}{\sigma^2}\right)\right)dx$$
$$= -\int_{-\infty}^{\infty} f(x,\theta)\ln A\,dx + B\int_{-\infty}^{\infty} f(x,\theta)(x-\mu)^2\,dx$$
$$= -\ln(A) + B\left\{\int_{-\infty}^{\infty} x^2 f(x,\theta)\,dx - 2\mu\int_{-\infty}^{\infty} x f(x,\theta)\,dx + \mu^2\right\}$$
$$= -\ln(A) + B\{E(x^2) - 2\mu E(x) + \mu^2\}$$
where $A = \frac{1}{\sigma\sqrt{2\pi}}$ and $B = \frac{1}{2\sigma^2}$. According to Singh et al. (1986) the sufficient constraints for estimating $(\mu, \sigma^2)$ are represented in $\{E(x), E(x^2)\}$; hence, to transform the density function into a function of the Lagrange multipliers, one should maximize the following entropy function:
$$H(X) = -\int_{-\infty}^{\infty} f(x,\mu,\sigma^2)\ln f(x,\mu,\sigma^2)\,dx$$
subject to:
$$\int_{-\infty}^{\infty} f(x,\mu,\sigma^2)\,dx = 1 \quad\text{and}\quad \int_{-\infty}^{\infty} x^j f(x,\mu,\sigma^2)\,dx = m_j, \quad j = 1, 2 \qquad(3.1.15)$$
It was proved as the general solution (3.1.7) that:
$$f(x,\lambda) = \exp(-\lambda_0 - \lambda_1 x - \lambda_2 x^2) \qquad(3.1.16)$$
Estimating the $\lambda_j$ can be done in two ways: the first is based on inserting (3.1.16) into (3.1.15), which yields three nonlinear equations in three unknowns, solved by any numerical approach; the second is based on transforming the constrained optimization into an unconstrained optimization via the dual approach. To obtain the parameters of the distribution, first it is required to obtain $\lambda_0$ as follows:
$$\lambda_0 = \ln\left(\int_{-\infty}^{\infty}\exp(-\lambda_1 x - \lambda_2 x^2)\,dx\right) \qquad(3.1.17)$$
Actually the integrand of (3.1.17) is close to a normal density; according to Singh et al. (1986) the solution will be:
$$\lambda_0 = .5\ln\pi - .5\ln\lambda_2 + \frac{\lambda_1^2}{4\lambda_2} \qquad(3.1.18)$$
Substituting (3.1.18) into (3.1.16) yields:
$$f(x,\lambda) = \exp\left(-.5\ln\pi + .5\ln\lambda_2 - \frac{\lambda_1^2}{4\lambda_2} - \lambda_1 x - \lambda_2 x^2\right) = \sqrt{\frac{\lambda_2}{\pi}}\exp\left(-\frac{\lambda_1^2}{4\lambda_2} - \lambda_1 x - \lambda_2 x^2\right) = \sqrt{\frac{\lambda_2}{\pi}}\exp\left(-\lambda_2\left(x + \frac{\lambda_1}{2\lambda_2}\right)^2\right) \qquad(3.1.19)$$
It is easily seen that equation (3.1.19) is a normal density with mean $-\frac{\lambda_1}{2\lambda_2}$ and variance $\frac{1}{2\lambda_2}$, so that:
$$\mu = -\frac{\lambda_1}{2\lambda_2} \quad\text{and}\quad \sigma^2 = \frac{1}{2\lambda_2} \qquad(3.1.20)$$
In addition, to make the calculations easier, inserting (3.1.19) into (3.1.15) yields:
$$\int_{-\infty}^{\infty} f(x,\lambda)\,dx = 1, \quad \int_{-\infty}^{\infty} x f(x,\lambda)\,dx = m_1 \quad\text{and}\quad \int_{-\infty}^{\infty} x^2 f(x,\lambda)\,dx = m_2$$
In the light of (3.1.20) the constraints convert to:
$$\int_{-\infty}^{\infty} f(x,\lambda)\,dx = 1, \quad -\frac{\lambda_1}{2\lambda_2} = m_1 \quad\text{and}\quad \frac{1}{2\lambda_2} + \left(\frac{\lambda_1}{2\lambda_2}\right)^2 = m_2$$
Hence $\lambda_1$ and $\lambda_2$ have the closed forms:
$$\lambda_1 = -\frac{m_1}{m_2'} \quad\text{and}\quad \lambda_2 = \frac{1}{2 m_2'}, \quad\text{where } m_2' = m_2 - m_1^2 \qquad(3.1.21)$$
Stengos and Wu (2004, 2007) proved that there is an equivalence between the maximum entropy density $f(x,\lambda)$ and the original p.d.f. if it is a member of the exponential family:
$$f(x,\theta) = \exp\left(\ln(a(\theta)) + \ln(b(x)) + \sum_{j=1}^{J} c_j(\theta) g_j(x)\right) \qquad(3.1.22)$$
Comparing (3.1.22) with (3.1.7), one concludes that $-\lambda_o$ corresponds to $\ln(a(\theta))$, $-\lambda_j g_j(x)$ corresponds to $c_j(\theta) g_j(x)$, and $b(x)$ is one. Due to this symmetric relation they concluded that, as long as the density belongs to the exponential family, the parameter estimators based on either MLE or POME will be identical, as follows:
$$\ln L(x_1,\dots,x_n;\theta) = \sum_{i=1}^{n}\ln f(x_i,\theta) = \sum_{i=1}^{n}\ln\left(\exp\left(-\lambda_o - \sum_{j=1}^{J}\lambda_j g_j(x_i)\right)\right) = -n\lambda_o - n\sum_{j=1}^{J}\lambda_j m_j = -n H_d(\lambda)$$
Hence the values that maximize the likelihood function are the same as the values that minimize the dual entropy, and are equivalent to maximizing the constrained entropy function, as long as the distribution belongs to the exponential family; due to this relation one can conclude the uniqueness of the maximum entropy estimates in this case.
Singh (1996) applied the maximum entropy approach to distributions where the regularity conditions do not hold, concluding by Monte Carlo simulation that maximum entropy yielded the least parameter bias for all sample sizes compared to other methods of estimation such as probability weighted moments, maximum likelihood and the method of moments, so that overall maximum entropy offers an alternative method for estimating the parameters of frequency distributions.
3.2 Entropy Estimation Using Sampling m-Spacing
Probability density function estimation has been addressed in the statistical literature by a variety of nonparametric methods (see Beirlant et al. (1997)). This section is concerned with density estimation based on m-spacing.
3.2.1 Entropy Estimation Using Vasicek's Estimator
Let $X_1, X_2, \dots, X_n$ be random variables of size $n \ge 3$, and let $X_{(1)}, X_{(2)}, \dots, X_{(n)}$ denote the corresponding order statistics. The sample entropy can be defined as:
$$H(x) = -E(\ln f(x)) = -E\left(\ln\frac{dF(x)}{dx}\right) \approx -E\left(\ln\frac{dF_n(x_{(i)})}{dx}\right)$$
where $F_n(x)$ is the following empirical distribution function:
$$F_n(x) = \frac{\text{number of observations} \le x}{n}$$
According to Mood et al. (1976) it is proved that:
$$P\left(\sup_{x}|F_n(x) - F(x)| \underset{n\to\infty}{\longrightarrow} 0\right) = 1$$
Vasicek (1976) estimated the slope of the cumulative function by replacing the cumulative function with the empirical distribution function and the differential operator with the difference operator (see Mao (2001)); therefore the slope takes the following formula:
$$\frac{\widehat{dF(x)}}{dx} = \frac{F_n(x_{(i+m)}) - F_n(x_{(i-m)})}{x_{(i+m)} - x_{(i-m)}} = \frac{\frac{i+m}{n} - \frac{i-m}{n}}{x_{(i+m)} - x_{(i-m)}} = \frac{2m}{n(x_{(i+m)} - x_{(i-m)})} \qquad(3.2.1)$$
where $m$ is a positive integer chosen by the user, well known as the window size. Choosing $m$ is a serious problem; typically it is recommended to pick the $m$ that gives the least mean square error (MSE) for each sample size. Hence, substituting (3.2.1) into $H(x)$ yields Vasicek's (1976) entropy estimator:
$$H(x)_{vas} = \frac{1}{n}\sum_{i=1}^{n}\ln\left(\frac{n}{2m}(x_{(i+m)} - x_{(i-m)})\right)$$
where $x_{(i-m)} = x_{(1)}$ for $i \le m$ and $x_{(i+m)} = x_{(n)}$ for $i > n - m$. Indeed (3.2.1) can be considered an application of the mean value theorem: if $f(x)$ is continuous on the closed interval $[a,b]$ and differentiable inside the interval, then there exists $c$ in the open interval $(a,b)$ such that:
$$\frac{df(c)}{dx} = \frac{f(b) - f(a)}{b - a}$$
Because $H(x)_{vas}$ always has a boundary bias when the estimation is based on spacings with $i \le m$ or $i > n - m$, Vasicek (1976) recommended that $m$ should be less than $\frac{n}{2}$ to reduce the boundary bias; further, he proved that $H(x)_{vas}$ is an asymptotically unbiased estimator for $H(x)$ when $n \to \infty$, $m \to \infty$ and $\frac{m}{n} \to 0$. To see this point it is required to decompose $H(x)_{vas}$ into three parts in order to study its behavior, as follows:
$$H(x)_{vas} = -\frac{1}{n}\sum_{i=1}^{n}\ln f(x_i) - U_{mn} - V_{mn} \qquad(3.2.2)$$
where:
$$U_{mn} = -\frac{1}{n}\sum_{i=1}^{n}\ln\left\{\frac{n}{2m}\left(F(X_{(i+m)}) - F(X_{(i-m)})\right)\right\} \quad\text{and}\quad V_{mn} = \frac{1}{n}\sum_{i=1}^{n}\ln\left\{\frac{F(X_{(i+m)}) - F(X_{(i-m)})}{f(x_i)(X_{(i+m)} - X_{(i-m)})}\right\}$$
To double check (3.2.2):
$$-\frac{1}{n}\sum_{i=1}^{n}\ln f(x_i) - U_{mn} - V_{mn} = -\frac{1}{n}\sum_{i=1}^{n}\ln f(x_i) + \frac{1}{n}\sum_{i=1}^{n}\ln\left\{\frac{n}{2m}\left(F(X_{(i+m)}) - F(X_{(i-m)})\right)\right\} - \frac{1}{n}\sum_{i=1}^{n}\ln\left\{\frac{F(X_{(i+m)}) - F(X_{(i-m)})}{f(x_i)(X_{(i+m)} - X_{(i-m)})}\right\}$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left\{-\ln f(x_i) + \ln\frac{n}{2m} + \ln\left(F(X_{(i+m)}) - F(X_{(i-m)})\right) - \ln\left(F(X_{(i+m)}) - F(X_{(i-m)})\right) + \ln f(x_i) + \ln(X_{(i+m)} - X_{(i-m)})\right\}$$
$$= \frac{1}{n}\sum_{i=1}^{n}\ln\left(\frac{n}{2m}(X_{(i+m)} - X_{(i-m)})\right) \qquad(3.2.3)$$
It is clear that (3.2.3) equals (3.2.2); hence, to study the behavior of $H(x)_{vas}$, it is enough to study separately the properties of the three parts in (3.2.2). First, the expected value of $-\frac{1}{n}\sum_{i=1}^{n}\ln f(x_i)$ equals $H(x)$ at large sample sizes by the law of large numbers.
Definition (3.2.1): The weak law of large numbers states that if $n > \frac{\sigma^2}{\varepsilon^2\eta}$ then the sample average is close to the population average:
$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n} y_i - \mu\right| < \varepsilon\right) \ge 1 - \eta$$
where $\mu$ and $\sigma^2$ refer to the mean and the variance of the population respectively, and $(\varepsilon, \eta)$ are any two specified numbers satisfying $\varepsilon > 0$ and $0 < \eta < 1$; for more details see Mood et al. (1976).
So if we denote $y_i = \ln f(x_i)$, then according to definition (3.2.1), for large sample sizes the absolute difference between $\bar{y}$ and $E(y_i)$ will be small with high probability. The other two parts represent the sources of noise; fortunately it is proven under some conditions that the two parts approximately vanish. For a fixed sample size the effect of $V_{mn}$ decreases with decreasing values of $m$, since for any interval $(x_{(i-m)}, x_{(i+m)})$ there exists $x' \in (x_{(i-m)}, x_{(i+m)})$ such that:
$$\frac{F(x_{(i+m)}) - F(x_{(i-m)})}{x_{(i+m)} - x_{(i-m)}} = f(x')$$
Therefore decreasing the window size decreases the effect of $V_{mn}$, as follows:
$$V_{mn} = \frac{1}{n}\sum_{i=1}^{n}\ln\left\{\frac{F(X_{(i+m)}) - F(X_{(i-m)})}{f(x_i)(X_{(i+m)} - X_{(i-m)})}\right\} = \frac{1}{n}\sum_{i=1}^{n}\ln\frac{f(x')}{f(x_i)} \to 0$$
Also, $E(U_{mn})$ can be written as:
$$E(U_{mn}) = -E\left(\frac{1}{n}\sum_{i=1}^{n}\ln\left\{\frac{n}{2m}\left(F(X_{(i+m)}) - F(X_{(i-m)})\right)\right\}\right) = -\frac{1}{n}\sum_{i=1}^{n}E\ln\left\{F(X_{(i+m)}) - F(X_{(i-m)})\right\} - \ln(n) + \ln(2m)$$
$$= -\frac{1}{n}\sum_{i=1}^{m}E\ln\left\{F(X_{(i+m)}) - F(X_{(1)})\right\} - \frac{1}{n}\sum_{i=m+1}^{n-m}E\ln\left\{F(X_{(i+m)}) - F(X_{(i-m)})\right\} - \frac{1}{n}\sum_{i=n-m+1}^{n}E\ln\left\{F(X_{(n)}) - F(X_{(i-m)})\right\} - \ln(n) + \ln(2m)$$
Suppose $h_{(i)} = F(X_{(i)})$; since $h_{(i)}$ has a uniform (0,1) distribution (see Mood et al. (1976)), the joint distribution of $(h_{(i)}, h_{(i+j)})$ takes:
$$f(h_{(i)}, h_{(i+j)}) = \frac{n!\, h_{(i)}^{i-1}(h_{(i+j)} - h_{(i)})^{j-1}(1 - h_{(i+j)})^{n-i-j}}{(i-1)!(j-1)!(n-i-j)!}, \quad 0 \le h_{(i)} \le h_{(i+j)} \le 1$$
Recognizing the p.d.f. of $c_{(o)} = h_{(i+j)} - h_{(i)}$ requires obtaining the joint distribution of $c_{(o)}$ and $h_{(i)}$, which yields:
$$f(c_{(o)}, h_{(i)}) = \frac{n!\, h_{(i)}^{i-1} c_{(o)}^{j-1}(1 - c_{(o)} - h_{(i)})^{n-i-j}}{(i-1)!(j-1)!(n-i-j)!}, \quad 0 \le c_{(o)} \le 1$$
Hence the marginal distribution of $c_{(o)}$ is:
$$f(c_{(o)}) = \frac{n!\, c_{(o)}^{j-1}}{(i-1)!(j-1)!(n-i-j)!}\int_{0}^{1-c_{(o)}} h_{(i)}^{i-1}(1 - c_{(o)} - h_{(i)})^{n-i-j}\,dh_{(i)}$$
Using the binomial expansion and taking the following identity into consideration:
$$\sum_{t=0}^{n-i-j}\frac{(-1)^t (n-i-j)!}{t!\,(n-i-j-t)!\,(t+i)} = \frac{(i-1)!\,(n-i-j)!}{(n-j)!}$$
it follows that $c_{(o)}$ has a Beta$(j, n-j+1)$ distribution. Hence $E(\ln c_{(o)})$, according to Kendall and Stuart (1969), can be computed by first obtaining $E(\ln(1-c_{(o)}))$. Starting from:
$$\int_{0}^{1} c_{(o)}^{j-1}(1-c_{(o)})^{n-j}\,dc_{(o)} = B(j, n-j+1) \qquad(3.2.4)$$
obtaining the derivative of (3.2.4) with respect to $n$ gives:
$$\int_{0}^{1} c_{(o)}^{j-1}(1-c_{(o)})^{n-j}\ln(1-c_{(o)})\,dc_{(o)} = \frac{dB(j, n-j+1)}{dn}$$
Hence $E(\ln(1-c_{(o)}))$ will be:
$$E(\ln(1-c_{(o)})) = \frac{1}{B(j, n-j+1)}\frac{dB(j, n-j+1)}{dn} = \frac{d\ln B(j, n-j+1)}{dn} = \psi(n-j+1) - \psi(n+1)$$
where $\psi(x)$ is the digamma function, which has the following formula:
$$\psi(x) = \Gamma'(x)/\Gamma(x)$$
Obtaining $E(\ln c_{(o)})$ requires calculating the derivative of (3.2.4) with respect to $j$:
$$\int_{0}^{1} c_{(o)}^{j-1}(1-c_{(o)})^{n-j}[\ln(c_{(o)}) - \ln(1-c_{(o)})]\,dc_{(o)} = \frac{dB(j, n-j+1)}{dj} \qquad(3.2.5)$$
Hence (3.2.5) gives:
$$E(\ln c_{(o)}) = E(\ln(1-c_{(o)})) + \frac{1}{B(j, n-j+1)}\frac{dB(j, n-j+1)}{dj} = \psi(n-j+1) - \psi(n+1) + \psi(j) - \psi(n-j+1) = \psi(j) - \psi(n+1)$$
Hence $E(U_{mn})$ can be computed as:
$$E(U_{mn}) = -\frac{1}{n}\sum_{i=1}^{m}\{\psi(i+m-1) - \psi(n+1)\} - \frac{1}{n}\sum_{i=m+1}^{n-m}\{\psi(2m) - \psi(n+1)\} - \frac{1}{n}\sum_{i=n-m+1}^{n}\{\psi(n+m-i) - \psi(n+1)\} - \ln(n) + \ln(2m)$$
Arranging the terms:
$$E(U_{mn}) = \psi(n+1) - \ln(n) + \ln(2m) - \left(1 - \frac{2m}{n}\right)\psi(2m) - \frac{2}{n}\sum_{i=1}^{m}\psi(i+m-1)$$
Using the fact that for large $x$ (see Pardo (2003)):
$$\psi(x) \approx \ln(x) - \frac{1}{2x} \qquad(3.2.6)$$
Taking (3.2.6) into view, $E(U_{mn})$ will be:
$$E(U_{mn}) \approx \frac{1}{2n} + \frac{2m}{n}\ln(2m) + \left(1 - \frac{2m}{n}\right)\frac{1}{4m} - \frac{2}{n}\sum_{i=1}^{m}\psi(i+m-1) \qquad(3.2.7)$$
According to (3.2.7), $E(U_{mn})$ tends to zero under the assumptions $n \to \infty$, $m \to \infty$ and $\frac{m}{n} \to 0$, so that Vasicek (1976) concluded that under these conditions $H(x)_{vas}$ is an asymptotically unbiased estimator of $H(x)$. Unfortunately there are some problems in this proof, which were treated by Song (2000).
3.2.2 Entropy Estimation Using Correa's Estimator
Correa (1995) gave another estimator of entropy based on m-spacing. He claimed that his estimator can be regarded as a modification of Vasicek's (1976) estimator, and concluded by simulation that it has a smaller mean square error than Vasicek's (1976) estimator. His idea is based on estimating $\frac{dF(x_{(i)})}{dx_{(i)}}$ over the interval $(x_{(i-m)}, x_{(i+m)})$; but instead of taking only the upper and lower ends of the interval $(x_{(i-m)}, x_{(i+m)})$, he used all $2m+1$ points in the interval by applying ordinary least squares (OLS) to the following model:
$$F_n(x_{(j)}) = B_{0i} + B_{1i} x_{(j)} + \epsilon_{ij}, \quad j = i-m,\dots,i+m, \quad i = 1,\dots,n$$
with $x_{(j)} = x_{(1)}$ when $j < 1$ and $x_{(j)} = x_{(n)}$ when $j > n$, where $B_{1i}$ can be regarded as the slope of the empirical distribution function $F_n(x_{(j)})$ on the observations of the sample within the interval $(x_{(i-m)}, x_{(i+m)})$. Hence it is required to fit $n$ models, each of which can be written as:
$$\frac{j}{n} = b_{0i} + b_{1i} x_{(j)}, \quad j = i-m,\dots,i+m$$
where:
where, by the usual OLS formulas and using the fact that the mean of $j/n$ over $j = i-m, \ldots, i+m$ is $i/n$,

$$b_{1i} = \frac{\sum_{j=i-m}^{i+m}\left(x_{(j)} - \bar{x}_{(i)}\right)\left(\dfrac{j}{n} - \dfrac{i}{n}\right)}{\sum_{j=i-m}^{i+m}\left(x_{(j)} - \bar{x}_{(i)}\right)^2} = \frac{\sum_{j=i-m}^{i+m}\left(x_{(j)} - \bar{x}_{(i)}\right)(j-i)}{n\sum_{j=i-m}^{i+m}\left(x_{(j)} - \bar{x}_{(i)}\right)^2} \qquad (3.2.8)$$
where

$$\bar{x}_{(i)} = \frac{1}{2m+1}\sum_{j=i-m}^{i+m} x_{(j)}$$
Finally the estimate of the entropy will be:

$$\hat{H}_{corr}(x) = -\frac{1}{n}\sum_{i=1}^{n}\ln(b_{1i})$$
A numerical comparison among $\hat{H}_{corr}(x)$, $\hat{H}_{van}(x)$ and $\hat{H}_{vas}(x)$ was performed by Correa (1995) with sample sizes 10, 20 and 50, each with $m$ equal to 1, 2, 3 and 4, for three distributions: N(0,1), Uniform(0,1) and Exp(1) ($\hat{H}_{van}(x)$ is not considered in this study). He concluded that $\hat{H}_{corr}(x)$ has the smallest mean squared error and is not affected by the level of the window size $m$.
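A minimal sketch of Correa's estimator as reconstructed above, in Python. The padding rule $x_{(j)} = x_{(1)}$ for $j < 1$ and $x_{(j)} = x_{(n)}$ for $j > n$ follows the model statement; the function name is ours, not Correa's.

```python
import math

def correa_entropy(sample, m):
    """Correa's m-spacing entropy estimate: H = -(1/n) * sum_i ln(b_1i)."""
    x = sorted(sample)
    n = len(x)

    def xs(j):  # x_(j) with boundary padding, 1-based index j
        return x[min(max(j, 1), n) - 1]

    total = 0.0
    for i in range(1, n + 1):
        window = range(i - m, i + m + 1)
        xbar = sum(xs(j) for j in window) / (2 * m + 1)
        num = sum((xs(j) - xbar) * (j - i) for j in window)  # slope numerator
        den = n * sum((xs(j) - xbar) ** 2 for j in window)   # slope denominator
        total += math.log(num / den)
    return -total / n
```

For an evenly spaced sample on (0,1) the estimate is close to the true uniform entropy 0, and rescaling the data by a factor $c$ shifts the estimate by exactly $\ln c$, as the slope $b_{1i}$ scales by $1/c$.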
3.2.3 Entropy Estimation Using Wieczorkowski and Grzegorzewski's Estimators
Wieczorkowski and Grzegorzewski (1999) proposed three new estimators; this study will concentrate only on their modifications of the estimators of Vasicek (1975) and Correa (1995). First, one can conclude from (3.2.5) that:
$$E(\hat{H}_{vas}(x)) = H(x) + E(U_{mn})$$

so that

$$E(U_{mn}) = \ln(n) - \ln(2m) + \left(1-\frac{2m}{n}\right)\psi(2m) - \psi(n+1) + \frac{2}{n}\sum_{i=1}^{m}\psi(i+m-1)$$
The bias of $\hat{H}_{vas}(x)$ can therefore be written as:

$$E(\hat{H}_{vas}(x)) - H(x) = \ln(n) - \ln(2m) + \left(1-\frac{2m}{n}\right)\psi(2m) - \psi(n+1) + \frac{2}{n}\sum_{i=1}^{m}\psi(i+m-1)$$
Wieczorkowski and Grzegorzewski (1999) decided to correct the bias of $\hat{H}_{vas}(x)$ by subtracting $E(U_{mn})$ from $\hat{H}_{vas}(x)$ as follows:
$$\hat{H}_{w1}(x) = \hat{H}_{vas}(x) - \ln(n) + \ln(2m) - \left(1-\frac{2m}{n}\right)\psi(2m) + \psi(n+1) - \frac{2}{n}\sum_{i=1}^{m}\psi(i+m-1) \qquad (3.2.9)$$
Actually (3.2.9) can be regarded as a corrected version of Vasicek's (1975) estimator; thus Wieczorkowski and Grzegorzewski (1999) were surprised that Vasicek (1975) had not used $\hat{H}_{w1}(x)$. Secondly, they proposed an estimator which modifies Correa's (1995) estimator by the jackknife method; their idea consists of the following steps:
1. Let $\hat{H}^{(-i)}_{corr}(x)$ be the estimator of $H(x)$ obtained after removing the $i$-th observation; it is thus required to calculate $\hat{H}^{(-i)}_{corr}(x)$ $n$ times.
2. Obtain $\bar{H}_{corr}(x)$, which has the following formula:
$$\bar{H}_{corr}(x) = \frac{1}{n}\sum_{i=1}^{n}\hat{H}^{(-i)}_{corr}(x)$$
3. Calculate the jackknife estimator $\hat{H}_{w2}(x)$, which can be expressed as:
$$\hat{H}_{w2}(x) = n\,\hat{H}_{corr}(x) - (n-1)\,\bar{H}_{corr}(x)$$
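The three steps can be sketched generically. The wrapper below is ours (not from the thesis); it is demonstrated on the plug-in variance, whose bias is exactly of the form $a/n$, so the jackknife removes it completely.

```python
import statistics

def jackknife(estimator, sample):
    """Steps 1-3: n leave-one-out estimates, their mean, then n*full - (n-1)*mean."""
    n = len(sample)
    full = estimator(sample)                                          # estimate on all data
    loo = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]  # step 1
    loo_mean = sum(loo) / n                                           # step 2
    return n * full - (n - 1) * loo_mean                              # step 3

def plugin_var(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)  # biased: divides by n

data = [1.0, 2.0, 4.0, 7.0]
print(jackknife(plugin_var, data), statistics.variance(data))  # both 7.0
```

Jackknifing the plug-in variance reproduces the unbiased sample variance exactly, a classical illustration of the bias reduction exploited for $\hat{H}_{w2}(x)$.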
It is obvious that $\hat{H}_{w2}(x)$ inherits the properties of $\hat{H}_{corr}(x)$: if $\hat{H}_{corr}(x)$ is an unbiased estimator then so are $\bar{H}_{corr}(x)$ and $\hat{H}_{w2}(x)$; however, if $\hat{H}_{corr}(x)$ is biased then $\hat{H}_{w2}(x)$ is biased as well, but less so. To prove this point, note that according to Wasserman (2006) the bias of many statistics can often be expressed as:

$$\operatorname{bias}(\hat{H}(x)) = \frac{a}{n} + \frac{b}{n^{2}} + O\!\left(\frac{1}{n^{3}}\right) \qquad (3.2.10)$$
where $O(n^{-k})$, according to Theil (1971), denotes a sequence of terms whose leading (dominant) term is of order $n^{-k}$, i.e. $n^{k}O(n^{-k})$ is a bounded sequence for large $n$. When (3.2.10) holds, it follows that:
$$\operatorname{bias}(\bar{H}_{corr}(x)) = \frac{a}{n-1} + \frac{b}{(n-1)^{2}} + O\!\left(\frac{1}{n^{3}}\right)$$
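Combining (3.2.10) with the last expansion, the standard jackknife computation completes the argument:

```latex
\operatorname{bias}(\hat{H}_{w2}(x))
  = n\,\operatorname{bias}(\hat{H}_{corr}(x)) - (n-1)\,\operatorname{bias}(\bar{H}_{corr}(x))
  = b\left(\frac{1}{n} - \frac{1}{n-1}\right) + O\!\left(\frac{1}{n^{2}}\right)
  = -\frac{b}{n(n-1)} + O\!\left(\frac{1}{n^{2}}\right)
```

so the $a/n$ term cancels and the bias of $\hat{H}_{w2}(x)$ is of order $1/n^{2}$ rather than $1/n$.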
Hence $\hat{H}_{w2}(x)$ works better than $\hat{H}_{corr}(x)$ because it has less bias; finally, $\hat{H}_{vas}(x)$ does not possess good statistical properties.
3.3 Goodness of Fit Tests Based on Maximum Entropy
The entropy of a random variable plays a fundamental role not only in information theory but also in the testing of hypotheses; indeed there are two different approaches to goodness of fit via maximum entropy: tests based on likelihood tests and tests based on sample $m$-spacings.
3.3.1 Goodness of Fit Based on Likelihood Tests
Stengos and Wu (2004, 2007) derived general distribution tests based on the method of maximum entropy density. Their tests rest on the fact that there is a one-to-one relation between the maximum entropy density $f(x, \lambda)$ and the probability density function of a specified distribution; in other words, they tested whether the maximum entropy density corresponding to the distribution under the null hypothesis, the normal distribution, is suitable for representing the given sample, via testing $\lambda_j = 0,\; j = 3, \ldots, J$. In practice $J$ will be small; in our case $J = 4$. They provided four flexible tests based on the Lagrange multiplier principle, and these reduce, surprisingly, to test statistics with a simple closed form. It is also worth noting that the Wald test (WT) and the likelihood ratio (LR) test can be used for testing $\lambda_j = 0,\; j = 3, 4$.
1. Tests Based on the Third and the Fourth Moments
Stengos and Wu (2004) claimed that one can test a distribution's maximum entropy density via the third and the fourth moments as follows:

$$f_1(x, \lambda) = \exp\left(-\lambda_0 - \sum_{j=1}^{4}\lambda_j x^{j}\right)$$
Clearly $f_1(x,\lambda)$ is integrable over the real line only if the dominant term in the exponent, $x^4$, is of even degree; otherwise $f_1(x,\lambda)$ will explode as $x \to \pm\infty$. The second condition is that the coefficient associated with the dominant term, which is of even degree by the first condition, must be positive; otherwise $f_1(x,\lambda)$ will again explode as $x \to \pm\infty$. For testing the normality of the sample it is required to test:
$$H_0: \lambda_j = 0 \quad \text{vs} \quad H_1: \lambda_j = \hat{\lambda}_j, \qquad j = 3, 4$$

where $\hat{\lambda}$ denotes the maximum entropy estimator of $\lambda$.
a) According to Stengos and Wu (2004), to run the Lagrange multiplier test it is required to obtain the score function and the information matrix under the null hypothesis (the sample follows the standard normal), as follows:
$$S_1(\lambda^{\circ}) = \left.\frac{d}{d\lambda_j}\sum_{i=1}^{n}\ln\exp\left(-\lambda_0 - \sum_{j=1}^{4}\lambda_j x_i^{j}\right)\right|_{\lambda=\lambda^{\circ}} = \left.n\left\{-\frac{d}{d\lambda_j}\ln\left(\int_{-\infty}^{\infty}\exp\left(-\sum_{j=1}^{4}\lambda_j x^{j}\right)dx\right) - m_j\right\}\right|_{\lambda=\lambda^{\circ}}$$

where $m_j = \frac{1}{n}\sum_{i=1}^{n} x_i^{j}$ denotes the $j$-th sample moment, so that

$$S_1(\lambda^{\circ}) = n\left(E^{\circ}(x) - m_1,\; E^{\circ}(x^2) - m_2,\; E^{\circ}(x^3) - m_3,\; E^{\circ}(x^4) - m_4\right)'$$
where $E^{\circ}(x^{j})$ refers to the expectation of $x^{j}$ under the standard normal; using (3.1.21), $E^{\circ}(x^{j})$ has the following formula:

$$E^{\circ}(x^{j}) = \int_{-\infty}^{\infty} x^{j}\exp\left(-\lambda^{\circ}_N - 0.5x^{2}\right)dx$$
where

$$\lambda^{\circ}_N = \ln\left(\int_{-\infty}^{\infty}\exp(-0.5x^{2})\,dx\right)$$
Since it is required to test only whether $\lambda_3$ and $\lambda_4$ equal zero, the score function under the standard normal becomes:

$$S_1(\lambda^{\circ}) = n\left(0,\; 0,\; E^{\circ}(x^3) - m_3,\; E^{\circ}(x^4) - m_4\right)' = n\left(0,\; 0,\; -m_3,\; 3 - m_4\right)'$$
The information matrix $I_{4\times 4}(\lambda^{\circ})$ associated with testing normality can be expressed as:

$$I(\lambda^{\circ}) = \left\{-nE\left[\frac{d^{2}\ln f_1(x,\lambda)}{d\lambda_i\, d\lambda_j}\right]\right\}_{\lambda=\lambda^{\circ}} = n\left\{E^{\circ}(x^{i+j}) - E^{\circ}(x^{i})E^{\circ}(x^{j})\right\} = n\begin{pmatrix} 1 & 0 & 3 & 0 \\ 0 & 2 & 0 & 12 \\ 3 & 0 & 15 & 0 \\ 0 & 12 & 0 & 96 \end{pmatrix}, \qquad 1 \le i, j \le 4$$
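The entries of this matrix follow from the standard normal moments $E(x^{k}) = 0$ for odd $k$ and $(k-1)!!$ for even $k$; the short check below rebuilds the matrix from that rule (illustrative code of ours, not from the thesis).

```python
from fractions import Fraction

def normal_moment(k):
    """E(x^k) under N(0,1): 0 for odd k, double factorial (k-1)!! for even k."""
    if k % 2:
        return Fraction(0)
    r = Fraction(1)
    for j in range(k - 1, 0, -2):
        r *= j
    return r

# I(lambda)/n with entries E(x^{i+j}) - E(x^i) E(x^j), 1 <= i, j <= 4
M = [[normal_moment(i + j) - normal_moment(i) * normal_moment(j)
      for j in range(1, 5)] for i in range(1, 5)]
for row in M:
    print([int(v) for v in row])
```

The loop prints the four rows [1, 0, 3, 0], [0, 2, 0, 12], [3, 0, 15, 0], [0, 12, 0, 96], matching the matrix above.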
According to Engle (1984) the Lagrange multiplier test takes the following form:

$$LM_1' = S_1(\lambda^{\circ})'\, I^{-1}(\lambda^{\circ})\, S_1(\lambda^{\circ}) = n\left(\frac{m_3^{2}}{6} + \frac{(m_4-3)^{2}}{24}\right)$$
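The closed form can be transcribed directly into code. Since the derivation assumes a standard normal null, the sketch below (our naming and an assumption of ours for convenience) standardizes the sample with its own first two moments before computing $m_3$ and $m_4$.

```python
import math

def lm1_statistic(sample):
    """LM_1' = n * (m3^2/6 + (m4 - 3)^2/24) on the standardized sample."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / n
    z = [(v - mean) / math.sqrt(var) for v in sample]
    m3 = sum(v ** 3 for v in z) / n   # third sample moment (skewness)
    m4 = sum(v ** 4 for v in z) / n   # fourth sample moment (kurtosis)
    return n * (m3 ** 2 / 6 + (m4 - 3) ** 2 / 24)
```

For a symmetric two-point sample the statistic has an exact hand-checkable value: with $z = \pm 1$ one gets $m_3 = 0$ and $m_4 = 1$, so $LM_1' = n(1-3)^{2}/24 = n/6$.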
Although $LM_1'$ has a simple closed form, it tests only whether the sample follows the standard normal, so the sample must be standardized before the test is applied. Fortunately, $LM_1$ can also be derived under the normal($\hat{\mu}$, $\hat{\sigma}^{2}$), where $\hat{\mu}$ and $\hat{\sigma}^{2}$ denote the maximum entropy estimators of the normal distribution's parameters. Hence the score function under normal($\hat{\mu}$, $\hat{\sigma}^{2}$) will be:

$$S_1(\hat{\lambda}) = n\left(0,\; 0,\; \hat{E}(x^3) - m_3,\; \hat{E}(x^4) - m_4\right)'$$
where $\hat{E}(x^{j})$ refers to the expectation of $x^{j}$ under normal($\hat{\mu}$, $\hat{\sigma}^{2}$); using (3.1.21) it can be written as:

$$\hat{E}(x^{j}) = \int_{-\infty}^{\infty} x^{j} f_N(x,\hat{\lambda})\,dx = \int_{-\infty}^{\infty} x^{j}\exp\left(-\hat{\lambda}_N - \frac{(x-m_1)^{2}}{2(m_2-m_1^{2})}\right)dx$$
where

$$\hat{\lambda}_N = \ln\left(\int_{-\infty}^{\infty}\exp\left(-\frac{(x-m_1)^{2}}{2(m_2-m_1^{2})}\right)dx\right)$$
The information matrix under normal($\hat{\mu}$, $\hat{\sigma}^{2}$) can be expressed as:

$$I(\hat{\lambda}) = n\left\{\hat{E}(x^{i+j}) - \hat{E}(x^{i})\hat{E}(x^{j})\right\}, \qquad 1 \le i, j \le 4$$
Finally the Lagrange multiplier test will be:

$$LM_1 = S_1(\hat{\lambda})'\, I^{-1}(\hat{\lambda})\, S_1(\hat{\lambda})$$

One can observe that $LM_1'$ is a special case of $LM_1$.
b) For testing $H_0: \lambda_h = 0,\; h = 3, 4$ via the Wald test (WT), it is first required to partition $\lambda$ as $\lambda' = (\lambda_1', \lambda_2')$, where