
Statistical Methods

"Never trust a statistic you didn't forge yourself"

Winston Churchill

Florian Herzog

2013


Independent and identically distributed random variables

Definition 1. The random variables X1, ..., Xn are called a random sample of size n from the population f(x) if X1, ..., Xn are mutually independent random variables and the marginal pdf of each Xi is the same function f(x). Alternatively, X1, ..., Xn are called independent and identically distributed random variables with pdf f(x).

The joint pdf of X1, ..., Xn is given as:

f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i)
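As a small illustration, here is a minimal sketch (assuming NumPy and SciPy, with made-up parameters) that evaluates the joint pdf of an i.i.d. normal sample as the product of the marginal densities:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical i.i.d. sample from N(mu, sigma^2) with assumed parameters.
rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 5
x = rng.normal(mu, sigma, size=n)

# For an i.i.d. sample the joint pdf factorizes into the marginal pdfs.
joint_pdf = np.prod(norm.pdf(x, loc=mu, scale=sigma))
print(joint_pdf)
```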

The slides of this section follow closely Chapters 5 and 7 of the book "G. Casella and R. Berger, Statistical Inference, Duxbury Press, 2002".


Identically and independently distributed random variables

Often in statistics (especially in estimation) we assume identically and independently distributed (i.i.d.) random variables (r.v.). This means that the random variables Xk, where k = 1, 2, ... denotes the index of the realizations of the r.v., have the following properties (a small simulation check is sketched after the list):

• Each Xk ∼ f(x) is drawn from the same density f(x).

• Xk is independent of Xk−1, Xk−2, ..., X1.

• Each Xk is uncorrelated with each Xj, i.e., Cov(Xk, Xj) = 0 for all j ≠ k.
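A minimal sketch of these properties, assuming NumPy and made-up distribution parameters:

```python
import numpy as np

# Draw i.i.d. samples from one density and check that distinct
# observations are (empirically) uncorrelated.
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

# The correlation between X_k and X_{k-1} should be close to 0.
lag1_corr = np.corrcoef(x[1:], x[:-1])[0, 1]
print(f"empirical lag-1 correlation: {lag1_corr:.4f}")
```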


Sample mean and variance

Definition 2. The sample mean is the arithmetic average of the values in a random sample and is denoted as

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

Definition 3. The sample variance is the statistic defined as

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

The sample standard deviation is the statistic defined as S = \sqrt{S^2}.
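A minimal sketch of these statistics, assuming NumPy and a hypothetical data vector:

```python
import numpy as np

# Hypothetical data vector.
x = np.array([2.1, 3.4, 1.8, 2.9, 3.1])

x_bar = x.mean()        # sample mean
s2 = x.var(ddof=1)      # sample variance (divides by n - 1)
s = np.sqrt(s2)         # sample standard deviation
print(x_bar, s2, s)
```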


Properties of i.i.d. random variables

Theorem 1. Let X1, X2, ..., Xn be independent and identically distributed (i.i.d.) random variables with mean µ = E[Xi] and variance σ² = Var[Xi]. Then

E[\bar{X}] = \mu

\mathrm{Var}[\bar{X}] = \frac{\sigma^2}{n}

E[S^2] = \sigma^2
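A Monte Carlo sanity check of these three identities, assuming NumPy and drawing from a normal distribution with made-up parameters for concreteness (Theorem 1 itself does not require normality):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 3.0, 20, 50_000

samples = rng.normal(mu, sigma, size=(reps, n))
x_bars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)

print(x_bars.mean(), mu)             # E[X_bar]   ~ mu
print(x_bars.var(), sigma**2 / n)    # Var[X_bar] ~ sigma^2 / n
print(s2s.mean(), sigma**2)          # E[S^2]     ~ sigma^2
```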


Properties of i.i.d. random variables

Theorem 2. Let X1, X2, ..., Xn be i.i.d. from a normal distribution with mean µ and variance σ². Then

1. \bar{X} and S² are independent random variables,

2. \bar{X} is distributed N(µ, σ²/n),

3. (n−1)S²/σ² has a chi-square distribution with n−1 degrees of freedom.
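A quick simulation check of point 3, assuming NumPy and made-up parameters; the statistic (n−1)S²/σ² should have the mean n−1 and variance 2(n−1) of a chi-square distribution with n−1 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 1.5, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

print(stat.mean(), n - 1)        # ~ n - 1    (chi-square mean)
print(stat.var(), 2 * (n - 1))   # ~ 2(n - 1) (chi-square variance)
```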


Convergence of a sequence of r.v. Xn

1. The sequence Xn converges to X with probability one (or almost surely), X_n \xrightarrow{1} X, if

P\left(\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right) = 1.

2. The sequence Xn converges to X in probability, X_n \xrightarrow{p} X, if

\lim_{n \to \infty} P\left(\omega \in \Omega : |X_n(\omega) - X(\omega)| > \varepsilon\right) = 0, \quad \text{for all } \varepsilon > 0.

3. The sequence Xn converges to X in L^p, X_n \xrightarrow{L^p} X, if

\lim_{n \to \infty} E\left[|X_n(\omega) - X(\omega)|^p\right] = 0.
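A small illustration of convergence in probability, assuming NumPy: the sample mean of i.i.d. Uniform(0, 1) draws converges in probability to µ = 0.5 (the weak law of large numbers), so the probability of landing more than ε away from µ shrinks with n:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, eps, reps = 0.5, 0.05, 2_000

for n in (10, 100, 1_000, 10_000):
    x_bars = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
    prob_far = np.mean(np.abs(x_bars - mu) > eps)
    print(f"n={n:>6}  P(|X_bar - mu| > {eps}) ~ {prob_far:.4f}")
```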


Convergence concepts interrelations

X_n \xrightarrow{1} X (almost sure) \;\Longrightarrow\; X_n \xrightarrow{p} X (in probability)

X_n \xrightarrow{L^p} X \;\Longrightarrow\; X_n \xrightarrow{L^q} X, \ q < p \;\Longrightarrow\; X_n \xrightarrow{p} X (in probability)

X_n \xrightarrow{p} X (in probability) \;\Longrightarrow\; X_n \xrightarrow{d} X (in distribution)


Parameter estimation (Point estimation)

In stochastic systems modeling, we often build models from data observations (and not from physical first principles). We need statistically motivated methods to identify the stochastic systems under consideration. The identification of the stochastic systems requires the following:

• Identification of the distribution

• Identification of the dynamics

• Identification of the system parameters

• Analysis of the parameter significance

In this section we focus only on parameter estimation and assume that the distribution is known. We will come back to this topic after the theoretical introduction of stochastic processes.


Parameter estimation (Point estimation)

Definition 1. A point estimator is any function W(X1, X2, ..., Xn) of a sample of random variables.

There are several ways of finding point estimators; the main ones are:

• Methods of moments (MM)

• Maximum Likelihood estimators (MLE)

• Expectation Maximization (EM)

• Bayes Estimators

Besides the methods of finding a point estimator, the evaluation (quality) of the estimator is of interest. In the following slides, we will introduce the method of moments and the maximum likelihood estimator.


Method of moments

Let X1, X2, ..., Xn be a sample from a population with pdf f(x|θ1, θ2, ..., θk). The parameters θi are the distribution parameters, e.g., µ and σ in the case of a normal distribution.

Definition 2. The method of moments is the matching of the first k moments of the data with the first k theoretical moments of the distribution. The theoretical moments are functions of the parameters, and the parameter estimation problem is reduced to solving k equations.


We have

m_1 = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \mu'_1 = E[X],

m_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2, \qquad \mu'_2 = E[X^2],

\vdots

m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k, \qquad \mu'_k = E[X^k],

where mi denotes the sample moments and µ′i the theoretical moments.
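A minimal sketch for computing the first k sample moments, assuming NumPy and a hypothetical helper function `sample_moments`:

```python
import numpy as np

def sample_moments(x, k):
    """Return [m_1, ..., m_k] with m_j = (1/n) * sum(x_i ** j)."""
    x = np.asarray(x, dtype=float)
    return [np.mean(x**j) for j in range(1, k + 1)]

rng = np.random.default_rng(4)
data = rng.normal(2.0, 1.5, size=10_000)   # assumed example data
print(sample_moments(data, k=2))           # ~ [mu, mu^2 + sigma^2]
```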


Since µ′i is a function of θ1, ..., θk, we get the following system of equations:

m_1 = \mu'_1(\theta_1, \ldots, \theta_k),

m_2 = \mu'_2(\theta_1, \ldots, \theta_k),

\vdots

m_k = \mu'_k(\theta_1, \ldots, \theta_k),

where mi denotes the sample moments and µ′i the theoretical moments. The parameter estimates are found by solving this system of k equations.


As the main example, we assume that the data are generated by a normal distribution with mean µ and variance σ². We denote θ1 = µ and θ2 = σ². Matching the first and second moments of the normal distribution with the sample moments gives

\mu'_1 = \mu = \frac{1}{n} \sum_{i=1}^{n} X_i

\mu'_2 = \mu^2 + \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2


Solving for µ and σ² we get:

\mu = \frac{1}{n} \sum_{i=1}^{n} X_i

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} X_i \right)^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2

The solutions are the sample moments of mean and variance and are of course the "natural" way of estimating the mean and variance of the normal distribution.
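A minimal sketch of these method-of-moments estimates, assuming NumPy and made-up example data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=5_000)   # assumed example data

m1 = x.mean()            # first sample moment
m2 = np.mean(x**2)       # second sample moment

mu_hat = m1              # from mu_1' = mu
sigma2_hat = m2 - m1**2  # from mu_2' = mu^2 + sigma^2
print(mu_hat, sigma2_hat)
```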


Maximum Likelihood Estimation

The likelihood is the joint pdf of X1, ..., Xn and is given as:

L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta_1, \ldots, \theta_k).

We denote by x = [x1, x2, ...]^T the data vector and by θ = [θ1, θ2, ...]^T the parameter vector.

Definition 3. For each sample x, let θ̂(x) be a parameter value at which L(θ|x) attains its maximum as a function of θ. A maximum likelihood estimator (MLE) of the parameter θ based on the sample X is θ̂(X).

If the likelihood function is C², then possible candidates for the MLE are the values of θ which solve

\frac{\partial}{\partial \theta_i} L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = 0, \qquad i = 1, \ldots, k.
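A minimal numerical sketch, assuming NumPy/SciPy and made-up data: the MLE of a normal model's µ and σ found by minimizing the negative log-likelihood (the log transform of σ is only a convenience to keep σ > 0 during optimization):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, scale=2.0, size=1_000)   # assumed example data

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # ~ sample mean and (biased) sample std
```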


Maximum log-Likelihood Estimation

Theorem 1. Maximum likelihood estimation is equivalent to maximum log-likelihood estimation. The log-likelihood is defined as

l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \log f(x_i | \theta_1, \ldots, \theta_k).

Example: We want to derive the maximum likelihood estimate for the mean µ of the normal distribution under the assumption of known variance σ². The log-pdf of the normal distribution is given as:

\log f(x_i | \mu) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2} \right)


The log-likelihood function is given as:

l(\mu | x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2} \right)

Since σ is known, the maximization problem reduces to a least-squares problem:

\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2

which has the solution

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i


Invariance of Maximum Likelihood Estimation

Theorem 2. The invariance property of MLEs states that if θ̂ is the MLE of θ, then for any function τ(θ), the MLE of τ(θ) is τ(θ̂).

Suppose that a distribution is parameterized by a parameter θ, but we are interested in finding an estimator for some function of θ, say τ(θ); then we can still use the MLE for θ. An example is as follows: If θ is the mean of a normal distribution, the MLE of sin(θ) is sin(µ̂), where µ̂ is the MLE of the mean.
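A minimal sketch of the invariance property, assuming NumPy and made-up data: the MLE of sin(θ) is obtained by simply applying sin to the MLE of the mean (the sample mean):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=0.8, scale=1.0, size=2_000)   # assumed example data

mu_hat = x.mean()          # MLE of theta (the mean of the normal model)
tau_hat = np.sin(mu_hat)   # MLE of tau(theta) = sin(theta), by invariance
print(mu_hat, tau_hat)
```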


Quality of estimators: MSE

Definition 4. The mean squared error (MSE) of an estimator W of a parameter θ is the function defined by E_θ[(W − θ)²].

The MSE of W measures the average squared distance between the estimator and the true value of the parameter. The MSE has the following decomposition:

E_θ[(W − θ)²] = Var_θ[W] + (E_θ[W] − θ)²

The first term is the variance of the estimator W and the second term is the squared bias.

Definition 5. The bias of an estimator W of the parameter θ is the distance between the expected value of W and the true value of θ. An estimator whose bias is zero is called an unbiased estimator.
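A Monte Carlo sketch of the decomposition MSE = variance + bias², assuming NumPy and using the biased variance estimator (with denominator n) of normal data as the estimator W:

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n, reps = 0.0, 2.0, 15, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
w = samples.var(axis=1, ddof=0)          # biased estimator of sigma^2

mse = np.mean((w - sigma**2) ** 2)       # E[(W - theta)^2]
var_w = w.var()                          # Var[W]
bias_sq = (w.mean() - sigma**2) ** 2     # (E[W] - theta)^2
print(mse, var_w + bias_sq)              # the two should nearly agree
```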


Quality of estimators: Bias and variance

In the multivariate case where θ is a vector of parameters, the variance of the estimator is a covariance matrix Cov(θ̂). An estimator with low variance (covariance) is called an efficient estimator (in the sense that little data is needed). The MSE therefore often involves a trade-off between an unbiased estimator with higher variance and a biased but more efficient estimator.

The true value of the MSE can often not be determined since the true value of θ is not known. Therefore, we focus on the variance of the estimator in order to describe the quality of the estimator.

Definition 6. An estimator θ̂ is called consistent when θ̂ →p θ, where →p denotes convergence in probability. An unbiased estimator whose variance tends to zero as n → ∞ is also consistent.


Quality of estimators: Normal distribution example

When we have X1, X2, ... i.i.d. data from a N(µ, σ²) distribution and use the sample mean X̄ and sample variance S² as estimators:

• E[X̄] = µ and therefore X̄ is unbiased

• E[S²] = σ² and therefore S² is unbiased

• E[(X̄ − µ)²] = Var[X̄] = σ²/n

• E[(S² − σ²)²] = Var[S²] = 2σ⁴/(n−1)

The MSE of X̄ is still σ²/n when the data is not normal, but this does not hold for the MSE of S² when the data is not normally distributed.
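A simulation sketch of the two variance formulas above, assuming NumPy and made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma, n, reps = 1.0, 2.0, 25, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
x_bars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)

print(x_bars.var(), sigma**2 / n)           # Var[X_bar] ~ sigma^2 / n
print(s2s.var(), 2 * sigma**4 / (n - 1))    # Var[S^2]   ~ 2 sigma^4 / (n-1)
```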


Quality of estimators: Cramer-Rao bound for the variance

Definition 7. The Fisher information matrix J is defined as

J_{i,j} = \frac{1}{n} \, E_{\theta}\!\left[ \frac{\partial}{\partial \theta_i} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) \cdot \frac{\partial}{\partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) \right],

which is known as the outer product form. Under certain regularity conditions and when the log-likelihood function is C², it can be calculated as

J_{i,j} = -\frac{1}{n} \, E_{\theta}\!\left[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) \right],

which is called the inner product form. Note that the expectation is conditional on θ.
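A numerical sketch, assuming NumPy and made-up parameters: for a normal model with known σ, the (per-observation) Fisher information of the mean is 1/σ², which can be approximated from the curvature of the log-likelihood with a finite difference:

```python
import numpy as np

rng = np.random.default_rng(10)
mu, sigma, n = 1.0, 2.0, 100_000
x = rng.normal(mu, sigma, size=n)

def log_lik(m):
    return -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - m) ** 2 / sigma**2)

# Second derivative of l at mu via a central finite difference.
h = 1e-4
d2l = (log_lik(mu + h) - 2 * log_lik(mu) + log_lik(mu - h)) / h**2

print(-d2l / n, 1 / sigma**2)   # both ~ 1 / sigma^2
```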


The Cramer-Rao bound states the following:

Theorem 3. The covariance of an unbiased estimator W is bounded by

\mathrm{Cov}(W) \geq \frac{J^{-1}}{N},

where N is the number of observations. This bound can also be used to make a worst-case approximation of the efficiency of an estimator.

The Fisher information matrix allows us to compute the uncertainty and thus the quality of an estimator.


Quality of estimators

MLE is the main method for finding estimators, since it has the following properties:

• Consistency: the estimator converges in probability to the value being estimated.

• Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix.

• Efficiency: it achieves the Cramer-Rao lower bound when the sample size tends to infinity. This means that no asymptotically unbiased estimator has a lower asymptotic mean squared error than the MLE.

• The estimate of θ is approximately distributed as θ̂ ∼ N(θ_ML, J⁻¹/N).

• Barrett's Theorem: "The maximum-likelihood procedure in any problem is what you are most likely to do if you don't know any statistics."
