
Statistical Methods

"Never trust a statistic you didn't forge yourself"

Winston Churchill

Florian Herzog

2013


Independent and identically distributed random variables

Definition 1. The random variables X1, ..., Xn are called a random sample of size n from the population f(x) if X1, ..., Xn are mutually independent random variables and the marginal pdf of each Xi is the same function f(x). Alternatively, X1, ..., Xn are called independent and identically distributed random variables with pdf f(x).

The joint pdf of X1, ..., Xn is given as:

f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i)
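As a small illustration, here is a minimal sketch (assuming NumPy and SciPy, with made-up parameters) that evaluates the joint pdf of an i.i.d. normal sample as the product of the marginal densities:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical i.i.d. sample from N(mu, sigma^2) with assumed parameters.
rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 5
x = rng.normal(mu, sigma, size=n)

# For an i.i.d. sample the joint pdf factorizes into the marginal pdfs.
joint_pdf = np.prod(norm.pdf(x, loc=mu, scale=sigma))
print(joint_pdf)
```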

The slides of this section follow closely Chapters 5 and 7 of the book "G. Casella and R. Berger, Statistical Inference, Duxbury Press, 2002".


Identically and independently distributed random variables

Often in statistics (especially in estimation) we assume identically and independently distributed (i.i.d.) random variables (r.v.). This means that the random variables Xk, where k = 1, 2, ... denotes the index of the realizations of the r.v., have the following properties (a small simulation check is sketched after the list):

• Each Xk ∼ f(x) is drawn from the same density f(x).

• Xk is independent of Xk−1, Xk−2, ..., X1.

• Each Xk is uncorrelated with each Xj, i.e., Cov(Xk, Xj) = 0 for all j ≠ k.
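A minimal sketch of these properties, assuming NumPy and made-up distribution parameters:

```python
import numpy as np

# Draw i.i.d. samples from one density and check that distinct
# observations are (empirically) uncorrelated.
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

# The correlation between X_k and X_{k-1} should be close to 0.
lag1_corr = np.corrcoef(x[1:], x[:-1])[0, 1]
print(f"empirical lag-1 correlation: {lag1_corr:.4f}")
```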


Sample mean and variance

Definition 2. The sample mean is the arithmetic average of the values in a random sample and is denoted as

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

Definition 3. The sample variance is the statistic defined as

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

The sample standard deviation is the statistic defined as S = \sqrt{S^2}.
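A minimal sketch of these statistics, assuming NumPy and a hypothetical data vector:

```python
import numpy as np

# Hypothetical data vector.
x = np.array([2.1, 3.4, 1.8, 2.9, 3.1])

x_bar = x.mean()        # sample mean
s2 = x.var(ddof=1)      # sample variance (divides by n - 1)
s = np.sqrt(s2)         # sample standard deviation
print(x_bar, s2, s)
```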


Properties of i.i.d. random variables

Theorem 1. Let X1, X2, ..., Xn be independent and identically distributed (i.i.d.) random variables with mean µ = E[Xi] and variance σ² = Var[Xi]. Then

E[\bar{X}] = \mu

\mathrm{Var}[\bar{X}] = \frac{\sigma^2}{n}

E[S^2] = \sigma^2
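A Monte Carlo sanity check of these three identities, assuming NumPy and drawing from a normal distribution with made-up parameters for concreteness (Theorem 1 itself does not require normality):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 3.0, 20, 50_000

samples = rng.normal(mu, sigma, size=(reps, n))
x_bars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)

print(x_bars.mean(), mu)             # E[X_bar]   ~ mu
print(x_bars.var(), sigma**2 / n)    # Var[X_bar] ~ sigma^2 / n
print(s2s.mean(), sigma**2)          # E[S^2]     ~ sigma^2
```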


Properties of i.i.d. random variables

Theorem 2. Let X1, X2, ..., Xn be i.i.d. from a normal distribution with mean µ and variance σ². Then

1. \bar{X} and S² are independent random variables,

2. \bar{X} is distributed N(µ, σ²/n),

3. (n−1)S²/σ² has a chi-square distribution with n−1 degrees of freedom.
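A quick simulation check of point 3, assuming NumPy and made-up parameters; the statistic (n−1)S²/σ² should have the mean n−1 and variance 2(n−1) of a chi-square distribution with n−1 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 1.5, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

print(stat.mean(), n - 1)        # ~ n - 1    (chi-square mean)
print(stat.var(), 2 * (n - 1))   # ~ 2(n - 1) (chi-square variance)
```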


Convergence of a sequence of r.v. Xn

1. The sequence Xn converges to X with probability one (or almost surely), X_n \xrightarrow{1} X, if

P\left(\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right) = 1.

2. The sequence Xn converges to X in probability, X_n \xrightarrow{p} X, if

\lim_{n \to \infty} P\left(\omega \in \Omega : |X_n(\omega) - X(\omega)| > \varepsilon\right) = 0, \quad \text{for all } \varepsilon > 0.

3. The sequence Xn converges to X in L^p, X_n \xrightarrow{L^p} X, if

\lim_{n \to \infty} E\left[|X_n(\omega) - X(\omega)|^p\right] = 0.
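A small illustration of convergence in probability, assuming NumPy: the sample mean of i.i.d. Uniform(0, 1) draws converges in probability to µ = 0.5 (the weak law of large numbers), so the probability of landing more than ε away from µ shrinks with n:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, eps, reps = 0.5, 0.05, 2_000

for n in (10, 100, 1_000, 10_000):
    x_bars = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
    prob_far = np.mean(np.abs(x_bars - mu) > eps)
    print(f"n={n:>6}  P(|X_bar - mu| > {eps}) ~ {prob_far:.4f}")
```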


Convergence concepts interrelations

X_n \xrightarrow{1} X (almost sure) \;\Longrightarrow\; X_n \xrightarrow{p} X (in probability)

X_n \xrightarrow{L^p} X \;\Longrightarrow\; X_n \xrightarrow{L^q} X, \ q < p \;\Longrightarrow\; X_n \xrightarrow{p} X (in probability)

X_n \xrightarrow{p} X (in probability) \;\Longrightarrow\; X_n \xrightarrow{d} X (in distribution)


Parameter estimation (Point estimation)

In stochastic systems modeling, we often build models from data observations (and not from physical first principles). We need statistically motivated methods to identify the stochastic systems under consideration. The identification of the stochastic systems requires the following:

• Identification of the distribution

• Identification of the dynamics

• Identification of the system parameters

• Analysis of the parameter significance

In this section we focus only on parameter estimation and assume that the distribution is known. We will come back to this topic after the theoretical introduction of stochastic processes.


Parameter estimation (Point estimation)

Definition 1. A point estimator is any function W(X1, X2, ..., Xn) of a sample of random variables.

There are several ways of finding point estimators; the main ones are:

• Methods of moments (MM)

• Maximum Likelihood estimators (MLE)

• Expectation Maximization (EM)

• Bayes Estimators

Besides the methods of finding a point estimator, the evaluation (quality) of the estimator is of interest. In the following slides, we will introduce the method of moments and the maximum likelihood estimator.


Method of moments

Let X1, X2, ..., Xn be a sample from a population with pdf f(x|θ1, θ2, ..., θk). The parameters θi are the distribution parameters, e.g., µ and σ in the case of a normal distribution.

Definition 2. The method of moments is the matching of the first k moments of the data with the first k theoretical moments of the distribution. The theoretical moments are functions of the parameters, and the parameter estimation problem is reduced to solving k equations.


We have

m_1 = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \mu'_1 = E[X],

m_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2, \qquad \mu'_2 = E[X^2],

\vdots

m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k, \qquad \mu'_k = E[X^k],

where mi denotes the sample moments and µ′i the theoretical moments.
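A minimal sketch for computing the first k sample moments, assuming NumPy and a hypothetical helper function `sample_moments`:

```python
import numpy as np

def sample_moments(x, k):
    """Return [m_1, ..., m_k] with m_j = (1/n) * sum(x_i ** j)."""
    x = np.asarray(x, dtype=float)
    return [np.mean(x**j) for j in range(1, k + 1)]

rng = np.random.default_rng(4)
data = rng.normal(2.0, 1.5, size=10_000)   # assumed example data
print(sample_moments(data, k=2))           # ~ [mu, mu^2 + sigma^2]
```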


Since µ′i is a function of θ1, ..., θk, we get the following system of equations:

m_1 = \mu'_1(\theta_1, \ldots, \theta_k),

m_2 = \mu'_2(\theta_1, \ldots, \theta_k),

\vdots

m_k = \mu'_k(\theta_1, \ldots, \theta_k),

where mi denotes the sample moments and µ′i the theoretical moments. The parameter estimates are found by solving this system of k equations.


As the main example, we assume that the data are generated by a normal distribution with mean µ and variance σ². We denote θ1 = µ and θ2 = σ². Matching the first and second moments of the normal distribution with the sample moments gives

\mu'_1 = \mu = \frac{1}{n} \sum_{i=1}^{n} X_i

\mu'_2 = \mu^2 + \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2


Solving for µ and σ² we get:

\mu = \frac{1}{n} \sum_{i=1}^{n} X_i

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} X_i \right)^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2

The solutions are the sample moments of mean and variance and are of course the "natural" way of estimating the mean and variance of the normal distribution.
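A minimal sketch of these method-of-moments estimates, assuming NumPy and made-up example data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=5_000)   # assumed example data

m1 = x.mean()            # first sample moment
m2 = np.mean(x**2)       # second sample moment

mu_hat = m1              # from mu_1' = mu
sigma2_hat = m2 - m1**2  # from mu_2' = mu^2 + sigma^2
print(mu_hat, sigma2_hat)
```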


Maximum Likelihood Estimation

The likelihood is the joint pdf of X1, ..., Xn and is given as:

L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta_1, \ldots, \theta_k).

We denote by x = [x1, x2, ...]^T the data vector and by θ = [θ1, θ2, ...]^T the parameter vector.

Definition 3. For each sample x, let θ̂(x) be a parameter value at which L(θ|x) attains its maximum as a function of θ. A maximum likelihood estimator (MLE) of the parameter θ based on the sample X is θ̂(X).

If the likelihood function is C², then possible candidates for the MLE are the values of θ which solve

\frac{\partial}{\partial \theta_i} L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = 0, \qquad i = 1, \ldots, k.
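A minimal numerical sketch, assuming NumPy/SciPy and made-up data: the MLE of a normal model's µ and σ found by minimizing the negative log-likelihood (the log transform of σ is only a convenience to keep σ > 0 during optimization):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, scale=2.0, size=1_000)   # assumed example data

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # ~ sample mean and (biased) sample std
```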


Maximum log-Likelihood Estimation

Theorem 1. Maximum likelihood estimation is equivalent to maximum log-likelihood estimation. The log-likelihood is defined as

l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \log f(x_i | \theta_1, \ldots, \theta_k).

Example: We want to derive the maximum likelihood estimate for the mean µ of the normal distribution under the assumption of known variance σ². The log-pdf of the normal distribution is given as:

\log f(x_i | \mu) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2} \right)


The log-likelihood function is given as:

l(\mu | x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2} \right)

Since σ is known, the maximization problem reduces to a least-squares problem:

\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2

which has the solution

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i


Invariance of Maximum Likelihood Estimation

Theorem 2. The invariance property of MLEs states that if θ̂ is the MLE of θ, then for any function τ(θ), the MLE of τ(θ) is τ(θ̂).

Suppose that a distribution is parameterized by a parameter θ, but we are interested in finding an estimator for some function of θ, say τ(θ); then we can still use the MLE for θ. An example is as follows: If θ is the mean of a normal distribution, the MLE of sin(θ) is sin(µ̂), where µ̂ is the MLE of the mean.
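A minimal sketch of the invariance property, assuming NumPy and made-up data: the MLE of sin(θ) is obtained by simply applying sin to the MLE of the mean (the sample mean):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=0.8, scale=1.0, size=2_000)   # assumed example data

mu_hat = x.mean()          # MLE of theta (the mean of the normal model)
tau_hat = np.sin(mu_hat)   # MLE of tau(theta) = sin(theta), by invariance
print(mu_hat, tau_hat)
```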


Quality of estimators: MSE

Definition 4. The mean squared error (MSE) of an estimator W of a parameter θ is the function defined by E_θ[(W − θ)²].

The MSE of W measures the average squared distance between the estimator and the true value of the parameter. The MSE has the following decomposition:

E_θ[(W − θ)²] = Var_θ[W] + (E_θ[W] − θ)²

The first term is the variance of the estimator W and the second term is the squared bias.

Definition 5. The bias of an estimator W of the parameter θ is the distance between the expected value of W and the true value of θ. An estimator whose bias is zero is called an unbiased estimator.
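A Monte Carlo sketch of the decomposition MSE = variance + bias², assuming NumPy and using the biased variance estimator (with denominator n) of normal data as the estimator W:

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n, reps = 0.0, 2.0, 15, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
w = samples.var(axis=1, ddof=0)          # biased estimator of sigma^2

mse = np.mean((w - sigma**2) ** 2)       # E[(W - theta)^2]
var_w = w.var()                          # Var[W]
bias_sq = (w.mean() - sigma**2) ** 2     # (E[W] - theta)^2
print(mse, var_w + bias_sq)              # the two should nearly agree
```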


Quality of estimators: Bias and variance

In the multivariate case where θ is a vector of parameters, the variance of the estimator is a covariance matrix Cov(θ̂). An estimator with low variance (covariance) is called an efficient estimator (in the sense that little data is needed). The MSE therefore often involves a trade-off between an unbiased estimator with higher variance and a biased but more efficient estimator.

The true value of the MSE can often not be determined since the true value of θ is not known. Therefore, we focus on the variance of the estimator in order to describe the quality of the estimator.

Definition 6. An estimator θ̂ is called consistent when θ̂ →p θ, where →p denotes convergence in probability. An unbiased estimator whose variance tends to zero as n → ∞ is also consistent.


Quality of estimators: Normal distribution example

When we have X1, X2, ... i.i.d. data from a N(µ, σ²) distribution and use the sample mean X̄ and sample variance S² as estimators:

• E[X̄] = µ and therefore X̄ is unbiased

• E[S²] = σ² and therefore S² is unbiased

• E[(X̄ − µ)²] = Var[X̄] = σ²/n

• E[(S² − σ²)²] = Var[S²] = 2σ⁴/(n−1)

The MSE of X̄ is still σ²/n when the data is not normal, but this does not hold for the MSE of S² when the data is not normally distributed.
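A simulation sketch of the two variance formulas above, assuming NumPy and made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma, n, reps = 1.0, 2.0, 25, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
x_bars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)

print(x_bars.var(), sigma**2 / n)           # Var[X_bar] ~ sigma^2 / n
print(s2s.var(), 2 * sigma**4 / (n - 1))    # Var[S^2]   ~ 2 sigma^4 / (n-1)
```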


Quality of estimators: Cramer-Rao bound for the variance

Definition 7. The Fisher information matrix J is defined as

J_{i,j} = \frac{1}{n} \, E_{\theta}\!\left[ \frac{\partial}{\partial \theta_i} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) \cdot \frac{\partial}{\partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) \right],

which is known as the outer product form. Under certain regularity conditions and when the log-likelihood function is C², it can be calculated as

J_{i,j} = -\frac{1}{n} \, E_{\theta}\!\left[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) \right],

which is called the inner product form. Note that the expectation is conditional on θ.
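A numerical sketch, assuming NumPy and made-up parameters: for a normal model with known σ, the (per-observation) Fisher information of the mean is 1/σ², which can be approximated from the curvature of the log-likelihood with a finite difference:

```python
import numpy as np

rng = np.random.default_rng(10)
mu, sigma, n = 1.0, 2.0, 100_000
x = rng.normal(mu, sigma, size=n)

def log_lik(m):
    return -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - m) ** 2 / sigma**2)

# Second derivative of l at mu via a central finite difference.
h = 1e-4
d2l = (log_lik(mu + h) - 2 * log_lik(mu) + log_lik(mu - h)) / h**2

print(-d2l / n, 1 / sigma**2)   # both ~ 1 / sigma^2
```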


The Cramer-Rao bound states the following:

Theorem 3. The covariance of an unbiased estimator W is bounded by

\mathrm{Cov}(W) \geq \frac{J^{-1}}{N},

where N is the number of observations. This bound can also be used to make a worst-case approximation of the efficiency of an estimator.

The Fisher information matrix allows us to compute the uncertainty and thus the quality of an estimator.


Quality of estimators

MLE is the main method for finding estimators, since it has the following properties:

• Consistency: the estimator converges in probability to the value being estimated.

• Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix.

• Efficiency: it achieves the Cramer-Rao lower bound when the sample size tends to infinity. This means that no asymptotically unbiased estimator has a lower asymptotic mean squared error than the MLE.

• The estimate of θ is approximately distributed as θ̂ ∼ N(θ_ML, J⁻¹/N).

• Barrett's Theorem: "The maximum-likelihood procedure in any problem is what you are most likely to do if you don't know any statistics."
