cs433 modeling and simulation lecture 05 statistical analysis tools

CS433 Modeling and Simulation

Lecture 05

Statistical Analysis Tools


Dr. Anis Koubâa

Al-Imam Mohammad Ibn Saud University

http://10.2.230.10:4040/akoubaa/cs433/

09 Nov 2008

Goals of Today

Know how to compare two distributions

Know how to evaluate the relationship between two random variables

Outline

• Comparing Distributions: Tests for Goodness-of-Fit
  • Chi-Square Distribution (for discrete models: PMF)
  • Kolmogorov-Smirnov Test (for continuous models: CDF)

• Evaluating the relationship
  • Linear Regression
  • Correlation

Statistical tests enable us to compare two distributions; this comparison is also known as goodness-of-fit testing. The goodness-of-fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question.

Goodness-of-fit (Arabic: جودة التوفيق)

Pearson's chi-square test compares the probability mass functions of two distributions. If the difference value (error) is greater than the critical value, the two distributions are said to be different, that is, the first distribution does not fit the second distribution well. If the difference is smaller than the critical value, the first distribution fits the second distribution well.

Pearson's χ² Tests: Chi-Square Tests for Discrete Models

(Pearson's) Chi-Square Test

Pearson's chi-square is used to assess two types of comparison:

• Tests of goodness of fit: it establishes whether or not an observed frequency distribution differs from a theoretical distribution.

• Tests of independence: it assesses whether paired observations on two variables are independent of each other. For example, whether people from different regions differ in the frequency with which they report that they support a political candidate.

If the chi-square probability is less than or equal to 0.05, we reject the hypothesis that both distributions are equal (goodness-of-fit) or that the row variable is unrelated (that is, only randomly related) to the column variable (test of independence). A probability greater than 0.05 means the data give no evidence against that hypothesis.

Chi-Square Distribution
http://en.wikipedia.org/wiki/Chi-square_distribution


The chi-square test, in general, can be used to check whether an empirical distribution follows a specific theoretical distribution.

Chi-square is calculated by finding the difference between each observed (O) and theoretical or expected (E) frequency for each possible outcome, squaring them, dividing each by the theoretical frequency, and taking the sum of the results.

For n possible outcomes, the chi-square statistic is defined as:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where

O_i = the observed frequency for outcome i;

E_i = the expected (theoretical) frequency for outcome i;

n = the number of possible outcomes of each event.
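The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_statistic` and the die-roll counts are assumptions made for the example, not data from the lecture.

```python
def chi_square_statistic(observed, expected):
    """Return sum((O_i - E_i)^2 / E_i) over all possible outcomes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: a six-sided die rolled 60 times (illustrative data only).
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6  # a fair die gives 60 / 6 = 10 per face

chi2 = chi_square_statistic(observed, expected)
print(chi2)  # (4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.0
```

A small value such as this one indicates that the observed counts are close to the expected ones; how small is "small enough" is decided by the critical value.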


Chi-Square test: General Algorithm

A chi-square probability (p-value) of 0.05 or less is the usual criterion for rejecting the hypothesis that the empirical distribution follows the theoretical distribution.

We say that the observed (empirical) distribution fits the expected (theoretical) distribution well if:

$$\chi^2_0 \le \chi^{2*}_{\alpha,\; k-1-c}$$

which means

$$p = 1 - P\left(\chi^2_{k-1-c} \le \chi^2_0\right) \ge \alpha$$

where

$$\chi^{2*}_{critical} = \chi^{2*}_{\alpha,\; k-1-c} = \mathrm{idfChiSquare}(1-\alpha,\; k-1-c)$$

• χ²_0 is the chi-square statistic computed from the observations

• (k – 1 – c) is the number of degrees of freedom, where k is the number of possible outcomes and c is the number of estimated parameters

• 1 – α is the confidence level (typically, we use α = 0.05)

http://en.wikipedia.org/wiki/Inverse-chi-square_distribution

Chi-Square test: Example

Uniform distribution in [0 .. 9]

PASS
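A minimal sketch of the whole test for the uniform-digits case, under stated assumptions: the 100 digit counts below are hypothetical (the slide's own data is not reproduced here), and the critical value 16.919 is the standard table value of the chi-square distribution for α = 0.05 with 9 degrees of freedom.

```python
# Hypothetical counts of the digits 0..9 among 100 generated digits
# (illustrative data, not the data behind the slide).
observed = [7, 12, 9, 11, 10, 8, 13, 9, 11, 10]

k = len(observed)                  # number of possible outcomes
c = 0                              # no parameter was estimated from the data
total = sum(observed)
expected = [total / k] * k         # uniform model: same frequency for each digit

chi2_0 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = k - 1 - c     # 10 - 1 - 0 = 9

# Table value for alpha = 0.05 and 9 degrees of freedom.
# If SciPy is available, scipy.stats.chi2.ppf(0.95, 9) gives the same number.
CHI2_CRITICAL_005_DF9 = 16.919

verdict = "PASS" if chi2_0 <= CHI2_CRITICAL_005_DF9 else "FAIL"
print(f"chi2_0 = {chi2_0:.3f}, df = {degrees_of_freedom}, verdict = {verdict}")
```

With these counts the statistic is 3.0, well below the critical value, so the uniform model is not rejected.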

In statistics, the Kolmogorov–Smirnov test (K–S test) quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the expected distribution, or between the empirical distribution functions of two samples. It can be used for both continuous and discrete models.

Basic idea: compute the maximum distance between two cumulative distribution functions and compare it to a critical value.

If the maximum distance is smaller than the critical value, the first distribution fits the second distribution

If the maximum distance is greater than the critical value, the first distribution does not fit the second distribution

(KS-Test) Kolmogorov–Smirnov Test for Continuous Models

Kolmogorov – Smirnov test

In statistics, the Kolmogorov–Smirnov test is used to determine whether two one-dimensional probability distributions differ, or whether a probability distribution differs from a hypothesized distribution, in either case based on finite samples.

The Kolmogorov-Smirnov test statistic measures the largest vertical distance between an empirical cdf calculated from a data set and a theoretical cdf.

The one-sample KS-test compares the empirical distribution function with a cumulative distribution function.

The main applications are testing goodness-of-fit with the normal and uniform distributions.

Kolmogorov–Smirnov Statistic

Let X1, X2, …, Xn be iid random variables in ℝ with the CDF equal to F(x).

The empirical distribution function Fn(x) based on sample X1, X2, …, Xn is a step function defined by:

$$F_n(x) = \frac{\text{number of elements in the sample} \le x}{n} = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$$

where I(A) is the indicator of event A:

$$I(X_i \le x) = \begin{cases} 1 & \text{if } X_i \le x \\ 0 & \text{otherwise} \end{cases}$$

The Kolmogorov-Smirnov test statistic for a given function F(x) is

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
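The statistic can be computed directly from a sorted sample. The sketch below is an illustration under assumptions: the names `ks_statistic` and `uniform_cdf` and the four-point sample are made up for the example. Because Fn is a step function, the supremum has to be checked on both sides of every jump.

```python
def ks_statistic(sample, cdf):
    """Return D_n = sup_x |F_n(x) - F(x)| for a sample and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d_n = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_n jumps from (i - 1)/n to i/n at x, so compare F(x) with both levels.
        d_n = max(d_n, i / n - f, f - (i - 1) / n)
    return d_n


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


# Hypothetical sample (illustrative data only).
sample = [0.10, 0.35, 0.40, 0.80]
d_n = ks_statistic(sample, uniform_cdf)
print(round(d_n, 4))  # largest gap is at x = 0.40: 3/4 - 0.40 = 0.35
```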

Kolmogorov–Smirnov Statistic

The Kolmogorov-Smirnov test statistic for a given function F(x) is

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

In the figure, this largest vertical distance between the two curves is labeled Dmax.

Facts

By the Glivenko-Cantelli theorem, if the sample comes from a distribution F(x), then Dn converges to 0 almost surely.

In other words, if X1, X2, …, Xn really come from the distribution with CDF F(x), the distance Dn should be small.

Example

Example: Grade Distribution?

We would like to know the distribution of the grades of students.

• First, determine the empirical distribution

• Second, compare it to the Normal and Poisson distributions

Data sample: 50 grades in a course, from which we computed the empirical distribution:

• Mean = 63

• Standard Deviation = 15

Example: Grade Distribution?

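$$\text{Frequency}(X_{grade}) = \text{Number of grades} \le X_{grade}$$

$$\text{Empirical Distribution} = F(X_{grade}) = \frac{\text{Frequency}(X_{grade})}{\text{Sample Size}} = p(X \le X_{grade})$$

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$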

Example: Grade Distribution?

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

Dmax,Poisson = 0.153

Dmax,Normal = 0.119

Kolmogorov–Smirnov Acceptance Criteria

Rejection Criteria: We consider that the two distributions are not equal if the empirical CDF is too far from the theoretical CDF of the proposed distribution.

This means: we reject if Dn is too large. But the question is: what does large mean?

For which values of Dn should we accept the distribution?

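Kolmogorov–Smirnov test
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test

In the 1930's, Kolmogorov and Smirnov showed that

$$\lim_{n \to \infty} P\left(\sqrt{n}\, D_n \le t\right) = 1 - 2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2}$$

So, for large sample sizes, you could assume

$$P\left(\sqrt{n}\, D_n \le t\right) \approx 1 - 2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2}$$

α-level test: find the value of t such that

$$2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2} = \alpha$$

So, the test is accepted if

$$D_n \le \frac{t}{\sqrt{n}}$$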
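The value of t for a given α has no closed form, but it is easy to find numerically. The following is a sketch only: the function names, the truncation of the series at 100 terms, and the bisection bounds are choices made for this example.

```python
import math


def kolmogorov_tail(t, terms=100):
    """Return 2 * sum_{i=1..terms} (-1)^(i-1) * exp(-2 i^2 t^2).

    For large n this approximates P(sqrt(n) * D_n > t).
    """
    return 2.0 * sum(
        (-1) ** (i - 1) * math.exp(-2.0 * i * i * t * t)
        for i in range(1, terms + 1)
    )


def critical_t(alpha, lo=0.3, hi=3.0, tol=1e-10):
    """Solve kolmogorov_tail(t) = alpha by bisection (the tail decreases with t)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kolmogorov_tail(mid) > alpha:
            lo = mid   # tail still too large: t must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


t_05 = critical_t(0.05)
print(f"t for alpha = 0.05: {t_05:.4f}")
```

For α = 0.05 this gives t ≈ 1.358, which agrees with the constant 1.3581 used for the large-sample critical value D_critical = t/√n.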

Critical value

For small samples, people have worked out and tabulated critical values, but there is no nice closed form solution.

• J. Pomeranz (1973)

• J . Durbin (1968)

For large samples, good approximations for n > 40:

α                 0.01        0.02        0.05        0.10        0.20
critical value    1.6276/√n   1.5174/√n   1.3581/√n   1.2239/√n   1.0730/√n

Kolmogorov–Smirnov test

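Example: Grade Distribution?

For our example, we have n = 50. The critical value for α = 0.05 is

$$D_{critical} = \frac{1.3581}{\sqrt{50}} = 0.192$$

$$D_{max,Normal} = 0.119 < D_{critical} = 0.192 \quad \Rightarrow \quad \text{ACCEPT}$$

$$D_{max,Poisson} = 0.153 < D_{critical} = 0.192 \quad \Rightarrow \quad \text{ACCEPT}$$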

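Example: Grade Distribution?

If we get the same distribution for n = 100, the critical value for α = 0.05 is

$$D_{critical} = \frac{1.3581}{\sqrt{100}} = 0.1358$$

$$D_{max,Normal} = 0.119 < D_{critical} = 0.1358 \quad \Rightarrow \quad \text{ACCEPT}$$

$$D_{max,Poisson} = 0.153 > D_{critical} = 0.1358 \quad \Rightarrow \quad \text{REJECT}$$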
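The two decisions for n = 50 and n = 100 can be reproduced with a short script. This is a sketch: the dictionary `C_ALPHA` and the function `ks_decision` are names introduced here, holding the large-sample constants for n > 40 and the Dmax values quoted in the example.

```python
import math

# Large-sample constants c(alpha); the critical value is c(alpha) / sqrt(n), n > 40.
C_ALPHA = {0.01: 1.6276, 0.02: 1.5174, 0.05: 1.3581, 0.10: 1.2239, 0.20: 1.0730}


def ks_decision(d_max, n, alpha=0.05):
    """Return (critical value, "ACCEPT" or "REJECT") for an observed D_max."""
    d_critical = C_ALPHA[alpha] / math.sqrt(n)
    return d_critical, ("ACCEPT" if d_max <= d_critical else "REJECT")


for n in (50, 100):
    for name, d_max in (("Normal", 0.119), ("Poisson", 0.153)):
        d_critical, verdict = ks_decision(d_max, n)
        print(f"n={n:3d} {name:8s} Dmax={d_max:.3f} Dcritical={d_critical:.4f} {verdict}")
```

The same Dmax leads to a different decision when n grows, because the critical value shrinks like 1/√n: a larger sample makes the test stricter.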

In statistics, linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called dependent variable, is modeled by a least squares function, called linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients.

A linear regression equation with one independent variable represents a straight line. The results are subject to statistical analysis.

Linear Regression: Least Squares Method
http://en.wikipedia.org/wiki/Linear_regression

The Method of Least Squares

The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).

We choose our estimates a and b of the true intercept α and slope β so that the sum of the squared vertical distances of the points from the line is minimized.

Best fitting line:

$$\hat{y} = a + bx$$

Choose a and b to minimize

$$SSE = \sum (y - \hat{y})^2 = \sum (y - a - bx)^2$$

SSE: Sum of Squared Errors
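A step left implicit here is how minimizing SSE leads to closed-form estimators. The derivation below is a standard calculus sketch; the shorthand S_xx and S_xy is defined inside the block, and all sums run over the n data pairs.

```latex
% Set both partial derivatives of SSE = \sum (y_i - a - b x_i)^2 to zero.
\frac{\partial SSE}{\partial a} = -2 \sum (y_i - a - b x_i) = 0
  \quad\Longrightarrow\quad a = \bar{y} - b\,\bar{x}

\frac{\partial SSE}{\partial b} = -2 \sum x_i (y_i - a - b x_i) = 0
  \quad\Longrightarrow\quad \sum x_i y_i - a \sum x_i - b \sum x_i^2 = 0

% Substitute a = \bar{y} - b\,\bar{x} into the second equation and collect b:
\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}
  = b \left( \sum x_i^2 - \frac{(\sum x_i)^2}{n} \right)
  \quad\Longrightarrow\quad b = \frac{S_{xy}}{S_{xx}}

% with the shorthand
S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}, \qquad
S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}
```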

Least Squares Estimators

Calculate the sums of squares:

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$

$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

Best fitting line:

$$\hat{y} = a + bx \quad \text{where} \quad b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\,\bar{x}$$

Example

The table shows the math achievement test scores for a random sample of n = 10 college freshmen, along with their final calculus grades.

Student             1   2   3   4   5   6   7   8   9   10
Math test, x        39  43  21  64  57  47  28  75  34  52
Calculus grade, y   65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares.


$$\sum x = 460 \qquad \sum y = 760$$

$$\sum x^2 = 23634 \qquad \sum y^2 = 59816$$

$$\sum xy = 36854$$

$$\bar{x} = 46 \qquad \bar{y} = 76$$

$$S_{xx} = 23634 - \frac{(460)^2}{10} = 2474$$

$$S_{yy} = 59816 - \frac{(760)^2}{10} = 2056$$

$$S_{xy} = 36854 - \frac{(460)(760)}{10} = 1894$$

$$b = \frac{1894}{2474} = .76556 \quad \text{and} \quad a = 76 - .76556(46) = 40.78$$

Best fitting line:

$$\hat{y} = 40.78 + .77x$$
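The hand calculation for the math test and calculus grade data can be checked with a few lines of Python. This is a sketch: the function name `least_squares_line` is an assumption for the example, and the data are the ten (x, y) pairs from the table.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the best fitting line y_hat = a + b * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx                   # slope
    a = sum_y / n - b * sum_x / n     # intercept: y_bar - b * x_bar
    return a, b


math_test = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
calculus = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

a, b = least_squares_line(math_test, calculus)
print(f"y_hat = {a:.2f} + {b:.5f} x")  # y_hat = 40.78 + 0.76556 x
```

The fitted line can then be used for prediction, for example `a + b * 50` for a student who scored 50 on the math test.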

In probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables.

In general statistical usage, correlation or co-relation refers to the departure of two variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of the data.

Correlation Analysis


• The strength of the relationship between x and y is measured using the coefficient of correlation:

Correlation coefficient:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

• The sign of r indicates the direction of the relationship;

• r near 0 indicates no linear relationship,

• r near 1 or -1 indicates a strong linear relationship.

• A test of the significance of the correlation coefficient is identical to the test of the slope β.

Example

The table shows the heights and weights of n = 10 randomly selected college football players.

Player      1    2    3    4    5    6    7    8    9    10
Height, x   73   71   75   72   72   75   67   69   71   69
Weight, y   185  175  200  210  190  195  150  170  180  175



$$S_{xy} = 328 \qquad S_{xx} = 60.4 \qquad S_{yy} = 2610$$

$$r = \frac{328}{\sqrt{(60.4)(2610)}} = .8261$$
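The correlation coefficient for the football players can be recomputed from the raw table. The function name `correlation` is an assumption made for this sketch; the data are the ten height and weight pairs listed above.

```python
import math


def correlation(xs, ys):
    """Return r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

r = correlation(heights, weights)
print(f"r = {r:.4f}")  # r = 0.8261
```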

Football Players

[Figure: Scatterplot of Weight vs Height (x-axis: Height, 66 to 75; y-axis: Weight, 150 to 210)]

r = .8261

Strong positive correlation

As the player's height increases, so does his weight.

Some Correlation Patterns

• Use the Exploring Correlation applet to explore some correlation patterns:

  • r = 0; No correlation

  • r = .931; Strong positive correlation

  • r = 1; Linear relationship

  • r = -.67; Weaker negative correlation
