Session III. Introduction to Probability and Testing for Goodness of Fit:
Or An IDEA about Probability and Testing!
(Zar: Chapters 5, 22)
What is Probability?
(1) Not formally defined, much like a point is not defined in geometry; but:
(2) Probability is a measure of the “chance” that an “event” will happen.
(a) “What’s the probability it will rain?” (subjective)
(b) “What’s the probability that the coin will be ‘heads’ when I flip it?” (objective)
(3) Measured between 0 and 100%, or between 0 and 1.
If “A” is the event, then P(A) is the probability of A.
How do you get a Value for P(A)?
The estimate of P(A) is

    P̂(A) = (# of times "A" happens) / (# of total tries)

Let each try give x_i = 1 if a hit (A) and x_i = 0 if a miss (not A), for i = 1, ..., n. Then

    P̂(A) = (x_1 + x_2 + ... + x_n) / n  →  P(A)
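The relative-frequency estimate above is easy to simulate; a minimal Python sketch (not part of the original slides; names and the fair-coin setup are mine):

```python
import random

random.seed(1)                                   # reproducible demo
n = 10_000                                       # number of tries
# x_i = 1 if "A" happens (a hit), 0 if not (a miss); here A = "heads"
x = [random.randint(0, 1) for _ in range(n)]
p_hat = sum(x) / n                               # (# of hits) / (# of tries)
```

As n grows, p_hat settles near the true P(A) = 0.5.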
What are “Odds”?

    O(A) = (# of times "A" happens) / (# of times "A" doesn't happen)
         = (x_1 + x_2 + ... + x_n) / (n - (x_1 + x_2 + ... + x_n))
         = P̂(A) / P̂(~A)
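A quick numeric check of the odds identity, with hypothetical counts (30 hits in 40 tries; the numbers are mine, not from the slides):

```python
# O(A) = (# times A happens) / (# times A doesn't) = P-hat(A) / P-hat(~A)
hits, n = 30, 40
p_hat = hits / n                       # 0.75
odds = hits / (n - hits)               # 30 / 10 = 3.0
assert odds == p_hat / (1 - p_hat)     # same value either way
```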
Events
Disjoint or Mutually Exclusive: events A and B that share no outcomes of the universe.
Independent: when one event does not affect another.
Ex: Coin flips; random selection from an infinite set, or selection with replacement from a finite set.
Dependent: when one event affects another.
Ex: Checker flips; random selection (without replacement) from a finite set.
Ex: Colored balls in a bag.
Joint outcomes
Independent? Multiply!

    P(H, H) = P(H) × P(H) = 1/2 × 1/2 = 1/4

                     Second Toss
                      H        T
First Toss    H    (H,H)    (H,T)
              T    (T,H)    (T,T)

What is the chance of each outcome? 1/4 each.
If Mutually Exclusive? Add!

    P((H,H) or (T,T)) = 1/4 + 1/4 = 1/2
    P(1 head) = P((H,T) or (T,H)) = 1/2
    P(at least 1 head) = P(1 head) + P(2 heads) = 3/4

The 1st and 2nd toss are independent! The outcomes (H,H), (H,T), (T,H), and (T,T) are mutually exclusive!
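The multiply-then-add bookkeeping above can be verified by enumerating all four outcomes (a sketch with exact fractions; variable names are mine):

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
# independent tosses: multiply probabilities along each joint outcome
outcomes = {pair: half * half for pair in product("HT", repeat=2)}
# mutually exclusive outcomes: add probabilities
p_one_head = outcomes[("H", "T")] + outcomes[("T", "H")]
p_at_least_one_head = p_one_head + outcomes[("H", "H")]
```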
A Simple Hypothesis
The “Binomial” -- any experiment with just two outcomes.
EX: Flower color. Suppose only yellow or green flowers.
Parents: (Y Y) × (g g)
1st cross: (Y g) × (Y g)
2nd cross: (Y Y) (Y g) (g Y) (g g)
Probability: 1/4 1/4 1/4 1/4
Hypothesis: Yellow is Dominant.
Result: yellow (Y Y, Y g, g Y) vs. green (g g)
Probability: 3/4 1/4
THE EXPERIMENT:
Select 100 flowers at random:

            Yellow   Green
Result:       84       16
Expected:     75       25     (3/4 × 100) (1/4 × 100) under H0

The Problem:
Is (84, 16) consistent with the hypothesis? Does (84, 16) support a probability of 75%, 25%?
Answer, in the form of a question:
What’s the probability that (84, 16) could come from a true population of 75%, 25%?
H0: p = 75%
The Binomial Distribution: Take it Slow!

n = 1:  P(Y) = .75    P(g) = .25

n = 2:  P(YY) = P(Y) P(Y) = .75 × .75 = .5625
        P(Yg) = P(gY) = P(Y) P(g) = .75 × .25 = .1875
        P(gg) = P(g) P(g) = .25 × .25 = .0625

BUT…..
P(YY) + P(Yg) + P(gg) = .8125 ≠ 1
What’s wrong? We have the possibility of both Yg and gY!
It should be P(YY) + P(Yg) + P(gY) + P(gg) = 1, which is:

  # ways   Probability
  (1)      P(YY) = .75²
  (2)      P(Yg) = P(gY) = .75 × .25
  (1)      P(gg) = .25²

Ex: P(at least one Y) = P(YY) + P(gY) + P(Yg) = .9375

n = 3:
  (1)  P(YYY) = .75³
  (3)  P(YYg) = P(YgY) = P(gYY) = .75² × .25
  (3)  P(Ygg) = P(gYg) = P(ggY) = .75 × .25²
  (1)  P(ggg) = .25³
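The hand counts for n = 2 and n = 3 can be checked in a few lines (a sketch; variable names are mine):

```python
p_y, p_g = 0.75, 0.25            # P(Y) and P(g) under the hypothesis

# n = 2: multiply along each ordering, then count the orderings
p_yy = p_y * p_y                 # one way:  YY
p_yg = 2 * p_y * p_g             # two ways: Yg and gY
p_gg = p_g * p_g                 # one way:  gg
assert abs(p_yy + p_yg + p_gg - 1) < 1e-12    # now everything adds to 1

# n = 3: the "# ways" counts are 1, 3, 3, 1
p3 = [1 * p_y**3, 3 * p_y**2 * p_g, 3 * p_y * p_g**2, 1 * p_g**3]
assert abs(sum(p3) - 1) < 1e-12
```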
The Binomial distribution (cont.)
In general, for x = 0, 1, ..., n:

    P(x Y's, (n-x) g's) = [# ways to arrange x Y's and (n-x) g's] × P(Y Y ... Y g g ... g)
                        = C(n, x) p^x (1-p)^(n-x)

where

    C(n, x) = n! / (x! (n-x)!)   and   n! = n(n-1)(n-2)...(2)(1)

The counts C(n, x) can also be read off Pascal’s Triangle:

    1
    1  1
    1  2  1
    1  3  3  1
    1  4  6  4  1
    1  5 10 10  5  1
    1  6 15 20 15  6  1
    1  7 21 35 35 21  7  1
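Pascal’s Triangle and the factorial formula give the same counts; a small check (a sketch, not part of the slides; `math.comb` is Python’s built-in n-choose-x):

```python
from math import comb

def pascal_row(n):
    """Build row n of Pascal's triangle by the additive recurrence."""
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

# the counting coefficients match n! / (x! (n-x)!)
for n in range(8):
    assert pascal_row(n) == [comb(n, x) for x in range(n + 1)]
```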
Calculate the Expectation (mean) of a binomial:

    E[Bin(n, p)] = Σ_{x=0}^{n} x C(n,x) p^x q^(n-x)
                 = Σ_{x=1}^{n} x [n! / (x!(n-x)!)] p^x q^(n-x)
                 = np Σ_{x=1}^{n} [(n-1)! / ((x-1)!((n-1)-(x-1))!)] p^(x-1) q^((n-1)-(x-1))
                 = np (p + q)^(n-1) = np · 1^(n-1) = np
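The identity E[Bin(n, p)] = np can be confirmed by summing directly over the density (a sketch; the function name is mine):

```python
from math import comb

def binom_mean(n, p):
    """E[X] for X ~ Bin(n, p), summed directly over the density."""
    q = 1 - p
    return sum(x * comb(n, x) * p**x * q**(n - x) for x in range(n + 1))
```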
For n = 100, P(g) = .25:

 x    P(x g's)     P(≤ x g's)
 0    0.00000000   0.00000000
 1    0.00000000   0.00000000
 2    0.00000000   0.00000000
 3    0.00000000   0.00000000
 4    0.00000002   0.00000002
 5    0.00000010   0.00000012
 6    0.00000052   0.00000064
 7    0.00000235   0.00000299
 8    0.00000910   0.00001209
 9    0.00003100   0.00004308
10    0.00009402   0.00013710
11    0.00025642   0.00039352
12    0.00063392   0.00102744
13    0.00143038   0.00245782
14    0.00296294   0.00542076
15    0.00566251   0.01108327
16    0.01002735   0.02111062
17    0.01651564   0.03762627
18    0.02538515   0.06301142
19    0.03651899   0.09953041
20    0.04930064   0.14883105
21    0.06260399   0.21143505
22    0.07493508   0.28637013
23    0.08470922   0.37107936
24    0.09059180   0.46167117
25    0.09179969   0.55347085
26    0.08826894   0.64173979
27    0.08064076   0.72238052
28    0.07008065   0.79246116
29    0.05799779   0.85045892
30    0.04575381   0.89621270
31    0.03443835   0.93065107
32    0.02475256   0.95540363
33    0.01700176   0.97240537
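The cumulative column can be reproduced by summing the binomial density (a sketch; the function name is mine):

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Bin(n, p), by direct summation."""
    q = 1 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(x + 1))

# chance of 16 or fewer green flowers out of 100 when P(green) = 1/4
p_value = binom_cdf(16, 100, 0.25)   # matches the 0.0211 in the table
```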
Conclusions:
Chance of (84,16) = chance of getting (84,16) or anything “rarer” than (84,16)
= P(84,16) + P(85,15) + P(86,14) + ... + P(100,0) = .0211
What is rare enough?
Biomedical Convention: .05 or 5%
RULE: If the experiment is rarer than the cutoff level, say that the experiment is not consistent with the hypothesis!
If less rare, say it is consistent!
Other Cutoffs:
Lower: .01 or 1% (1/100), or .001 or .1% (1/1000), for situations needing a lower error rate.
Example: the Bruston Explosive Bolt.
Or higher:
Example: Physiological studies. Example: Secondary mets. in pediatric leukemia study.
Conclusion:
No one cutoff works in every situation. The cutoff should be set beforehand to avoid bias.
What is the cutoff? What is the p-value?
Decide against H0 when

    Pr[X ≤ x | H0] ≤ cutoff probability

where X ranges over the values in the universe of the statistic under the H0, and x is the statistic from the experiment.
If the probability statement is true, then decide that the experiment is not consistent with the hypothesis.
But there’s still a chance the experiment came from the H0!
                   Decide from Experiment
                   H0             HA
Actual    H0       No Error       Type I (α)
Truth     HA       Type II (β)    No Error

Type I error  = α = Pr[decide ~H0 | H0 is true]
Type II error = β = Pr[decide H0 | H0 is not true]
Three numbers (α, β, and the cutoff): if you have one, you have all, given the test.
Summary of the Binomial
• Density function:

    b(i; n, p) = C(n, i) p^i q^(n-i),   where q = 1 - p

• Distribution function:

    B(x; n, p) = Σ_{i=0}^{x} C(n, i) p^i q^(n-i)

• Mean: μ = np
• Variance: σ² = npq
• Sample estimates:

    x̄ = Σ x_i / n,   s² = Σ (x_i - x̄)² / (n - 1),   s = (s²)^(1/2)
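A direct check that the density sums to 1 with mean np and variance npq (a sketch; the function name is mine):

```python
from math import comb

def binom_moments(n, p):
    """Mean and variance of Bin(n, p), summed directly over the density."""
    q = 1 - p
    pmf = [comb(n, i) * p**i * q**(n - i) for i in range(n + 1)]
    mean = sum(i * w for i, w in enumerate(pmf))
    var = sum((i - mean) ** 2 * w for i, w in enumerate(pmf))
    return mean, var

# for n = 100, p = .25: np = 25 and npq = 18.75
mean, var = binom_moments(100, 0.25)
```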
Another Way to Look at Flowers:
H0: Yellow is dominant    HA: Yellow is not dominant

                    Y      g     Total
Observed:          84     16      100
Expected percent:  75%    25%
Expected number:   75     25      100    (n × proportion)

Chi-square:

    χ² = Σ_{i=1}^{#terms} (observed_i - expected_i)² / expected_i

Degrees of freedom = #terms - 1

For the (84, 16) example:

    χ² = (84 - 75)²/75 + (16 - 25)²/25 = 1.08 + 3.24 = 4.32

Degrees of freedom = 2 - 1 = 1
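The arithmetic above in two lines of Python (a sketch, not part of the slides):

```python
observed = [84, 16]
expected = [75, 25]            # 100 * (3/4) and 100 * (1/4) under H0
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# terms: 81/75 = 1.08 and 81/25 = 3.24, so chi2 = 4.32 with 2 - 1 = 1 DF
```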
So how extreme is 4.32?
[Figures: the chi-square density and cumulative distribution plotted for X from 0 to 10, and zoomed in from 3.0 to 8.0. The cutoff divides the axis into “Support H0” (below the cutoff) and “Reject H0” (above it).]
Table B.1 (DF = 1):

    p-value      x
    0.01       6.635
    0.025      5.024
    0.05       3.841

Our statistic 4.32 falls between 5.024 and 3.841, so 0.025 < p < 0.05.

Table B.1 at p-value 0.05, by DF:

    DF      x
     1    3.841
     4    9.488
    10   18.307
    20   31.410
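For DF = 1 the table can be bypassed: a chi-square(1) variate is a squared standard normal, so the upper-tail probability is erfc(√(x/2)). A sketch (the function name is mine):

```python
from math import erfc, sqrt

def chi2_sf_df1(x):
    """Upper tail of chi-square with 1 DF.

    X = Z^2 for Z standard normal, so
    P(X > x) = 2 P(Z > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(x / 2))

p = chi2_sf_df1(4.32)   # between 0.025 and 0.05, as Table B.1 indicates
```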
Another Example: More than 2 Groups
Color: green & yellow. Texture: smooth & wrinkled.
Hypothesis: Y is dominant; S is dominant; color and texture are independent.
Pr(any cell) = 1/16

                            Color
                (Y Y)      (g Y)      (Y g)      (g g)
Texture (S S)  (YY,SS)    (gY,SS)    (Yg,SS)    (gg,SS)
        (S w)  (YY,Sw)    (gY,Sw)    (Yg,Sw)    (gg,Sw)
        (w S)  (YY,wS)    (gY,wS)    (Yg,wS)    (gg,wS)
        (w w)  (YY,ww)    (gY,ww)    (Yg,ww)    (gg,ww)

H0: 9 : 3 : 3 : 1 (out of 16)

            YS        Yw        gS        gw      Total
Obs:        152       39        53        6        250
Pr(H0):     9/16      3/16      3/16      1/16
           (0.5625)  (0.1875)  (0.1875)  (0.0625)
Expected:   140.625   46.875    46.875    15.625

χ² = (152-140.625)²/140.625 + (39-46.875)²/46.875 + (53-46.875)²/46.875 + (6-15.625)²/15.625
   = 0.9201 + 1.3230 + 0.8003 + 5.929 = 8.97

DF = 1 + 1 + 1 + 1 - 1 = 3
χ²_.05(3) = 7.815 < 8.97; (p = .0293 < 0.05) reject H0
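The 9 : 3 : 3 : 1 computation, verified (a sketch, not part of the slides):

```python
observed = [152, 39, 53, 6]
ratios = [9, 3, 3, 1]                      # H0: 9 : 3 : 3 : 1
n = sum(observed)                          # 250 flowers
expected = [n * r / 16 for r in ratios]    # 140.625, 46.875, 46.875, 15.625
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 = 8.97 on 4 - 1 = 3 DF; exceeds 7.815, so reject H0
```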
So, Where’s the Difference? (Subdividing the H0)
It looks like: (1) too few gw; (2) about the right # of the others.
Plan: combine YS + Yw + gS and compare to gw. But first, test (2):

H0: YS, Yw, gS in 9 : 3 : 3
                                              Total
Obs:       152          39          53         244
Pr(H0):    9/15 = .6    3/15 = .2   3/15 = .2
Exp:       146.4        48.8        48.8
χ² = 0.2142 + 1.9680 + 0.3615 = 2.544     DF = 1 + 1 + 1 - 1 = 2
χ²_.05(2) = 5.991 > 2.544: accept H0

H0: Others vs gw in 15 : 1
                              Total
Obs:       244        6        250
Pr(H0):    15/16 = .9375    1/16 = .0625
Expected:  234.375    15.625
χ² = 0.3953 + 5.929 = 6.324     DF = 1 + 1 - 1 = 1
χ²_.05(1) = 3.841 < 6.324: reject H0 and accept “too few gw”
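Both subdivided tests in one helper (a sketch; the function name is mine):

```python
def chi2_stat(observed, probs):
    """Goodness-of-fit chi-square for observed counts vs H0 proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

# 9 : 3 : 3 among the non-gw classes: consistent with H0
within = chi2_stat([152, 39, 53], [9/15, 3/15, 3/15])    # < 5.991, 2 DF
# 15 : 1, others vs gw: rejected
versus = chi2_stat([244, 6], [15/16, 1/16])              # > 3.841, 1 DF
```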
Summary:

    χ² = Σ_{i=1}^{k} (f_i - f̂_i)² / f̂_i

where f_i = the i-th observation and f̂_i = the i-th expected value (under H0).
DF = k - 1
Others:
(1) Continuity Correction (Yates):

    χ²_c = Σ_{i=1}^{k} (|f_i - f̂_i| - 0.5)² / f̂_i

(2) Rule of Thumb for using χ² instead of the Binomial:
If no more than 25% of the expected frequencies f̂_i are < 5, and none is ≤ 1, then use χ².
(3) Log-Likelihood Ratio:

    G = 2 Σ_{i=1}^{k} f_i ln(f_i / f̂_i) = 2 [Σ f_i ln f_i - Σ f_i ln f̂_i]

(related to Entropy and Information)
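Both alternatives applied to the earlier (84, 16) flower data (a sketch, not part of the slides):

```python
from math import log

observed = [84, 16]
expected = [75, 25]

# (1) Yates continuity-corrected chi-square
chi2_c = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))
# (3) log-likelihood ratio statistic
G = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
# both land near the uncorrected chi-square of 4.32 on 1 DF
```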
(4) Heterogeneity Chi-Square
There is often a need to combine chi-square analyses. The common cause is a batch effect, where only a certain number of subjects can be handled at a time (e.g., cages, school classes, laboratories, or clinics), but there is a common hypothesis over all batches (e.g., gender, ethnicity, presence/absence of a marker).
(a) Perform a chi-square on each “batch”.
(b) Pool all batches and do a “pooled chi-square”.
(c) Sum the individual chi-squares (d.f. = sum of the individual batch d.f. = k batches times the d.f. for each batch).
(d) Subtract the pooled chi-square from the sum and test with (k-1) × (individual batch d.f.). This difference is the heterogeneity chi-square.
Ex 22.5: Heterogeneity chi-square analysis (data of G. Mendel, 1933). Expected counts (3:1) in parentheses.

Experiment   Yellow seeds    Green seeds    Total seeds (n)   Chi-square   DF
 1           25 (27.0000)    11 (9.0000)         36             0.5926      1
 2           32 (29.2500)     7 (9.7500)         39             1.0342      1
 3           14 (14.2500)     5 (4.7500)         19             0.0175      1
 4           70 (72.7500)    27 (24.2500)        97             0.4158      1
 5           24 (27.7500)    13 (9.2500)         37             2.0270      1
 6           20 (19.5000)     6 (6.5000)         26             0.0513      1
 7           32 (33.7500)    13 (11.2500)        45             0.3630      1
 8           44 (39.7500)     9 (13.2500)        53             1.8176      1
 9           50 (48.0000)    14 (16.0000)        64             0.3333      1
10           44 (46.5000)    18 (15.5000)        62             0.5376      1

Total of chi-squares:                                           7.1899     10
Chi-square of totals
(i.e., pooled):  355 (358.5000)   123 (119.5000)    478         0.1367      1
Difference, total - pooled
(heterogeneity chi-square):                                     7.0532      9   (0.50 < P < 0.75)
Ex 22.6: Heterogeneity Chi-Square. Expected counts (1:1) in parentheses.

Sample   Right-handed    Left-handed     N    Chi-square   DF
1         3 (7.0000)     11 (7.0000)    14     4.5714*      1
2         4 (8.0000)     12 (8.0000)    16     4.0000*      1
3         5 (10.0000)    15 (10.0000)   20     5.0000*      1
4        14 (9.0000)      4 (9.0000)    18     5.5556*      1
5        13 (8.5000)      4 (8.5000)    17     4.7647*      1
6        17 (11.0000)     5 (11.0000)   22     6.5455*      1

*Statistically significant.

Total of chi-squares:                         30.4372       6
Chi-square of totals
(i.e., pooled):  56 (53.5000)  51 (53.5000)  107   0.2336   1
Difference, total - pooled
(heterogeneity chi-square):                   30.2036*      5   (P < 0.001)
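The four-step recipe, checked against the Ex 22.6 counts (a sketch; the function name is mine):

```python
def chi2_1to1(a, b):
    """Goodness-of-fit chi-square for a 1:1 hypothesis on one batch."""
    e = (a + b) / 2                         # expected count in each class
    return (a - e) ** 2 / e + (b - e) ** 2 / e

# (right-handed, left-handed) counts for the six samples
batches = [(3, 11), (4, 12), (5, 15), (14, 4), (13, 4), (17, 5)]

total = sum(chi2_1to1(a, b) for a, b in batches)       # (c) sum, 6 DF
pooled = chi2_1to1(sum(a for a, _ in batches),         # (b) pooled counts
                   sum(b for _, b in batches))         #     56 vs 51, 1 DF
heterogeneity = total - pooled                         # (d) 5 DF
```

Each batch alone is significant, the pooled counts are not, and the heterogeneity term shows the batches disagree in direction.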
II. SC Exchanges in Lymphocytes

Table 4. Distribution of exchanges between chromosomes.

Chromosome   Total Length   Relative (Proportional) Length   Observed Exchanges
1                18.16             .08712                          266
2                16.90             .08107                          296
3                14.20             .06812                          186
4-5              25.36             .12165                          442
6-12-X           78.06             .37446                          888
13-15            21.00             .10074                          255
16-18            18.28             .08769                          125
19-20             8.50             .04078                           26
21-22-Y           8.00             .03838                           23
Total           208.46            1.00001                         2507

TEST: H0: Exchanges are proportional to the length of the chromosome.
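Under H0 the expected count for each class is total exchanges × relative length, and the usual goodness-of-fit chi-square applies (a sketch; variable names are mine):

```python
# H0: exchanges fall on each chromosome group in proportion to its
# relative length (Table 4)
relative_length = {"1": .08712, "2": .08107, "3": .06812, "4-5": .12165,
                   "6-12-X": .37446, "13-15": .10074, "16-18": .08769,
                   "19-20": .04078, "21-22-Y": .03838}
observed = {"1": 266, "2": 296, "3": 186, "4-5": 442, "6-12-X": 888,
            "13-15": 255, "16-18": 125, "19-20": 26, "21-22-Y": 23}

n = sum(observed.values())                               # 2507 exchanges
chi2 = sum((observed[c] - n * relative_length[c]) ** 2 / (n * relative_length[c])
           for c in relative_length)
# 9 classes -> 8 DF; compare to the .05 critical value 15.507 (Table B.1)
```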
Problem Set 1: