Session III. Introduction to Probability and Testing for Goodness of Fit:
Or An IDEA about Probability and Testing!
(Zar: Chapters 5, 22)
What is Probability?
(1) Not formally defined, much like a point is not defined in geometry; but:
(2) Probability is a measure of the “chance” that an “event” will happen.
(a) “What’s the probability it will rain?” (subjective)
(b) “What’s the probability that the coin will be ‘heads’ when I flip it?” (objective)
(3) Measured between 0 and 100%, or between 0 and 1.
If “A” is the event, then P(A) is the probability of A.
How do you get a Value for P(A)?
The estimate of P(A) is

    P̂(A) = (# of times "A" happens) / (# of total tries)

Let each try give x_i = 1 if a hit (A) and x_i = 0 if a miss (not A), for i = 1, ..., n. Then

    P̂(A) = (x_1 + x_2 + ... + x_n) / n  →  P(A)
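The relative-frequency estimate above is easy to simulate; a minimal Python sketch (not part of the original slides; names and the fair-coin setup are mine):

```python
import random

random.seed(1)                                   # reproducible demo
n = 10_000                                       # number of tries
# x_i = 1 if "A" happens (a hit), 0 if not (a miss); here A = "heads"
x = [random.randint(0, 1) for _ in range(n)]
p_hat = sum(x) / n                               # (# of hits) / (# of tries)
```

As n grows, p_hat settles near the true P(A) = 0.5.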
What are “Odds”?

    O(A) = (# of times "A" happens) / (# of times "A" doesn't happen)
         = (x_1 + x_2 + ... + x_n) / (n - (x_1 + x_2 + ... + x_n))
         = P̂(A) / P̂(~A)
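A quick numeric check of the odds identity, with hypothetical counts (30 hits in 40 tries; the numbers are mine, not from the slides):

```python
# O(A) = (# times A happens) / (# times A doesn't) = P-hat(A) / P-hat(~A)
hits, n = 30, 40
p_hat = hits / n                       # 0.75
odds = hits / (n - hits)               # 30 / 10 = 3.0
assert odds == p_hat / (1 - p_hat)     # same value either way
```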
Events
Disjoint or Mutually Exclusive: events A and B that share no outcomes of the universe.
Independent: when one event does not affect another.
Ex: Coin flips; random selection from an infinite set, or selection with replacement from a finite set.
Dependent: when one event affects another.
Ex: Checker flips; random selection (without replacement) from a finite set.
Ex: Colored balls in a bag.
Joint outcomes
Independent? Multiply!

    P(H, H) = P(H) × P(H) = 1/2 × 1/2 = 1/4

                     Second Toss
                      H        T
First Toss    H    (H,H)    (H,T)
              T    (T,H)    (T,T)

What is the chance of each outcome? 1/4 each.
If Mutually Exclusive? Add!

    P((H,H) or (T,T)) = 1/4 + 1/4 = 1/2
    P(1 head) = P((H,T) or (T,H)) = 1/2
    P(at least 1 head) = P(1 head) + P(2 heads) = 3/4

The 1st and 2nd toss are independent! The outcomes (H,H), (H,T), (T,H), and (T,T) are mutually exclusive!
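The multiply-then-add bookkeeping above can be verified by enumerating all four outcomes (a sketch with exact fractions; variable names are mine):

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
# independent tosses: multiply probabilities along each joint outcome
outcomes = {pair: half * half for pair in product("HT", repeat=2)}
# mutually exclusive outcomes: add probabilities
p_one_head = outcomes[("H", "T")] + outcomes[("T", "H")]
p_at_least_one_head = p_one_head + outcomes[("H", "H")]
```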
A Simple Hypothesis
The “Binomial” -- any experiment with just two outcomes.
EX: Flower color. Suppose only yellow or green flowers.
Parents: (Y Y) × (g g)
1st cross: (Y g) × (Y g)
2nd cross: (Y Y) (Y g) (g Y) (g g)
Probability: 1/4 1/4 1/4 1/4
Hypothesis: Yellow is Dominant.
Result: yellow (Y Y, Y g, g Y) vs. green (g g)
Probability: 3/4 1/4
THE EXPERIMENT:
Select 100 flowers at random:

            Yellow   Green
Result:       84       16
Expected:     75       25     (3/4 × 100) (1/4 × 100) under H0

The Problem:
Is (84, 16) consistent with the hypothesis? Does (84, 16) support a probability of 75%, 25%?
Answer, in the form of a question:
What’s the probability that (84, 16) could come from a true population of 75%, 25%?
H0: p = 75%
The Binomial Distribution: Take it Slow!

n = 1:  P(Y) = .75    P(g) = .25

n = 2:  P(YY) = P(Y) P(Y) = .75 × .75 = .5625
        P(Yg) = P(gY) = P(Y) P(g) = .75 × .25 = .1875
        P(gg) = P(g) P(g) = .25 × .25 = .0625

BUT…..
P(YY) + P(Yg) + P(gg) = .8125 ≠ 1
What’s wrong? We have the possibility of both Yg and gY!
It should be P(YY) + P(Yg) + P(gY) + P(gg) = 1, which is:

  # ways   Probability
  (1)      P(YY) = .75²
  (2)      P(Yg) = P(gY) = .75 × .25
  (1)      P(gg) = .25²

Ex: P(at least one Y) = P(YY) + P(gY) + P(Yg) = .9375

n = 3:
  (1)  P(YYY) = .75³
  (3)  P(YYg) = P(YgY) = P(gYY) = .75² × .25
  (3)  P(Ygg) = P(gYg) = P(ggY) = .75 × .25²
  (1)  P(ggg) = .25³
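The hand counts for n = 2 and n = 3 can be checked in a few lines (a sketch; variable names are mine):

```python
p_y, p_g = 0.75, 0.25            # P(Y) and P(g) under the hypothesis

# n = 2: multiply along each ordering, then count the orderings
p_yy = p_y * p_y                 # one way:  YY
p_yg = 2 * p_y * p_g             # two ways: Yg and gY
p_gg = p_g * p_g                 # one way:  gg
assert abs(p_yy + p_yg + p_gg - 1) < 1e-12    # now everything adds to 1

# n = 3: the "# ways" counts are 1, 3, 3, 1
p3 = [1 * p_y**3, 3 * p_y**2 * p_g, 3 * p_y * p_g**2, 1 * p_g**3]
assert abs(sum(p3) - 1) < 1e-12
```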
The Binomial distribution (cont.)
In general, for x = 0, 1, ..., n:

    P(x Y's, (n-x) g's) = [# ways to arrange x Y's and (n-x) g's] × P(Y Y ... Y g g ... g)
                        = C(n, x) p^x (1-p)^(n-x)

where

    C(n, x) = n! / (x! (n-x)!)   and   n! = n(n-1)(n-2)...(2)(1)

The counts C(n, x) can also be read off Pascal’s Triangle:

    1
    1  1
    1  2  1
    1  3  3  1
    1  4  6  4  1
    1  5 10 10  5  1
    1  6 15 20 15  6  1
    1  7 21 35 35 21  7  1
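Pascal’s Triangle and the factorial formula give the same counts; a small check (a sketch, not part of the slides; `math.comb` is Python’s built-in n-choose-x):

```python
from math import comb

def pascal_row(n):
    """Build row n of Pascal's triangle by the additive recurrence."""
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

# the counting coefficients match n! / (x! (n-x)!)
for n in range(8):
    assert pascal_row(n) == [comb(n, x) for x in range(n + 1)]
```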
Calculate the Expectation (mean) of a binomial:

    E[Bin(n, p)] = Σ_{x=0}^{n} x C(n,x) p^x q^(n-x)
                 = Σ_{x=1}^{n} x [n! / (x!(n-x)!)] p^x q^(n-x)
                 = np Σ_{x=1}^{n} [(n-1)! / ((x-1)!((n-1)-(x-1))!)] p^(x-1) q^((n-1)-(x-1))
                 = np (p + q)^(n-1) = np · 1^(n-1) = np
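The identity E[Bin(n, p)] = np can be confirmed by summing directly over the density (a sketch; the function name is mine):

```python
from math import comb

def binom_mean(n, p):
    """E[X] for X ~ Bin(n, p), summed directly over the density."""
    q = 1 - p
    return sum(x * comb(n, x) * p**x * q**(n - x) for x in range(n + 1))
```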
For n = 100, P(g) = .25:

 x    P(x g's)     P(≤ x g's)
 0    0.00000000   0.00000000
 1    0.00000000   0.00000000
 2    0.00000000   0.00000000
 3    0.00000000   0.00000000
 4    0.00000002   0.00000002
 5    0.00000010   0.00000012
 6    0.00000052   0.00000064
 7    0.00000235   0.00000299
 8    0.00000910   0.00001209
 9    0.00003100   0.00004308
10    0.00009402   0.00013710
11    0.00025642   0.00039352
12    0.00063392   0.00102744
13    0.00143038   0.00245782
14    0.00296294   0.00542076
15    0.00566251   0.01108327
16    0.01002735   0.02111062
17    0.01651564   0.03762627
18    0.02538515   0.06301142
19    0.03651899   0.09953041
20    0.04930064   0.14883105
21    0.06260399   0.21143505
22    0.07493508   0.28637013
23    0.08470922   0.37107936
24    0.09059180   0.46167117
25    0.09179969   0.55347085
26    0.08826894   0.64173979
27    0.08064076   0.72238052
28    0.07008065   0.79246116
29    0.05799779   0.85045892
30    0.04575381   0.89621270
31    0.03443835   0.93065107
32    0.02475256   0.95540363
33    0.01700176   0.97240537
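The cumulative column can be reproduced by summing the binomial density (a sketch; the function name is mine):

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Bin(n, p), by direct summation."""
    q = 1 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(x + 1))

# chance of 16 or fewer green flowers out of 100 when P(green) = 1/4
p_value = binom_cdf(16, 100, 0.25)   # matches the 0.0211 in the table
```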
Conclusions:
Chance of (84,16) = chance of getting (84,16) or anything “rarer” than (84,16)
= P(84,16) + P(85,15) + P(86,14) + ... + P(100,0) = .0211
What is rare enough?
Biomedical Convention: .05 or 5%
RULE: If the experiment is rarer than the cutoff level, say that the experiment is not consistent with the hypothesis!
If less rare, say it is consistent!
Other Cutoffs:
Lower: .01 or 1% (1/100), or .001 or .1% (1/1000), for situations needing a lower error rate.
Example: the Bruston Explosive Bolt.
Or higher:
Example: Physiological studies. Example: Secondary mets. in pediatric leukemia study.
Conclusion:
No one cutoff works in every situation. The cutoff should be set beforehand to avoid bias.
What is the cutoff? What is the p-value?
Decide against H0 when

    Pr[X ≤ x | H0] ≤ cutoff probability

where X ranges over the values in the universe of the statistic under the H0, and x is the statistic from the experiment.
If the probability statement is true, then decide that the experiment is not consistent with the hypothesis.
But there’s still a chance the experiment came from the H0!
                   Decide from Experiment
                   H0             HA
Actual    H0       No Error       Type I (α)
Truth     HA       Type II (β)    No Error

Type I error  = α = Pr[decide ~H0 | H0 is true]
Type II error = β = Pr[decide H0 | H0 is not true]
Three numbers (α, β, and the cutoff): if you have one, you have all, given the test.
Summary of the Binomial
• Density function:

    b(i; n, p) = C(n, i) p^i q^(n-i),   where q = 1 - p

• Distribution function:

    B(x; n, p) = Σ_{i=0}^{x} C(n, i) p^i q^(n-i)

• Mean: μ = np
• Variance: σ² = npq
• Sample estimates:

    x̄ = Σ x_i / n,   s² = Σ (x_i - x̄)² / (n - 1),   s = (s²)^(1/2)
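A direct check that the density sums to 1 with mean np and variance npq (a sketch; the function name is mine):

```python
from math import comb

def binom_moments(n, p):
    """Mean and variance of Bin(n, p), summed directly over the density."""
    q = 1 - p
    pmf = [comb(n, i) * p**i * q**(n - i) for i in range(n + 1)]
    mean = sum(i * w for i, w in enumerate(pmf))
    var = sum((i - mean) ** 2 * w for i, w in enumerate(pmf))
    return mean, var

# for n = 100, p = .25: np = 25 and npq = 18.75
mean, var = binom_moments(100, 0.25)
```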
Another Way to Look at Flowers:
H0: Yellow is dominant    HA: Yellow is not dominant

                    Y      g     Total
Observed:          84     16      100
Expected percent:  75%    25%
Expected number:   75     25      100    (n × proportion)

Chi-square:

    χ² = Σ_{i=1}^{#terms} (observed_i - expected_i)² / expected_i

Degrees of freedom = #terms - 1

For the (84, 16) example:

    χ² = (84 - 75)²/75 + (16 - 25)²/25 = 1.08 + 3.24 = 4.32

Degrees of freedom = 2 - 1 = 1
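The arithmetic above in two lines of Python (a sketch, not part of the slides):

```python
observed = [84, 16]
expected = [75, 25]            # 100 * (3/4) and 100 * (1/4) under H0
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# terms: 81/75 = 1.08 and 81/25 = 3.24, so chi2 = 4.32 with 2 - 1 = 1 DF
```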
So how extreme is 4.32?
[Figures: the chi-square density and cumulative distribution plotted for X from 0 to 10, and zoomed in from 3.0 to 8.0. The cutoff divides the axis into “Support H0” (below the cutoff) and “Reject H0” (above it).]
Table B.1 (DF = 1):

    p-value      x
    0.01       6.635
    0.025      5.024
    0.05       3.841

Our statistic 4.32 falls between 5.024 and 3.841, so 0.025 < p < 0.05.

Table B.1 at p-value 0.05, by DF:

    DF      x
     1    3.841
     4    9.488
    10   18.307
    20   31.410
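For DF = 1 the table can be bypassed: a chi-square(1) variate is a squared standard normal, so the upper-tail probability is erfc(√(x/2)). A sketch (the function name is mine):

```python
from math import erfc, sqrt

def chi2_sf_df1(x):
    """Upper tail of chi-square with 1 DF.

    X = Z^2 for Z standard normal, so
    P(X > x) = 2 P(Z > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(x / 2))

p = chi2_sf_df1(4.32)   # between 0.025 and 0.05, as Table B.1 indicates
```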
Another Example: More than 2 Groups
Color: green & yellow. Texture: smooth & wrinkled.
Hypothesis: Y is dominant; S is dominant; color and texture are independent.
Pr(any cell) = 1/16

                            Color
                (Y Y)      (g Y)      (Y g)      (g g)
Texture (S S)  (YY,SS)    (gY,SS)    (Yg,SS)    (gg,SS)
        (S w)  (YY,Sw)    (gY,Sw)    (Yg,Sw)    (gg,Sw)
        (w S)  (YY,wS)    (gY,wS)    (Yg,wS)    (gg,wS)
        (w w)  (YY,ww)    (gY,ww)    (Yg,ww)    (gg,ww)

H0: 9 : 3 : 3 : 1 (out of 16)

            YS        Yw        gS        gw      Total
Obs:        152       39        53        6        250
Pr(H0):     9/16      3/16      3/16      1/16
           (0.5625)  (0.1875)  (0.1875)  (0.0625)
Expected:   140.625   46.875    46.875    15.625

χ² = (152-140.625)²/140.625 + (39-46.875)²/46.875 + (53-46.875)²/46.875 + (6-15.625)²/15.625
   = 0.9201 + 1.3230 + 0.8003 + 5.929 = 8.97

DF = 1 + 1 + 1 + 1 - 1 = 3
χ²_.05(3) = 7.815 < 8.97; (p = .0293 < 0.05) reject H0
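The 9 : 3 : 3 : 1 computation, verified (a sketch, not part of the slides):

```python
observed = [152, 39, 53, 6]
ratios = [9, 3, 3, 1]                      # H0: 9 : 3 : 3 : 1
n = sum(observed)                          # 250 flowers
expected = [n * r / 16 for r in ratios]    # 140.625, 46.875, 46.875, 15.625
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 = 8.97 on 4 - 1 = 3 DF; exceeds 7.815, so reject H0
```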
So, Where’s the Difference? (Subdividing the H0)
It looks like: (1) too few gw; (2) about the right # of the others.
Plan: combine YS + Yw + gS and compare to gw. But first, test (2):

H0: YS, Yw, gS in 9 : 3 : 3
                                              Total
Obs:       152          39          53         244
Pr(H0):    9/15 = .6    3/15 = .2   3/15 = .2
Exp:       146.4        48.8        48.8
χ² = 0.2142 + 1.9680 + 0.3615 = 2.544     DF = 1 + 1 + 1 - 1 = 2
χ²_.05(2) = 5.991 > 2.544: accept H0

H0: Others vs gw in 15 : 1
                              Total
Obs:       244        6        250
Pr(H0):    15/16 = .9375    1/16 = .0625
Expected:  234.375    15.625
χ² = 0.3953 + 5.929 = 6.324     DF = 1 + 1 - 1 = 1
χ²_.05(1) = 3.841 < 6.324: reject H0 and accept “too few gw”
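Both subdivided tests in one helper (a sketch; the function name is mine):

```python
def chi2_stat(observed, probs):
    """Goodness-of-fit chi-square for observed counts vs H0 proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

# 9 : 3 : 3 among the non-gw classes: consistent with H0
within = chi2_stat([152, 39, 53], [9/15, 3/15, 3/15])    # < 5.991, 2 DF
# 15 : 1, others vs gw: rejected
versus = chi2_stat([244, 6], [15/16, 1/16])              # > 3.841, 1 DF
```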
Summary:

    χ² = Σ_{i=1}^{k} (f_i - f̂_i)² / f̂_i

where f_i = the i-th observation and f̂_i = the i-th expected value (under H0).
DF = k - 1
Others:
(1) Continuity Correction (Yates):

    χ²_c = Σ_{i=1}^{k} (|f_i - f̂_i| - 0.5)² / f̂_i

(2) Rule of Thumb for using χ² instead of the Binomial:
If no more than 25% of the expected frequencies f̂_i are < 5, and none is ≤ 1, then use χ².
(3) Log-Likelihood Ratio:

    G = 2 Σ_{i=1}^{k} f_i ln(f_i / f̂_i) = 2 [Σ f_i ln f_i - Σ f_i ln f̂_i]

(related to Entropy and Information)
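Both alternatives applied to the earlier (84, 16) flower data (a sketch, not part of the slides):

```python
from math import log

observed = [84, 16]
expected = [75, 25]

# (1) Yates continuity-corrected chi-square
chi2_c = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))
# (3) log-likelihood ratio statistic
G = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
# both land near the uncorrected chi-square of 4.32 on 1 DF
```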
(4) Heterogeneity Chi-Square
There is often a need to combine chi-square analyses. The common cause is a batch effect, where only a certain number of subjects can be handled at a time (e.g., cages, school classes, laboratories, or clinics), but there is a common hypothesis over all batches (e.g., gender, ethnicity, presence/absence of a marker).
(a) Perform a chi-square on each “batch”.
(b) Pool all batches and do a “pooled chi-square”.
(c) Sum the individual chi-squares (d.f. = sum of the individual batch d.f. = k batches times the d.f. for each batch).
(d) Subtract the pooled chi-square from the sum and test with (k-1) × (individual batch d.f.). This difference is the heterogeneity chi-square.
Ex 22.5: Heterogeneity chi-square analysis (data of G. Mendel, 1933). Expected counts (3:1) in parentheses.

Experiment   Yellow seeds    Green seeds    Total seeds (n)   Chi-square   DF
 1           25 (27.0000)    11 (9.0000)         36             0.5926      1
 2           32 (29.2500)     7 (9.7500)         39             1.0342      1
 3           14 (14.2500)     5 (4.7500)         19             0.0175      1
 4           70 (72.7500)    27 (24.2500)        97             0.4158      1
 5           24 (27.7500)    13 (9.2500)         37             2.0270      1
 6           20 (19.5000)     6 (6.5000)         26             0.0513      1
 7           32 (33.7500)    13 (11.2500)        45             0.3630      1
 8           44 (39.7500)     9 (13.2500)        53             1.8176      1
 9           50 (48.0000)    14 (16.0000)        64             0.3333      1
10           44 (46.5000)    18 (15.5000)        62             0.5376      1

Total of chi-squares:                                           7.1899     10
Chi-square of totals
(i.e., pooled):  355 (358.5000)   123 (119.5000)    478         0.1367      1
Difference, total - pooled
(heterogeneity chi-square):                                     7.0532      9   (0.50 < P < 0.75)
Ex 22.6: Heterogeneity Chi-Square. Expected counts (1:1) in parentheses.

Sample   Right-handed    Left-handed     N    Chi-square   DF
1         3 (7.0000)     11 (7.0000)    14     4.5714*      1
2         4 (8.0000)     12 (8.0000)    16     4.0000*      1
3         5 (10.0000)    15 (10.0000)   20     5.0000*      1
4        14 (9.0000)      4 (9.0000)    18     5.5556*      1
5        13 (8.5000)      4 (8.5000)    17     4.7647*      1
6        17 (11.0000)     5 (11.0000)   22     6.5455*      1

*Statistically significant.

Total of chi-squares:                         30.4372       6
Chi-square of totals
(i.e., pooled):  56 (53.5000)  51 (53.5000)  107   0.2336   1
Difference, total - pooled
(heterogeneity chi-square):                   30.2036*      5   (P < 0.001)
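The four-step recipe, checked against the Ex 22.6 counts (a sketch; the function name is mine):

```python
def chi2_1to1(a, b):
    """Goodness-of-fit chi-square for a 1:1 hypothesis on one batch."""
    e = (a + b) / 2                         # expected count in each class
    return (a - e) ** 2 / e + (b - e) ** 2 / e

# (right-handed, left-handed) counts for the six samples
batches = [(3, 11), (4, 12), (5, 15), (14, 4), (13, 4), (17, 5)]

total = sum(chi2_1to1(a, b) for a, b in batches)       # (c) sum, 6 DF
pooled = chi2_1to1(sum(a for a, _ in batches),         # (b) pooled counts
                   sum(b for _, b in batches))         #     56 vs 51, 1 DF
heterogeneity = total - pooled                         # (d) 5 DF
```

Each batch alone is significant, the pooled counts are not, and the heterogeneity term shows the batches disagree in direction.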
II. SC Exchanges in Lymphocytes

Table 4. Distribution of exchanges between chromosomes.

Chromosome   Total Length   Relative (Proportional) Length   Observed Exchanges
1                18.16             .08712                          266
2                16.90             .08107                          296
3                14.20             .06812                          186
4-5              25.36             .12165                          442
6-12-X           78.06             .37446                          888
13-15            21.00             .10074                          255
16-18            18.28             .08769                          125
19-20             8.50             .04078                           26
21-22-Y           8.00             .03838                           23
Total           208.46            1.00001                         2507

TEST: H0: Exchanges are proportional to the length of the chromosome.
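Under H0 the expected count for each class is total exchanges × relative length, and the usual goodness-of-fit chi-square applies (a sketch; variable names are mine):

```python
# H0: exchanges fall on each chromosome group in proportion to its
# relative length (Table 4)
relative_length = {"1": .08712, "2": .08107, "3": .06812, "4-5": .12165,
                   "6-12-X": .37446, "13-15": .10074, "16-18": .08769,
                   "19-20": .04078, "21-22-Y": .03838}
observed = {"1": 266, "2": 296, "3": 186, "4-5": 442, "6-12-X": 888,
            "13-15": 255, "16-18": 125, "19-20": 26, "21-22-Y": 23}

n = sum(observed.values())                               # 2507 exchanges
chi2 = sum((observed[c] - n * relative_length[c]) ** 2 / (n * relative_length[c])
           for c in relative_length)
# 9 classes -> 8 DF; compare to the .05 critical value 15.507 (Table B.1)
```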
Problem Set 1: