Post on 27-Jun-2020
How do we test whether the probability distributions of two datasets are statistically different?
• Reading assignment:
– Hartman’s note-1: p 15-23
– Khan Academy: https://www.khanacademy.org/math/probability/statistics-inferential/chi-square/v/pearson-s-chi-square-test-goodness-of-fit
– https://www.khanacademy.org/math/probability/statistics-inferential/anova/v/anova-3-hypothesis-test-with-f-statistic
Testing probability distributions: goodness-of-fit tests
• Goal: determine whether two datasets are drawn from the same statistical distribution, or whether one dataset is drawn from a specific statistical distribution, e.g., normal, gamma, etc. The most commonly used tests include:
– Parametric goodness-of-fit test: χ2 test
– Non-parametric test: Kolmogorov-Smirnov test (can also be used to test against a parametric distribution function, e.g., the Q-Q test for the normal distribution)
– F-test: determines whether the difference between two data distributions is due to variation within each dataset or to a difference between the two datasets.
Test the difference in probability distribution of the data samples:
• Chi-square (χ2) test: A simple and common goodness-of-fit test. It compares the histogram (empirical PDF) of the binned (discrete) data sample to a parametric PDF, such as a Gaussian or Gamma distribution. Thus, it can also be used to test whether the data PDF is Gaussian, Gamma, etc.
$\chi^2 = \sum_{\text{bins}} \frac{(\#\,\text{observed} - \#\,\text{expected})^2}{\#\,\text{expected}} = \sum_{\text{bins}} \frac{(\#\,\text{observed} - N \cdot \Pr\{\text{data in bin}\})^2}{N \cdot \Pr\{\text{data in bin}\}}$
• If the numbers of observed data in the various bins are very different from those expected from a specified PDF, e.g., Gaussian, χ2 will exceed the critical value at the specified confidence level (say 95%). The null hypothesis that the PDF of the observed data is the same as the specified PDF can then be rejected at that confidence level.
• You can also use the χ2 test to determine whether two datasets are significantly different from each other.
Example-5: χ2 test
• Test whether the January rainfall in Ithaca, NY over the past 50 years (N=50) has a Gamma or a Gaussian distribution:
Bin/class (inches)          <1      1-1.5   1.5-2   2-2.5   2.5-3   ≥3
Observed #                  5       16      10      7       7       5
Gamma probability, Pr       0.161   0.215   0.210   0.161   0.108   0.145
Gamma expected #, N·Pr      8.05    10.75   10.5    8.05    5.40    7.25
Gaussian probability, Pr    0.195   0.146   0.173   0.178   0.132   0.176
Gaussian expected #, N·Pr   9.75    7.30    8.65    8.90    6.60    8.80
[Figure: number of samples in each rainfall bin (inches): observed, gamma, and gaussian]
χ2 test for Gamma distribution:
$\chi^2 = \frac{(5-8.05)^2}{8.05} + \frac{(16-10.75)^2}{10.75} + \frac{(10-10.5)^2}{10.5} + \frac{(7-8.05)^2}{8.05} + \frac{(7-5.4)^2}{5.4} + \frac{(5-7.25)^2}{7.25} = 5.05$
χ2 test for Gaussian distribution:
$\chi^2 = \frac{(5-9.75)^2}{9.75} + \frac{(16-7.3)^2}{7.3} + \frac{(10-8.65)^2}{8.65} + \frac{(7-8.9)^2}{8.9} + \frac{(7-6.6)^2}{6.6} + \frac{(5-8.8)^2}{8.8} = 14.96$
The number of degrees of freedom is ν = (# of bins - # of parameters - 1) = 6 - 2 - 1 = 3. At 95% confidence, the critical value from the χ2 distribution table is $\chi_c^2 = 7.815$. Thus, the null hypothesis that the January rainfall in Ithaca has a Gamma distribution cannot be rejected at 95% confidence ($\chi^2 = 5.05 < \chi_c^2 = 7.815$). However, the null hypothesis that the January rainfall in Ithaca has a Gaussian distribution can be rejected at 95% confidence ($\chi^2 = 14.96 > \chi_c^2 = 7.815$).
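The arithmetic of Example-5 can be reproduced with a short script. This is a minimal sketch in plain Python: the function name chi_square_statistic is an illustrative choice, the counts and probabilities are copied from the example table, and the critical value 7.815 is taken from the χ2 table rather than computed.

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

N = 50  # number of January rainfall totals
observed = [5, 16, 10, 7, 7, 5]
pr_gamma = [0.161, 0.215, 0.210, 0.161, 0.108, 0.145]
pr_gauss = [0.195, 0.146, 0.173, 0.178, 0.132, 0.176]

# Expected counts are N times the probability of each bin.
chi2_gamma = chi_square_statistic(observed, [N * p for p in pr_gamma])
chi2_gauss = chi_square_statistic(observed, [N * p for p in pr_gauss])

CHI2_CRIT = 7.815  # table value for nu = 3 at 95% confidence

for name, value in [("Gamma", chi2_gamma), ("Gaussian", chi2_gauss)]:
    verdict = "rejected" if value > CHI2_CRIT else "not rejected"
    print(f"{name}: chi2 = {value:.2f}, null hypothesis {verdict}")
```

Each bin's squared difference is divided by that bin's own expected count before summing, which is what the general χ2 formula requires.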
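χ2 test of the uncertainty of the variance
To estimate the uncertainty of a variance derived from a limited number of samples, we can use the χ2 statistic:
$\chi^2 = (N-1)\frac{s^2}{\sigma^2}$
where $s^2$ is the variance derived from the data, $\sigma^2$ is the true variance, and N is the number of samples.
Assuming the samples are drawn from a normal distribution, χ follows the distribution
$f(\chi) = f_0\,\chi^{\nu-1} e^{-\frac{1}{2}\chi^2}$
where ν is the number of degrees of freedom, ν = N - number of parameters being estimated.
Because the χ2 distribution is not symmetric, the lower and upper limits must be looked up separately. Writing $\chi^2_p$ for the value below which a fraction p of the χ2 distribution lies, the statistic falls in the following range at 95% confidence:
$\chi^2_{0.025} < (N-1)\frac{s^2}{\sigma^2} < \chi^2_{0.975}$
A hypothesized true variance $\sigma^2$ for which the statistic falls outside this range can be rejected as different from $s^2$ at 95% confidence. Rearranging gives the 95% confidence interval for the true variance:
$\frac{(N-1)s^2}{\chi^2_{0.975}} < \sigma^2 < \frac{(N-1)s^2}{\chi^2_{0.025}}$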
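As an illustration of the variance interval, the sketch below turns two χ2 table values into bounds on the true variance. The function name and the sample numbers (N = 10, s2 = 4.0) are hypothetical; the table values 2.700 and 19.023 are the standard 2.5% and 97.5% points for ν = 9. The function sorts the two table values itself, so it does not depend on how the subscripts are labelled.

```python
def variance_confidence_interval(s2, n, chi2_a, chi2_b):
    """Return (lower, upper) bounds on the true variance sigma^2.

    chi2_a and chi2_b are the two chi-square table values that bracket the
    central range for the chosen confidence level; their order does not matter.
    """
    chi2_small, chi2_large = sorted((chi2_a, chi2_b))
    scaled = (n - 1) * s2
    # Dividing by the larger table value gives the lower bound, and vice versa.
    return scaled / chi2_large, scaled / chi2_small

# Hypothetical sample: N = 10 values with sample variance s^2 = 4.0.
# Table values for nu = 9 that bracket the central 95%: 2.700 and 19.023.
lower, upper = variance_confidence_interval(4.0, 10, 2.700, 19.023)
print(f"{lower:.2f} < sigma^2 < {upper:.2f}")
```

The interval is not centred on s2: the upper bound lies much farther from s2 than the lower bound, which reflects the asymmetry of the χ2 distribution.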
The Kolmogorov-Smirnov (K-S) Test:
• K-S test: Another very frequently used goodness-of-fit test. It compares the cumulative distribution functions (CDFs) of two datasets, or the CDF of the empirical data with the theoretical CDF of a continuous distribution. It does not require knowing the distribution function of the data.
• Caution: when the data are compared with a distribution whose parameters were fitted to the same data, the test becomes less likely to reject the null hypothesis.
A variety of applications of the K-S test:
• Test whether two datasets are drawn from the same statistical distribution: Smirnov two-sample test
• Test whether a dataset is drawn from a specific probability distribution, e.g., Gaussian, Gamma, etc.
– Lilliefors test
– Q-Q test
The K-S test compares the CDFs of the two datasets
One can compare the CDFs of two datasets, or the CDF of one dataset with that of a specified distribution, such as a Gaussian distribution.
The Kolmogorov-Smirnov (K-S) Test:
• A commonly used K-S test is the Lilliefors (1967) test:
$D_n = \max_x |F_1(x) - F_2(x)|$
$D_n$ is the maximum difference between the CDFs of the two datasets; F(x) is the CDF evaluated at x.
The critical value for the K-S test at a specified significance level, α, is
$C_\alpha = \frac{K_\alpha}{\sqrt{N} + 0.12 + 0.11/\sqrt{N}}$
where $K_\alpha$ = 1.224, 1.358 and 1.628 for α = 0.10, 0.05 and 0.01, respectively. If $D_n > C_\alpha$, the null hypothesis that the data have the same CDF as the specified or theoretical CDF can be rejected at the 1-α confidence level.
[Figure: CDF curves F1(x) and F2(x)]
Confidence interval:
• One can state that the true CDF, F(x), lies within $F_N(x) \pm C_\alpha$ at the 1-α confidence level.
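The maximum CDF difference can be computed directly from a sorted sample. The sketch below is one possible implementation for the one-sample case, in plain Python; the function name, the sample values, and the reference Gaussian (mean 2.0, standard deviation 1.0) are all hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest vertical distance between the sample's empirical CDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d_n = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides.
        d_n = max(d_n, i / n - f, f - (i - 1) / n)
    return d_n

# Hypothetical sample compared with a Gaussian whose parameters are assumed known.
sample = [1.2, 0.7, 2.9, 1.8, 2.2, 1.5, 3.4, 0.9]
reference = NormalDist(mu=2.0, sigma=1.0)
d_n = ks_statistic(sample, reference.cdf)
print(f"D_n = {d_n:.3f}")
```

Because the empirical CDF is a step function, the largest gap can occur just before or just after a step, which is why both sides of each jump are checked.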
Example-6:
• K-S test on whether the January rainfall in Ithaca during 1933-82 (N = 50) has a Gaussian or Gamma distribution:
Rainfall bin   Obs CDF   Gamma CDF   Gaussian CDF   |Obs - Gamma|   |Obs - Gaussian|
<1             0.1       0.161       0.10           0.061           0
1.5            0.42      0.376       0.30           0.03            0.12
2              0.62      0.586       0.42           0.034           0.20
2.5            0.76      0.61        0.62           0.015           0.14
3              0.9       0.850       0.76           0.05            0.14
>3             1         1           1              0               0
D(N)                                                0.061           0.2
$C_{0.05} = \frac{K_\alpha}{\sqrt{50} + 0.12 + 0.11/\sqrt{50}} = \frac{1.358}{7.07 + 0.12 + 0.016} = 0.188$
For the Gamma distribution: $D_{50} = 0.061 < C_{0.05} = 0.188$. Thus, the null hypothesis that the January rainfall is drawn from a Gamma distribution cannot be rejected at 95% confidence. We can also state that the January rainfall distribution at Ithaca, $F_{50}(x)$, fits the true Gamma distribution within $F_{50}(x) \pm C_{0.05}$ at 95% confidence.
For the Gaussian distribution: $D_{50} = 0.2 > C_{0.05} = 0.188$. The null hypothesis that the January rainfall is drawn from a Gaussian distribution can be rejected at 95% confidence.
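The critical value used in Example-6 can be checked numerically. This is a small sketch in plain Python; the function name and the dictionary K_ALPHA are illustrative, while the constants and the two D values (0.061 and 0.2) are taken from the slides.

```python
from math import sqrt

# K_alpha constants quoted in the slides for alpha = 0.10, 0.05 and 0.01.
K_ALPHA = {0.10: 1.224, 0.05: 1.358, 0.01: 1.628}

def ks_critical_value(n, alpha=0.05):
    """C_alpha = K_alpha / (sqrt(N) + 0.12 + 0.11 / sqrt(N))."""
    root_n = sqrt(n)
    return K_ALPHA[alpha] / (root_n + 0.12 + 0.11 / root_n)

c_05 = ks_critical_value(50)
print(f"C_0.05 = {c_05:.3f}")

# D values read from the Example-6 table.
for name, d_n in [("Gamma", 0.061), ("Gaussian", 0.2)]:
    verdict = "rejected" if d_n > c_05 else "not rejected"
    print(f"{name}: D = {d_n}, null hypothesis {verdict} at 95% confidence")
```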
Test if two datasets come from the same distribution (are statistically the same or different)
Smirnov two-sample test: The null hypothesis is that two batches of data, X1 with n samples and X2 with m samples, come from the same distribution. The Smirnov test looks for the maximum absolute difference between the CDFs of these two batches of data:
$D_s = \max_x |F_n(x_1) - F_m(x_2)|$
and compares it to the critical value at the α level of probability, $D_{s,c}$:
$D_{s,c} = \sqrt{-\frac{1}{2}\left(\frac{1}{n} + \frac{1}{m}\right)\ln\frac{\alpha}{2}}$
If $D_s > D_{s,c}$, the null hypothesis can be rejected at the 1-α confidence level.
Example-7
• Test whether the Austin climatological monthly rainfall is statistically different from that of Dallas.
[Figure: climatological monthly rainfall (in) for months 1-12, Austin and Dallas]
Month   Austin rainfall (in)   Dallas rainfall (in)
1       2.23                   2.06
2       2.37                   2.7
3       2.51                   3.49
4       2.28                   3.07
5       2.66                   4.92
6       4.38                   4.11
7       2.45                   2.21
8       1.63                   1.87
9       2.49                   2.84
10      3.95                   4.79
11      2.95                   2.88
12      2.25                   2.74
[Figure: Smirnov two-sample test: Austin CDF and Dallas CDF versus rainfall bins]
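Use the Smirnov two-sample test:
The null hypothesis: the two datasets are drawn from the same statistical distribution.
$D_s = \max_x |F_n(x_1) - F_m(x_2)| = 0.15$
$D_{s,95\%} = \sqrt{-\frac{1}{2}\left(\frac{1}{n} + \frac{1}{m}\right)\ln\frac{\alpha}{2}} = \sqrt{-\frac{1}{2}\left(\frac{1}{12} + \frac{1}{12}\right)\ln\frac{0.05}{2}} = 0.55$
$D_{s,90\%} = \sqrt{-\frac{1}{2}\left(\frac{1}{n} + \frac{1}{m}\right)\ln\frac{\alpha}{2}} = \sqrt{-\frac{1}{2}\left(\frac{1}{12} + \frac{1}{12}\right)\ln\frac{0.1}{2}} = 0.50$
$D_s = 0.15 < D_{s,90\%} = 0.50 < D_{s,95\%} = 0.55$, so the null hypothesis cannot be rejected at the 90% or 95% confidence level.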
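The two critical values can be evaluated with a few lines of plain Python. This sketch assumes the square-root form of the Smirnov critical value, which reproduces the 0.55 quoted at 95%; evaluated the same way, the 90% value comes out at about 0.50. The function name is illustrative, and D_s = 0.15 is taken as given from the slide's binned CDFs rather than recomputed.

```python
from math import log, sqrt

def smirnov_critical_value(n, m, alpha):
    """D_{s,c} = sqrt(-0.5 * (1/n + 1/m) * ln(alpha / 2))."""
    return sqrt(-0.5 * (1 / n + 1 / m) * log(alpha / 2))

d_s = 0.15  # maximum CDF difference quoted in the slides
d_95 = smirnov_critical_value(12, 12, 0.05)
d_90 = smirnov_critical_value(12, 12, 0.10)

print(f"D_s,95% = {d_95:.2f}, D_s,90% = {d_90:.2f}")
print("reject at 95%:", d_s > d_95)
print("reject at 90%:", d_s > d_90)
```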
Quantile-quantile test:
• A common approach to compare whether two datasets have the same distribution.
• A simple and effective way of testing whether the data have a Gaussian distribution.
• Idea behind the Q-Q test: If the data fit a fitted Gaussian distribution perfectly, the empirical CDF of the data agrees perfectly with the CDF of the fitted Gaussian distribution. Consequently, the value at each cumulative probability in the data agrees with that expected from the fitted Gaussian distribution, i.e., the two are perfectly correlated, r=1.
• In the Q-Q plot, one can plot the quantile-ranked empirical values, xi (the ith smallest value), against the values expected from the fitted Gaussian distribution, or plot the empirical xi or ln(xi) against z values. ln(xi) is often used for non-linear data (exponentially increasing or decreasing) to improve the linearity.
How does it work?
• If we have a dataset with n samples and rank them from the smallest to the largest value, each sample value has a probability of 1/n of occurring, and its value xi is linearly related to zi=(xi-X)/s, where X is the mean and s is the standard deviation estimated from the data. Thus, if the data were normally distributed, the distribution of its zi (~xi) would be similar to that of the zi of a normal distribution.
Jan rainfall in Ithaca
For a dataset, Xi, i=1,n, we can rank the values from the smallest to the largest. For the ith smallest value, its cumulative probability can be determined by
• either cp(xi)=(i-0.5)/n or a variation of a similar formula, e.g., cp(xi)=(i-1/3)/(n+1/3) (a=1/3 for the Tukey plotting position),
• or, more generally, cp(xi)=(i-a)/(n+1-2a), where 0≤a≤1.
Then find the z-value of each cp(xi), i.e., the inverse of the standard normal CDF at cp(xi), using the Excel function NORM.S.INV, and plot the ranked data against the z-value of each sample. If the dataset has a normal distribution, the data points will follow a straight line. Otherwise, they will deviate from the straight line. For more information and examples, see
http://www.youtube.com/watch?v=X9_ISJ0YpGw
https://www.youtube.com/watch?v=eYp9QvlDzJA
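The same steps can be carried out outside Excel. The sketch below uses Python's standard-library NormalDist, whose inv_cdf method plays the role of the inverse standard normal CDF; the function names and the sample data are hypothetical.

```python
from statistics import NormalDist

def plotting_positions(n, a=1/3):
    """Cumulative probabilities cp_i = (i - a) / (n + 1 - 2a) for i = 1..n."""
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]

def normal_scores(n, a=1/3):
    """z-value of each plotting position (inverse of the standard normal CDF)."""
    std_normal = NormalDist()
    # inv_cdf needs 0 < cp < 1, which holds for any a strictly below 1.
    return [std_normal.inv_cdf(cp) for cp in plotting_positions(n, a)]

# Hypothetical data, ranked from the smallest to the largest value.
data = [2.4, 1.1, 3.0, 1.9, 2.7]
ranked = sorted(data)
for x, z in zip(ranked, normal_scores(len(ranked))):
    print(f"x = {x:.1f}, z = {z:+.3f}")
```

Setting a=0.5 recovers cp=(i-0.5)/n, and the default a=1/3 gives the Tukey plotting position, so one function covers the formulas listed above.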
Test whether the wet season onset over S. Amazon follows a Gaussian distribution
The null hypothesis: the wet season onset data samples follow a Gaussian distribution. Test:
• Compute the z-value of each value for the data sorted from the smallest to the largest value.
• Plot in the Q-Q plot. For a Gaussian distribution, the data samples should follow a straight line, as seen in the figure below.
• Compute the correlation between the sorted data samples and the expected data samples based on the Gaussian CDF: r(obs, exp)=0.99~1.0.
• Thus, the Q-Q test supports the null hypothesis with high confidence.
[Figure: Q-Q plot for normal distribution; axis ticks run from 55 to 66 and from -2 to 2]
Year   Rank   SA24 onset   Sorted   Tukey position   z-value   (unlabeled)
1979   1      58           56       0.07             -1.48
1980   1      58           56       0.07             -1.48     1.00
1981   1      57           56       0.07             -1.48     -1.00
1982   2      61           57       0.17             -0.94     1.00
1983   2      57           57       0.17             -0.94     -1.00
1984   2      58           57       0.17             -0.94     1.00
1985   2      57           57       0.17             -0.94     -1.00
1986   3      56           58       0.28             -0.60     -1.00
1987   3      61           58       0.28             -0.60     1.00
1988   3      56           58       0.28             -0.60     -1.00
1989   3      60           58       0.28             -0.60     1.00
1990   3      58           58       0.28             -0.60     -1.00
1991   3      59           58       0.28             -0.60     1.00
1992   3      61           58       0.28             -0.60     1.00
1993   4      59           59       0.38             -0.31     -1.00
1994   4      65           59       0.38             -0.31     1.00
1995   4      63           59       0.38             -0.31     -1.00
1996   4      59           59       0.38             -0.31     -1.00
1997   4      64           59       0.38             -0.31     1.00
1998   5      58           60       0.48             -0.04     -1.00
1999   5      60           60       0.48             -0.04     1.00
2000   5      60           60       0.48             -0.04     0.00
2001   5      62           60       0.48             -0.04     1.00
2002   5      58           60       0.48             -0.04     -1.00
2003   6      60           61       0.59             0.22      1.00
2004   6      59           61       0.59             0.22      -1.00
2005   6      59           61       0.59             0.22      0.00
2006   6      57           61       0.59             0.22      -1.00
2007   7      58           62       0.69             0.49      1.00
2008   8      64           63       0.79             0.82      1.00
2009   9      61           64       0.90             1.26      -1.00
2010   9      60           64       0.90             1.26      -1.00
2011   10     56           65       1.00             1.48      -1.00
S 24.92   Var(S) 4065.67   z = 0.41   CDF 0.49
z-values: Mean 57.67   std dev 2.31   R(obs, exp) 1.00
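The correlation step of the Q-Q test can be sketched in plain Python using the onset column of the table. The assumptions here are mine: the 33 onset values are transcribed from the table, each sorted value gets its own rank (ties are not merged), and the general plotting position with a=1/3 is used. The result therefore need not match the r(obs, exp) quoted in the slides, which appears to be based on ranks shared between tied values.

```python
from math import sqrt
from statistics import NormalDist, mean

# Wet season onset values for 1979-2011, transcribed from the table.
onset = [58, 58, 57, 61, 57, 58, 57, 56, 61, 56, 60,
         58, 59, 61, 59, 65, 63, 59, 64, 58, 60, 60,
         62, 58, 60, 59, 59, 57, 58, 64, 61, 60, 56]

def qq_correlation(values, a=1/3):
    """Pearson correlation between sorted values and their normal z-values."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    zs = [std_normal.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]
    mx, mz = mean(xs), mean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / sqrt(sxx * szz)

r = qq_correlation(onset)
print(f"r(obs, exp) = {r:.3f}")
```

A value of r close to 1 means the points in the Q-Q plot lie close to a straight line.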
[Figure: Absolute stream flow Q-Q plot]
Q-Q test for a Gaussian distribution of the Bull Creek mean stream flow in April 2007-2014.
• The extreme values in the left and right tails are much larger than those of a Gaussian distribution.
[Figure: PDF of the Bull Creek flow; Q-Q plot of the Bull Creek flow]
Non-Gaussian distribution
Q-Q test for a Gaussian distribution of the Bull Creek mean stream flow in April 2007-2014.
• Test the distribution of ln(stream flow)
[Figure: PDF of the Bull Creek flow; Q-Q plot of the log of Bull Creek flow, ln(flow)]
Still non-Gaussian distribution
Summary for the goodness-of-fit tests:
• The most commonly used tests include:
– Parametric goodness-of-fit test: χ2 test, based on comparison to the fitted PDF of a chosen parametric distribution function, e.g., Gaussian, etc.
– Non-parametric test: Kolmogorov-Smirnov test (can be used to test for a parametric distribution function, e.g., the Q-Q test for the normal distribution)