Experimental Design & Analysis
Nonparametric Methods
April 17, 2007
DOCTORAL SEMINAR, SPRING SEMESTER 2007
2
Nonparametric Tests
Occasions for use?
• The response variable (residual) cannot logically be assumed to be normally distributed, a key assumption of ANOVA models
• Limited data
• Counts and ranks are relevant units instead of means
3
Wilcoxon Rank Sum Test
• Nonparametric version of an independent samples t-test
• Example of corn yield as a function of weeding

Yield   153.1  156.0  158.6  165.0  166.7  172.2  176.4  176.9
Rank        1      2      3      4      5      6      7      8

Treatment   Sum of Ranks
No weeds    23
Weeds       13
4
Wilcoxon Rank Sum Test
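Calculate the test statistic by calculating the mean μ and the standard deviation σ:
$$\mu = \frac{n_1(N+1)}{2} = \frac{4(8+1)}{2} = \frac{4 \cdot 9}{2} = \frac{36}{2} = 18$$
$$\sigma = \sqrt{\frac{n_1 n_2 (N+1)}{12}} = \sqrt{\frac{(4)(4)(8+1)}{12}} = \sqrt{\frac{144}{12}} = \sqrt{12} = 3.464$$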
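/* The rank sum test compares two groups, so it needs a grouping variable.
   WEEDS is an assumed name for the treatment variable in MYDATA. */
proc npar1way data = MYDATA wilcoxon;
  class WEEDS;
  var CORN;
run;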
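As a hedged illustration of the last step, the rank sum W = 23 for the no-weeds group can be standardized with the μ and σ computed above. This sketch assumes the large-sample normal approximation without a continuity correction, which the slide does not specify:

```latex
z = \frac{W - \mu}{\sigma} = \frac{23 - 18}{3.464} \approx 1.44
```

Under that assumption the two-sided p-value is roughly 0.15, so with only four plots per treatment the difference would not be declared significant at the 5% level.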
5
Wilcoxon Signed Rank Sum Test
• Nonparametric version of a paired samples t-test
• Study the difference between two variables (Story 1 vs. Story 2)
• A data step is necessary to create the difference of the two scores for each subject

data MYDATA;
  set MYDATA;
  diff = STORY1 - STORY2;   /* paired difference for each subject */
run;

proc univariate data = MYDATA;
  var diff;
run;
6
Wilcoxon Mann-Whitney Test
• Nonparametric version of the independent samples t-test
• Can be used when you do not assume that the dependent variable is a normally distributed interval variable
• Assume that the dependent variable is ordinal

proc npar1way data = mydata wilcoxon;
  class female;
  var write;
run;
7
Kruskal Wallis Test
• Used when you have one independent variable with two or more levels and an ordinal dependent variable
• Nonparametric version of ANOVA
• Generalized form of the Mann-Whitney test method, as it permits two or more groups

proc npar1way data = mydata;
  class prog;
  var write;
run;
8
Chi-Square Test
• Used when you want to see if there is a relationship between two categorical variables
• The chi-square test assumes that the expected value for each cell is 5 or higher. If this assumption is not met, use Fisher's exact test
• In SAS, the chisq option is used on the tables statement to obtain the test statistic and p-value

proc freq data = mydata;
  tables school*gender / chisq;
run;
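One way to check the expected-count assumption before trusting the chi-square result is to ask PROC FREQ to print the expected cell frequencies. This is a minimal sketch that reuses the placeholder data set and variable names from this slide; the expected option on the tables statement is the only addition:

```sas
/* Sketch: show expected cell counts alongside the observed counts
   so the "5 or higher" rule can be checked by eye. */
proc freq data = mydata;
  tables school*gender / chisq expected;
run;
```

If any expected count falls below 5, the Fisher's exact test described next is the safer choice.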
9
Fisher’s Exact Test
• Used when you want to conduct a chi-square test, but one or more of your cells has an expected frequency of less than 5
• Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is

proc freq data = mydata;
  tables school*race / fisher;
run;
10
Factorial Logistic Regression
• Used when you have two or more categorical independent variables and a dichotomous dependent variable
• The desc option on the proc logistic statement is necessary so that SAS models the odds of being female (i.e., female = 1)
• The expb option on the model statement tells SAS to show the exponentiated coefficients (i.e., the odds ratios)

proc logistic data = mydata desc;
  class prog schtyp;
  model female = prog schtyp prog*schtyp / expb;
run;
11
Nonparametric Correlation
• Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal)
• The values of the variables are converted to ranks and then correlated
• The spearman option on the proc corr statement tells SAS to perform a Spearman rank correlation instead of a Pearson correlation

proc corr data = mydata spearman;
  var read write;
run;
Nonparametric Tests: Advantages & Shortcomings
13
Common Nonparametric Tests
Normal theory based test | Corresponding nonparametric test | Purpose of test
t test for independent samples | Mann-Whitney U; Wilcoxon rank-sum | Compares two independent samples
Paired t test | Wilcoxon matched pairs signed-rank | Examines a set of differences
Pearson correlation coefficient | Spearman rank correlation coefficient | Assesses the linear association between two variables
One way analysis of variance (F test) | Kruskal-Wallis analysis of variance by ranks | Compares three or more groups
Two-way analysis of variance | Friedman two-way analysis of variance | Compares groups classified by two different factors
14
Why Use Nonparametric Tests?
When data are not normally distributed and the measurements at best contain rank-order information, computing the standard descriptive statistics (e.g., mean, standard deviation) is sometimes not the most informative way to summarize the data
15
Advantages of Nonparametrics
1. Nonparametric tests make less stringent demands of the data (resistant to outliers, shape of distribution)
2. Nonparametric procedures can sometimes be used to get a quick answer with little calculation
3. Nonparametric methods provide an air of objectivity when there is no reliable (universally recognized) underlying scale for the original data
16
Why Not Use All the Time?
Parametric tests are often preferred because:
• They are robust
• They have greater power efficiency (greater power relative to the sample size)
• They provide unique information (e.g., the interaction in a factorial design)
• Parametric and nonparametric tests often address two different types of questions
17
Different Nonparametric Tests Same Results?
• Different nonparametric tests may yield different results
• Advisable to run different nonparametric tests
18
Large Data Sets
• Nonparametric methods are most appropriate when the sample sizes are small
• When the data set is large, it often makes little sense to use nonparametric statistics at all
• How large is large enough?
19
Small Data Sets
What happens when you use a nonparametric test with data from a normal distribution?
• Greater incidence of Type II error
• The nonparametric tests lack statistical power with small samples
20
Shortcomings
Non-parametric tests cannot give very significant results for very small samples as all the possible rank-sums are fairly likely
They do not, without the addition of extra assumptions, give confidence intervals for the means or medians of the underlying distributions
Assume that the data can be ordered – power of the test diminished if there are lots of ties
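To make the first shortcoming concrete, here is a small worked count. It assumes an exact two-sample rank-sum test with no ties, so that every assignment of ranks to the two groups is equally likely under the null hypothesis; the group sizes are chosen for illustration only:

```latex
\begin{align*}
n_1 = n_2 = 3:&\quad \binom{6}{3} = 20 \text{ assignments}, \quad
  p_{\min} = \tfrac{1}{20} = 0.05 \text{ (one-sided)}, \quad \tfrac{2}{20} = 0.10 \text{ (two-sided)} \\
n_1 = n_2 = 4:&\quad \binom{8}{4} = 70 \text{ assignments}, \quad
  p_{\min} = \tfrac{1}{70} \approx 0.014 \text{ (one-sided)}, \quad \tfrac{2}{70} \approx 0.029 \text{ (two-sided)}
\end{align*}
```

With three observations per group, even the most extreme possible ranking cannot reach a two-sided p-value below 0.10.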
21
Standardization
Used in situations in which you need to adjust and rescale observations to have a different mean and standard deviation
Example: Midterm test scores are to be rescaled to have a mean of 75 and a standard deviation of 10

data midterm;
  input grade @@;
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;

proc univariate data=midterm plot;
  var grade;
run;
22
Moments
N 60    Mean 69.06667    Std Dev 11.60489

Stem Leaf              #  Boxplot
   9 5                 1     |
   9 00                2     |
   8 9                 1     |
   8 000124444         9     |
   7 7779              4  +-----+
   7 00011123334      11  |     |
   6 6666888899       10  *--+--*
   6 0012222333444    13  +-----+
   5 5999              4     |
   5 2                 1     |
   4 9                 1     |
   4 033               3     |
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1
23
proc standard data=midterm out=adjusted mean=75 std=10; var grade; run;
The new data set, ADJUSTED, has one variable, also called GRADE, which has a mean of 75 and a standard deviation of 10. For example, the grade of 95 in the MIDTERM data set becomes a grade of about 97.3 in the ADJUSTED data set: 0.86(95 - 69.1) + 75 ≈ 97.3.

With $\bar{x} = 69.06667$, $s_x = 11.60489$, $\bar{y} = 75.0$, and $s_y = 10.0$:
$$y = \frac{s_y}{s_x}\,(x - \bar{x}) + \bar{y} = \frac{10.0}{11.60489}\,(x - 69.06667) + 75.0$$
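The same rescaling can be reproduced by hand as a cross-check on PROC STANDARD. This is only a sketch: the output table name adjusted_check and the column name grade_adj are made up for illustration, and it relies on PROC SQL remerging the summary statistics mean() and std() back onto every row (SAS writes a note to the log when it does this):

```sas
/* Sketch: rescale GRADE to mean 75 and standard deviation 10 by hand.
   adjusted_check and grade_adj are hypothetical names. */
proc sql;
  create table adjusted_check as
  select grade,
         75 + 10 * (grade - mean(grade)) / std(grade) as grade_adj
  from midterm;
quit;

proc print data=adjusted_check;
run;
```

The grade_adj values should agree with GRADE in the ADJUSTED data set, up to rounding.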
24
The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure. A variety of location and scale measures are provided, including estimates that are resistant to outliers and clustering. Some of the well-known standardization methods such as mean, median, std, range, Huber's estimate, Tukey's biweight estimate, and Andrews' wave estimate are available in the STDIZE procedure.
In addition, you can multiply each standardized value by a constant and add a constant. Thus, the final output value is
result = add + multiply × [(original - location) / scale]
where
result = final output value
add = constant to add (ADD= option)
multiply = constant to multiply by (MULT= option)
original = original input value
location = location measure
scale = scale measure
PROC STDIZE can also find quantiles in one pass of the data, a capability that is especially useful for very large data sets. With such data sets, the UNIVARIATE procedure may have high or excessive memory or time requirements.
25
options ls=78 ps=200 nocenter nodate;

data midterm;
  input grade @@;
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;

/* method=std subtracts the mean and divides by the standard deviation;
   mult=10 and add=75 then rescale to standard deviation 10 and mean 75 */
proc stdize data=midterm out=adjusted method=std add=75 mult=10;
  var grade;
run;

proc print data=adjusted;
run;

proc univariate data=adjusted;
  var grade;
run;
26
One-Sample Tests of Location
What is the equivalent of the one-sample normal test or one-sample t test for the hypothesis that the true mean is equal to a specified value?
$$H_o: \mu = \mu_o \qquad H_A: \mu \neq \mu_o$$
Sign Test
Wilcoxon Signed Rank Test
$$H_o: \eta = \eta_o \qquad H_A: \eta \neq \eta_o$$
where $\eta$ is the (unknown) median of the population.
27
PROC UNIVARIATE in SAS automatically performs three tests of location but it does so by testing if the "typical" value is zero.
data midterm;
  input grade @@;
  /* If GRADE is typically 75, then GRADE-75 should typically be zero. */
  diff=grade-75;
  label diff='Points above 75';
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;

proc univariate data=midterm;
  var diff;
run;

The three tests reported: t test, sign test, Wilcoxon signed rank test
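As an alternative to creating the DIFF variable, more recent releases of PROC UNIVARIATE accept a mu0= option that sets the hypothesized location directly. A minimal sketch, assuming the MIDTERM data set above and a SAS release that supports mu0=:

```sas
/* Sketch: test whether the typical GRADE is 75 without a DIFF variable.
   mu0= sets the null value used by the tests of location. */
proc univariate data=midterm mu0=75;
  var grade;
run;
```

The tests of location should match those obtained from DIFF, since subtracting 75 from every grade is exactly what mu0=75 does internally.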
28
Univariate Procedure
Variable=DIFF Points above 75
Moments
N          60          Sum Wgts   60
Mean       -5.93333    Sum        -356
Std Dev    11.60489    Variance   134.6734
Skewness   -0.17772    Kurtosis   0.250711
USS        10058       CSS        7945.733
CV         -195.588    Std Mean   1.498185
T:Mean=0   -3.96035    Pr>|T|     0.0002     <- T-test (=75)
Num ^= 0   60          Num > 0    17
M(Sign)    -13         Pr>=|M|    0.0011     <- Sign Test (=75)
Sgn Rank   -481.5      Pr>=|S|    0.0002     <- Wilcoxon Signed Rank Test (=75)
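Two of the statistics in this listing can be reproduced by hand from the other entries, which may help in reading the output. The sketch below uses only numbers printed above; the formula for M is the one SAS uses for its sign statistic, with 60 - 17 = 43 negative differences:

```latex
\begin{align*}
t &= \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{-5.93333}{11.60489/\sqrt{60}} = \frac{-5.93333}{1.498185} \approx -3.960 \\
M &= \frac{n^{+} - n^{-}}{2} = \frac{17 - 43}{2} = -13
\end{align*}
```

All three tests point the same way: the typical grade is below 75.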
29
Ranking
Many nonparametric procedures rely on the relative ordering, or ranking, of the observations.
Suppose a new Florida resident wants to see if the prices of houses in Gainesville are higher than the prices of homes in smaller towns near Gainesville. She collects a random sample of 10 prices for houses on the market in Gainesville and 10 prices of homes in other cities in Alachua County.
30
data homes;
  input location $ price @@;
  datalines;
Gville 74500 Gville 269000 Gville 94500 Gville 86900 Gville 99900
Gville 91500 Gville 70000 Gville 78000 Gville 289000 Gville 114000
County 32000 County 125000 County 105900 County 120000 County 139900
County 72000 County 85000 County 74500 County 199500 County 2200000
;
run;

• Does one location tend to have higher-ranked prices than the other?
• Does one location tend to have higher average prices than the other?
• Influential observation: the County home priced at 2200000
• $H_o$: no difference between the two locations (difference = 0)
• More higher-ranking homes in the county than expected at random.
31
How to get the ranks?

OBS  LOCATION  PRICE        RANKCOST
  1  Gville       $74,500        4.5
  2  Gville      $269,000       18.0
  3  Gville       $94,500       10.0
  4  Gville       $86,900        8.0
  5  Gville       $99,900       11.0
  6  Gville       $91,500        9.0
  7  Gville       $70,000        2.0
  8  Gville       $78,000        6.0
  9  Gville      $289,000       19.0
 10  Gville      $114,000       13.0
 11  County       $32,000        1.0
 12  County      $125,000       15.0
 13  County      $105,900       12.0
 14  County      $120,000       14.0
 15  County      $139,900       16.0
 16  County       $72,000        3.0
 17  County       $85,000        7.0
 18  County       $74,500        4.5
 19  County      $199,500       17.0
 20  County    $2,200,000       20.0

proc rank data=homes out=rankdata ties=mean;
  var price;
  ranks rankcost;
run;

proc print data=rankdata;
  format price dollar10.;
run;
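Once the ranks exist, the rank sums that drive the Wilcoxon test can be inspected directly. A minimal sketch, assuming the RANKDATA data set created by the PROC RANK step above:

```sas
/* Sketch: sum and average the ranks within each location.
   These are the "Sum of Scores" and "Mean Score" that a
   Wilcoxon rank sum analysis is built on. */
proc means data=rankdata n sum mean;
  class location;
  var rankcost;
run;
```

Because ties=mean was used, tied prices share the average of the ranks they would have occupied (for example, the two $74,500 homes each get rank 4.5).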
32
Another way to rank the data would be to create groups of least expensive, inexpensive, moderate, expensive, and very expensive price ranges. PROC RANK can do this with the GROUPS option.
OBS  LOCATION  PRICE        PRICEGRP
  1  Gville       $74,500          1
  2  Gville      $269,000          4
  3  Gville       $94,500          2
  4  Gville       $86,900          1
  5  Gville       $99,900          2
  6  Gville       $91,500          2
  7  Gville       $70,000          0
  8  Gville       $78,000          1
  9  Gville      $289,000          4
 10  Gville      $114,000          3
 11  County       $32,000          0
 12  County      $125,000          3
 13  County      $105,900          2
 14  County      $120,000          3
 15  County      $139,900          3
 16  County       $72,000          0
 17  County       $85,000          1
 18  County       $74,500          1
 19  County      $199,500          4
 20  County    $2,200,000          4

proc rank data=homes out=rankdata groups=5;
  var price;
  ranks pricegrp;
run;

proc print data=rankdata;
  format price dollar10.;
run;

Grouping starts at 0.
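The group numbers can be traced back to the ranks. The rule sketched below is the one given in the SAS documentation for the GROUPS= option rather than anything stated on the slide, with k the number of groups and n the number of nonmissing values:

```latex
\begin{align*}
\text{group} &= \left\lfloor \frac{\text{rank} \times k}{n + 1} \right\rfloor, \qquad k = 5,\; n = 20 \\
\$32{,}000\ (\text{rank } 1):&\quad \lfloor 1 \times 5 / 21 \rfloor = \lfloor 0.24 \rfloor = 0 \\
\$74{,}500\ (\text{rank } 4.5):&\quad \lfloor 4.5 \times 5 / 21 \rfloor = \lfloor 1.07 \rfloor = 1 \\
\$289{,}000\ (\text{rank } 19):&\quad \lfloor 19 \times 5 / 21 \rfloor = \lfloor 4.52 \rfloor = 4
\end{align*}
```

These agree with the PRICEGRP values in the listing, and the floor function explains why the first group is numbered 0.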
33
PROC RANK can be used to produce a better normal probability plot than the one produced by PROC UNIVARIATE.
We use PROC RANK to calculate the normal scores. If the data are indeed normally distributed, the Blom normal scores (BLOM option) should provide the best straight line. Consider the 20 homes to be a random sample of all homes for sale in Alachua County, and we want to see if price or log(price) more closely follows a normal distribution.

data homes;
  set homes;
  logprice=log(price);
run;

proc rank data=homes out=rankdata normal=blom;
  var price logprice;
  ranks norm1 norm2;
run;

proc plot data=rankdata;
  plot price*norm1 logprice*norm2;
run;
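For reference, the Blom normal score for an observation with rank r_i out of n is usually written as below. This is the standard textbook form of Blom's approximation, stated here as background rather than taken from the slide:

```latex
y_i = \Phi^{-1}\!\left( \frac{r_i - 3/8}{n + 1/4} \right),
\qquad \text{e.g. } n = 20,\; r_i = 1:\quad
\Phi^{-1}\!\left( \frac{0.625}{20.25} \right) \approx -1.87
```

By symmetry the largest of the 20 observations gets a score of about +1.87, which is why the horizontal axis of the plots runs from -2 to 2.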
34
[Plot of PRICE*NORM1. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: PRICE (0 to 3,000,000); horizontal axis: RANK FOR VARIABLE PRICE (normal scores, -2 to 2). One point lies above 2,000,000; the rest are clustered near the bottom of the plot.]
Plot of LOGPRICE*NORM2. Legend: A = 1 obs, B = 2 obs, etc.
35
Not very normal looking!
36
[Plot of LOGPRICE*NORM2. Vertical axis: LOGPRICE (10 to 16); horizontal axis: RANK FOR VARIABLE LOGPRICE (normal scores, -2 to 2).]
37
Better, but still not very normal!
38
Comparing Two or More Groups
• The nonparametric version of analysis of variance is based on ranks.
• The Mann-Whitney test and the Wilcoxon rank sum test are equivalent nonparametric techniques to compare two groups, while the Kruskal-Wallis test is ordinarily used to compare three or more groups.
• All of these are available in PROC NPAR1WAY (nonparametric 1-way analysis of variance) in SAS.
39
NPAR1WAY
PROC NPAR1WAY performs tests for location and scale differences based on the following scores of a response variable: Wilcoxon, median, Van der Waerden, Savage, Siegel-Tukey, Ansari-Bradley, Klotz, and Mood scores. Additionally, PROC NPAR1WAY provides tests using the raw data as scores. When the data are classified into two samples, tests are based on simple linear rank statistics. When the data are classified into more than two samples, tests are based on one-way ANOVA statistics. Both asymptotic and exact p-values are available for these tests.
PROC NPAR1WAY also calculates the following empirical distribution function (EDF) statistics: the Kolmogorov-Smirnov statistic, the Cramer-von Mises statistic, and, when the data are classified into only two samples, the Kuiper statistic. These statistics test whether the distribution of a variable is the same across different groups.
40
proc npar1way wilcoxon data=homes;
  class location;
  var price;
run;

N P A R 1 W A Y  P R O C E D U R E

Wilcoxon Scores (Rank Sums) for Variable PRICE
Classified by Variable LOCATION

                  Sum of       Expected    Std Dev       Mean
LOCATION    N     Scores       Under H0    Under H0      Score
Gville     10     100.500000   105.0       13.2237824    10.0500
Other      10     109.500000   105.0       13.2237824    10.9500
Average Scores Were Used for Ties

Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)
S = 100.500   Z = -.302485   Prob > |Z| = 0.7623
T-Test Approx. Significance = 0.7656

Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 0.11580   DF = 1   Prob > CHISQ = 0.7336
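The entries in this listing follow the same recipe as the hand calculation for the corn example. The sketch below recomputes them from the printed values, with n_1 = n_2 = 10 and N = 20; the standard deviation is taken from the listing because SAS adjusts it for the tied prices:

```latex
\begin{align*}
\mu &= \frac{n_1 (N + 1)}{2} = \frac{10 \times 21}{2} = 105 \\
Z &= \frac{S - \mu + 0.5}{\sigma} = \frac{100.5 - 105 + 0.5}{13.2238} \approx -0.3025 \\
\chi^2_{KW} &= \left( \frac{S - \mu}{\sigma} \right)^2 = \left( \frac{-4.5}{13.2238} \right)^2 \approx 0.1158
\end{align*}
```

Without the tie adjustment the standard deviation would be $\sqrt{n_1 n_2 (N+1)/12} = \sqrt{175} \approx 13.229$, so the correction is small here. With two groups the Kruskal-Wallis statistic is simply the squared, uncorrected Wilcoxon Z, which is why the two p-values are so close.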
41
Other Rank Tests
When there are two factors with no interaction, as in a randomized complete block design, Friedman's chi-square test is a non-parametric test that can be used to examine treatment differences. Friedman's test is available in SAS under PROC FREQ, but it is fairly complicated to perform. PROC FREQ also offers versions of rank sum tests for ordinal data. See SAS documentation for details.
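As a rough sketch of how Friedman's test is usually requested through PROC FREQ: the blocks are treated as strata and the Cochran-Mantel-Haenszel statistics are computed on rank scores. The data set name rcbd and the variable names block, treatment, and response are hypothetical, chosen only for illustration:

```sas
/* Sketch: Friedman's chi-square for a randomized complete block design.
   rcbd, block, treatment and response are assumed names.
   scores=rank ranks the response within each block (stratum);
   noprint suppresses the many small block-level tables. */
proc freq data=rcbd;
  tables block*treatment*response / cmh2 scores=rank noprint;
run;
```

In the resulting Cochran-Mantel-Haenszel table, the "Row Mean Scores Differ" line is the statistic that corresponds to Friedman's chi-square.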
42
The LIFETEST procedure can be used with data that may be right-censored to compute nonparametric estimates of the survival distribution and to compute rank tests for association of the response variable with other variables. The survival estimates are computed within defined strata levels, and the rank tests are pooled over the strata and are therefore adjusted for strata differences.
43
Non-Parametric Correlation
The Pearson Product Moment correlation coefficient measures the strength of the tendency for two variables X and Y to follow a straight line.
Suppose we are more interested in measuring the tendency for X to increase or decrease with Y, without necessarily assuming a strictly linear relationship.
• Spearman correlation coefficient - Pearson correlation between ranks of X and ranks of Y.
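• Kendall correlation coefficient - based on the probability of observing Y2 > Y1 when X2 > X1 (a concordant pair) compared with the probability of observing Y2 < Y1 (a discordant pair).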
44
• Twelve students take a test, and we want to see if the students who finished the test early had higher scores than those who finished later.
• We don't know the exact time that each student spent on the test, but we do know the order in which the tests were turned in to be graded.
• With only rank data for the time variable, the Pearson linear correlation coefficient would not be appropriate. (The last person to turn in the test could have taken 30 minutes, an hour, or two hours, and the guess of the exact time would greatly influence the resulting Pearson correlation.)
• Both the Spearman and Kendall correlation coefficients could legitimately be used.
data students;
  input order grade @@;
  datalines;
1 90 2 74 3 76 4 60 5 68 6 86
7 92 8 60 9 78 10 70 11 78 12 64
;
run;

proc plot data=students;
  plot grade*order;
run;

proc corr data=students spearman kendall;
  var grade;
  with order;
run;
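Since the Spearman coefficient is the Pearson correlation between ranks, it can be reproduced in two steps as a check. A minimal sketch, assuming the STUDENTS data set above; the output data set ranked and the variables r_order and r_grade are made-up names:

```sas
/* Sketch: rank both variables (tied grades share the average rank),
   then correlate the ranks with an ordinary Pearson correlation. */
proc rank data=students out=ranked ties=mean;
  var order grade;
  ranks r_order r_grade;
run;

proc corr data=ranked pearson;
  var r_grade;
  with r_order;
run;
```

The Pearson coefficient printed here should equal the Spearman coefficient from the spearman option, because PROC CORR assigns average ranks to ties as well.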
45
[Plot of GRADE*ORDER. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: GRADE (60 to 100); horizontal axis: ORDER (1 to 12).]
46
47
Correlation Analysis
Spearman Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

             GRADE
ORDER     -0.17544
            0.5855

Kendall Tau b Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

             GRADE
ORDER     -0.12309
            0.5815