
Experimental Design & Analysis

Nonparametric Methods

April 17, 2007

DOCTORAL SEMINAR, SPRING SEMESTER 2007


Nonparametric Tests

Occasions for use?

• The response variable (residual) cannot logically be assumed to be normally distributed, a key assumption of ANOVA models
• Limited data
• Counts and ranks are the relevant units instead of means

Wilcoxon Rank Sum Test

• Nonparametric version of an independent samples t-test
• Example of corn yield as a function of weeding

Yield   153.1   156.0   158.6   165.0   166.7   172.2   176.4   176.9
Rank        1       2       3       4       5       6       7       8

Treatment   Sum of Ranks
No weeds    23
Weeds       13

Wilcoxon Rank Sum Test

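Calculate the test statistic from the mean μ and the standard deviation σ of the rank sum under the null hypothesis:

\[ \mu = \frac{n_1(N+1)}{2} = \frac{4(8+1)}{2} = \frac{36}{2} = 18 \]

\[ \sigma = \sqrt{\frac{n_1 n_2 (N+1)}{12}} = \sqrt{\frac{(4)(4)(8+1)}{12}} = \sqrt{\frac{144}{12}} = \sqrt{12} = 3.464 \]

proc univariate data = MYDATA; var CORN; run;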
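The mean and standard deviation worked out on this slide can be checked with a few lines of code. The sketch below is in Python rather than SAS, and the final step (standardizing the "No weeds" rank sum of 23 without a continuity correction) is an assumption about how μ and σ would be used; the variable names are illustrative only.

```python
import math

# Values taken from the corn-yield slide.
n1, n2 = 4, 4            # observations per treatment
N = n1 + n2              # total sample size
W = 23                   # rank sum for the "No weeds" group

mu = n1 * (N + 1) / 2                      # mean of W under H0
sigma = math.sqrt(n1 * n2 * (N + 1) / 12)  # standard deviation of W under H0

# Assumption: standardize W without a continuity correction.
z = (W - mu) / sigma

print(f"mu = {mu}, sigma = {sigma:.3f}, z = {z:.3f}")
```

Under these assumptions z comes out at about 1.44, which would then be compared with the standard normal distribution.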


Wilcoxon Signed Rank Sum Test

• Nonparametric version of a paired samples t-test
• Study the difference between two variables (Story 1 vs. Story 2)
• A data step is necessary to create the difference of the two scores for each subject

data MYDATA; set MYDATA;
  diff = STORY1 - STORY2;
run;

proc univariate data = MYDATA;
  var diff;
run;

Wilcoxon Mann-Whitney Test

• Nonparametric version of the independent samples t-test
• Can be used when you do not assume that the dependent variable is a normally distributed interval variable
• You need only assume that the dependent variable is ordinal

proc npar1way data = mydata wilcoxon;

class female;

var write;

run;


Kruskal Wallis Test

• Used when you have one independent variable with two or more levels and an ordinal dependent variable
• Nonparametric version of ANOVA
• Generalized form of the Mann-Whitney test method, as it permits two or more groups

proc npar1way data = mydata; class prog; var write; run;

Chi-Square Test

Used when you want to see if there is a relationship between two categorical variables

The chi-square test assumes that the expected value for each cell is 5 or higher. If this assumption is not met, use Fisher's exact test.

In SAS, the chisq option is used on the tables statement to obtain test statistic and p-value

proc freq data = mydata; tables school*gender / chisq; run;


Fisher’s Exact Test

• Used when you want to conduct a chi-square test, but one or more of your cells has an expected frequency of less than 5
• Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is

proc freq data = mydata; tables school*race / fisher; run;

Factorial Logistic Regression

• Used when you have two or more categorical independent variables and a dichotomous dependent variable
• The desc option on the proc logistic statement is necessary so that SAS models the odds of being female (i.e., female = 1)
• The expb option on the model statement tells SAS to show the exponentiated coefficients (i.e., the odds ratios)

proc logistic data = mydata desc; class prog schtyp; model female = prog schtyp prog*schtyp / expb; run;

Nonparametric Correlation

Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal)

The values of the variables are converted to ranks and then correlated

The spearman option on the proc corr statement tells SAS to perform a Spearman rank correlation instead of a Pearson correlation

proc corr data = mydata spearman; var read write; run;

Nonparametric Tests: Advantages & Shortcomings


Common Nonparametric Tests

Normal theory based test               Corresponding nonparametric test              Purpose of test
t test for independent samples         Mann-Whitney U; Wilcoxon rank-sum             Compares two independent samples
Paired t test                          Wilcoxon matched pairs signed-rank            Examines a set of differences
Pearson correlation coefficient        Spearman rank correlation coefficient         Assesses the linear association between two variables
One-way analysis of variance (F test)  Kruskal-Wallis analysis of variance by ranks  Compares three or more groups
Two-way analysis of variance           Friedman two-way analysis of variance         Compares groups classified by two different factors

Why Use Nonparametric Tests?

When data are not normally distributed and the measurements at best contain rank-order information, computing the standard descriptive statistics (e.g., mean, standard deviation) is sometimes not the most informative way to summarize the data


Advantages of Nonparametrics

1. Nonparametric tests make less stringent demands of the data (resistant to outliers, shape of distribution)

2. Nonparametric procedures can sometimes be used to get a quick answer with little calculation

3. Nonparametric methods provide an air of objectivity when there is no reliable (universally recognized) underlying scale for the original data


Why Not Use All the Time?

Parametric tests are often preferred because:

• They are robust
• They have greater power efficiency (greater power relative to the sample size)
• They provide unique information (e.g., the interaction in a factorial design)

Parametric and nonparametric tests often address two different types of questions

Different Nonparametric Tests, Same Results?

• Different nonparametric tests may yield different results
• Advisable to run different nonparametric tests

Large Data Sets

• Nonparametric methods are most appropriate when the sample sizes are small
• When the data set is large, it often makes little sense to use nonparametric statistics at all
• How large is large enough?

Small Data Sets

What happens when you use a nonparametric test with data from a normal distribution?

• Greater incidence of Type II error
• The nonparametric tests lack statistical power with small samples

Shortcomings

Non-parametric tests cannot give very significant results for very small samples as all the possible rank-sums are fairly likely

They do not, without the addition of extra assumptions, give confidence intervals for the means or medians of the underlying distributions

Assume that the data can be ordered – power of the test diminished if there are lots of ties


Standardization

Used in situations in which you need to adjust and rescale observations to have a different mean and standard deviation

Example: Midterm test scores are to be rescaled to have a mean of 75 and a standard deviation of 10

data midterm;
  input grade @@;
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
proc univariate data=midterm plot;
  var grade;
run;

Moments

N         60
Mean      69.06667
Std Dev   11.60489

Stem Leaf             #  Boxplot
   9 5                1     |
   9 00               2     |
   8 9                1     |
   8 000124444        9     |
   7 7779             4  +-----+
   7 00011123334     11  |     |
   6 6666888899      10  *--+--*
   6 0012222333444   13  +-----+
   5 5999             4     |
   5 2                1     |
   4 9                1     |
   4 033              3     |
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1

proc standard data=midterm out=adjusted mean=75 std=10;
  var grade;
run;

The new data set, ADJUSTED, has one variable, also called GRADE, which has a mean of 75 and a standard deviation of 10. For example, the grade of 95 in the MIDTERM data set becomes a grade of about 97.3 in the ADJUSTED data set: 0.86(95 - 69.1) + 75 ≈ 97.3.

With \(\bar{x} = 69.0667\), \(s_x = 11.6049\), \(\bar{y} = 75.0\), and \(s_y = 10.0\):

\[ y = \frac{s_y}{s_x}(x - \bar{x}) + \bar{y} = \frac{10.0}{11.6049}(x - 69.0667) + 75.0 \]

The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure. A variety of location and scale measures are provided, including estimates that are resistant to outliers and clustering. Some of the well-known standardization methods such as mean, median, std, range, Huber's estimate, Tukey's biweight estimate, and Andrew's wave estimate are available in the STDIZE procedure.

In addition, you can multiply each standardized value by a constant and add a constant. Thus, the final output value is

result = add + multiply × (original - location) / scale

where

result = final output value

add = constant to add (ADD= option)

multiply = constant to multiply by (MULT= option)

original = original input value

location = location measure

scale = scale measure

PROC STDIZE can also find quantiles in one pass of the data, a capability that is especially useful for very large data sets. With such data sets, the UNIVARIATE procedure may have high or excessive memory or time requirements.


options ls=78 ps=200 nocenter nodate;

data midterm;
  input grade @@;
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;

proc stdize data=midterm out=adjusted method=std add=75 mult=10;
  var grade;
run;

proc print data=adjusted;
run;

proc univariate data=adjusted;
  var grade;
run;
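The rescaling rule, result = add + multiply × (original - location) / scale, can be traced by hand. The following is a Python sketch rather than SAS; it assumes the location is the sample mean and the scale is the sample standard deviation (n - 1 divisor), which is how the example with a mean of 75 and a standard deviation of 10 is set up. The function name `standardize` is hypothetical.

```python
import statistics

grades = [64, 71, 80, 69, 55, 84, 77, 63, 68, 90, 66, 61, 84, 43, 80,
          66, 68, 89, 71, 59, 52, 62, 60, 79, 43, 63, 68, 72, 60, 77,
          80, 73, 40, 74, 63, 68, 95, 66, 59, 70, 73, 62, 64, 62, 77,
          81, 73, 64, 82, 59, 84, 70, 70, 71, 49, 90, 84, 66, 69, 62]

def standardize(values, add=75.0, mult=10.0):
    """Return add + mult * (x - mean) / std for each x (sample std, n - 1)."""
    location = statistics.mean(values)
    scale = statistics.stdev(values)
    return [add + mult * (x - location) / scale for x in values]

adjusted = standardize(grades)
new_95 = adjusted[grades.index(95)]  # rescaled value of the top grade

print(round(statistics.mean(adjusted), 4),
      round(statistics.stdev(adjusted), 4),
      round(new_95, 2))
```

The rescaled scores keep their original ordering; only the mean and spread change.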


One-Sample Tests of Location

What is the equivalent of the one-sample normal test or one-sample t test for the hypothesis that the true mean is equal to a specified value?

\[ H_0: \mu = \mu_0 \qquad H_A: \mu \neq \mu_0 \]

• Sign Test
• Wilcoxon Signed Rank Test

\[ H_0: \eta = \eta_0 \qquad H_A: \eta \neq \eta_0 \]

where η is the (unknown) median of the population.

PROC UNIVARIATE in SAS automatically performs three tests of location but it does so by testing if the "typical" value is zero.

data midterm;
  input grade @@;
  /* If GRADE is typically 75, then GRADE-75 should typically be zero. */
  diff = grade - 75;
  label diff = 'Points above 75';
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;

proc univariate data=midterm;
  var diff;
run;

The three tests are the t test, the sign test, and the Wilcoxon signed rank test.

Univariate Procedure

Variable=DIFF     Points above 75

Moments

N          60          Sum Wgts   60
Mean       -5.93333    Sum        -356
Std Dev    11.60489    Variance   134.6734
Skewness   -0.17772    Kurtosis   0.250711
USS        10058       CSS        7945.733
CV         -195.588    Std Mean   1.498185
T:Mean=0   -3.96035    Pr>|T|     0.0002
Num ^= 0   60          Num > 0    17
M(Sign)    -13         Pr>=|M|    0.0011
Sgn Rank   -481.5      Pr>=|S|    0.0002

• T:Mean=0 and Pr>|T|: t test (mean = 75)
• M(Sign) and Pr>=|M|: sign test (median = 75)
• Sgn Rank and Pr>=|S|: Wilcoxon signed rank test (median = 75)
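Two of the statistics in this listing are easy to reproduce by hand. The sketch below is in Python rather than SAS; it assumes the sign statistic is M = (number positive - number negative) / 2 and that the sign test p-value is an exact two-sided binomial probability. The signed rank statistic is not recomputed here.

```python
import math
import statistics

grades = [64, 71, 80, 69, 55, 84, 77, 63, 68, 90, 66, 61, 84, 43, 80,
          66, 68, 89, 71, 59, 52, 62, 60, 79, 43, 63, 68, 72, 60, 77,
          80, 73, 40, 74, 63, 68, 95, 66, 59, 70, 73, 62, 64, 62, 77,
          81, 73, 64, 82, 59, 84, 70, 70, 71, 49, 90, 84, 66, 69, 62]
diff = [g - 75 for g in grades]
n = len(diff)

# One-sample t statistic for H0: mean of diff = 0.
t = statistics.mean(diff) / (statistics.stdev(diff) / math.sqrt(n))

# Sign statistic: zeros are dropped, then positives and negatives are compared.
n_pos = sum(d > 0 for d in diff)
n_neg = sum(d < 0 for d in diff)
M = (n_pos - n_neg) / 2

# Exact two-sided binomial p-value for the sign test (assumed method).
k = min(n_pos, n_neg)
n_nonzero = n_pos + n_neg
tail = sum(math.comb(n_nonzero, i) for i in range(k + 1)) / 2 ** n_nonzero
p_sign = min(1.0, 2 * tail)

print(round(t, 5), M, round(p_sign, 4))
```

Because the sign test uses only the direction of each difference, it is unaffected by how far the extreme grades lie from 75.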


Ranking

Many nonparametric procedures rely on the relative ordering, or ranking, of the observations.

Suppose a new Florida resident wants to see if the prices of houses in Gainesville are higher than the prices of homes in smaller towns near Gainesville. She collects a random sample of 10 prices for houses on the market in Gainesville and 10 prices of homes in other cities in Alachua County.


data homes;
  input location $ price @@;
  datalines;
Gville 74500 Gville 269000 Gville 94500 Gville 86900 Gville 99900
Gville 91500 Gville 70000 Gville 78000 Gville 289000 Gville 114000
County 32000 County 125000 County 105900 County 120000 County 139900
County 72000 County 85000 County 74500 County 199500 County 2200000
;
run;

• Does one location tend to have higher-ranked prices than the other?
• Does one location tend to have higher average prices than the other?

Influential observation: the County home priced at 2,200,000

\[ H_0: \mu_1 - \mu_2 = 0 \]

More higher-ranking homes in the county than expected at random.

How to get the ranks?

proc rank data=homes out=rankdata ties=mean;
  var price;
  ranks rankcost;
run;

proc print data=rankdata;
  format price dollar10.;
run;

OBS  LOCATION       PRICE  RANKCOST
  1  Gville       $74,500       4.5
  2  Gville      $269,000      18.0
  3  Gville       $94,500      10.0
  4  Gville       $86,900       8.0
  5  Gville       $99,900      11.0
  6  Gville       $91,500       9.0
  7  Gville       $70,000       2.0
  8  Gville       $78,000       6.0
  9  Gville      $289,000      19.0
 10  Gville      $114,000      13.0
 11  County       $32,000       1.0
 12  County      $125,000      15.0
 13  County      $105,900      12.0
 14  County      $120,000      14.0
 15  County      $139,900      16.0
 16  County       $72,000       3.0
 17  County       $85,000       7.0
 18  County       $74,500       4.5
 19  County      $199,500      17.0
 20  County    $2,200,000      20.0
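To make the TIES=MEAN behavior concrete, here is a Python sketch (not SAS) that assigns average ranks to tied values. The prices are the ones shown in the ranked listing; the function name `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

# Observations 1-10 are Gville, 11-20 are County, as in the ranked listing.
prices = [74500, 269000, 94500, 86900, 99900, 91500, 70000, 78000, 289000, 114000,
          32000, 125000, 105900, 120000, 139900, 72000, 85000, 74500, 199500, 2200000]
rankcost = average_ranks(prices)
gville_sum = sum(rankcost[:10])
county_sum = sum(rankcost[10:])

print(rankcost)
print(gville_sum, county_sum)
```

The two homes priced at 74,500 would occupy ranks 4 and 5, so each receives 4.5. Note that the 2,200,000 home simply gets rank 20, no matter how extreme its price is.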


Another way to rank the data would be to create groups of least expensive, inexpensive, moderate, expensive, and very expensive price ranges. PROC RANK can do this with the GROUPS option.

proc rank data=homes out=rankdata groups=5;
  var price;
  ranks pricegrp;
run;

proc print data=rankdata;
  format price dollar10.;
run;

OBS  LOCATION       PRICE  PRICEGRP
  1  Gville       $74,500         1
  2  Gville      $269,000         4
  3  Gville       $94,500         2
  4  Gville       $86,900         1
  5  Gville       $99,900         2
  6  Gville       $91,500         2
  7  Gville       $70,000         0
  8  Gville       $78,000         1
  9  Gville      $289,000         4
 10  Gville      $114,000         3
 11  County       $32,000         0
 12  County      $125,000         3
 13  County      $105,900         2
 14  County      $120,000         3
 15  County      $139,900         3
 16  County       $72,000         0
 17  County       $85,000         1
 18  County       $74,500         1
 19  County      $199,500         4
 20  County    $2,200,000         4

Grouping starts at 0.
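The group numbers in the listing are consistent with the rule floor(rank × k / (n + 1)), where k is the number of groups and n is the number of observations. Treat that rule as an assumption about how PROC RANK assigns GROUPS= values and check the SAS documentation before relying on it. The Python sketch below applies it to the average ranks from the RANKCOST listing; `rank_groups` is a hypothetical helper name.

```python
import math

# Average ranks copied from the RANKCOST listing (observations 1-20).
rankcost = [4.5, 18.0, 10.0, 8.0, 11.0, 9.0, 2.0, 6.0, 19.0, 13.0,
            1.0, 15.0, 12.0, 14.0, 16.0, 3.0, 7.0, 4.5, 17.0, 20.0]

def rank_groups(ranks, k):
    """Assumed grouping rule: floor(rank * k / (n + 1)), so groups run 0..k-1."""
    n = len(ranks)
    return [math.floor(r * k / (n + 1)) for r in ranks]

pricegrp = rank_groups(rankcost, k=5)
print(pricegrp)
```

Dividing by n + 1 rather than n is what keeps the largest rank in group k - 1 instead of spilling into a group numbered k.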


PROC RANK can be used to produce a better normal probability plot than the one produced by PROC UNIVARIATE.

We use PROC RANK to calculate the normal scores. If the data are indeed normally distributed, the Blom-calculated scores (BLOM option) should provide the best straight line. Consider the 20 homes to be a random sample of all homes for sale in Alachua County, and we want to see if price or log(price) more closely follows a normal distribution.

data homes; set homes;
  logprice = log(price);
run;

proc rank data=homes out=rankdata normal=blom;
  var price logprice;
  ranks norm1 norm2;
run;

proc plot data=rankdata;
  plot price*norm1 logprice*norm2;
run;
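For readers who want to see what a Blom normal score is, the sketch below computes it in Python rather than SAS. It assumes the usual Blom formula, the inverse standard normal CDF evaluated at (r - 3/8) / (n + 1/4) for rank r out of n; the function name `blom_scores` is hypothetical, and the ranks 1 to 20 are illustrative (no ties).

```python
from statistics import NormalDist

def blom_scores(ranks):
    """Blom normal scores: inverse normal CDF of (r - 3/8) / (n + 1/4)."""
    n = len(ranks)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]

# Illustrative ranks 1..20, as for 20 observations with no ties.
scores = blom_scores(list(range(1, 21)))
print([round(s, 3) for s in scores])
```

Plotting the observed values against these scores gives a normal probability plot: points from a normal sample should fall close to a straight line.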


[Plot of PRICE*NORM1: PRICE (0 to 3,000,000) against RANK FOR VARIABLE PRICE (-2 to 2). Legend: A = 1 obs, B = 2 obs, etc.]

Not very normal looking!

[Plot of LOGPRICE*NORM2: LOGPRICE (10 to 16) against RANK FOR VARIABLE LOGPRICE (-2 to 2). Legend: A = 1 obs, B = 2 obs, etc.]

Better, but still not very normal!

Comparing Two or More Groups

• The nonparametric version of analysis of variance is based on ranks.

• The Mann-Whitney test and the Wilcoxon rank sum test are equivalent nonparametric techniques to compare two groups, while the Kruskal-Wallis test is ordinarily used to compare three or more groups.

• All of these are available in PROC NPAR1WAY (nonparametric 1-way analysis of variance) in SAS.


NPAR1WAY

PROC NPAR1WAY performs tests for location and scale differences based on the following scores of a response variable: Wilcoxon, median, Van der Waerden, Savage, Siegel-Tukey, Ansari-Bradley, Klotz, and Mood scores. Additionally, PROC NPAR1WAY provides tests using the raw data as scores. When the data are classified into two samples, tests are based on simple linear rank statistics. When the data are classified into more than two samples, tests are based on one-way ANOVA statistics. Both asymptotic and exact p-values are available for these tests.

PROC NPAR1WAY also calculates the following empirical distribution function (EDF) statistics: the Kolmogorov-Smirnov statistic, the Cramer-von Mises statistic, and, when the data are classified into only two samples, the Kuiper statistic. These statistics test whether the distribution of a variable is the same across different groups.

proc npar1way wilcoxon data=homes;
  class location;
  var price;
run;

N P A R 1 W A Y  P R O C E D U R E

Wilcoxon Scores (Rank Sums) for Variable PRICE
Classified by Variable LOCATION

                    Sum of    Expected     Std Dev       Mean
LOCATION    N       Scores    Under H0    Under H0      Score
Gville     10   100.500000       105.0  13.2237824    10.0500
County     10   109.500000       105.0  13.2237824    10.9500

Average Scores Were Used for Ties

Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)

S = 100.500   Z = -.302485   Prob > |Z| = 0.7623

T-Test Approx. Significance = 0.7656

Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 0.11580   DF = 1   Prob > CHISQ = 0.7336
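The numbers in this listing can be traced from the ranks. The following Python sketch (not SAS) assumes the standard tie-corrected variance for the rank sum, a continuity correction of 0.5 toward the expected value, and the fact that with two groups the Kruskal-Wallis chi-square equals the squared, uncorrected z. The ranks are copied from the RANKCOST listing.

```python
import math

# Average ranks copied from the RANKCOST listing.
gville = [4.5, 18.0, 10.0, 8.0, 11.0, 9.0, 2.0, 6.0, 19.0, 13.0]
county = [1.0, 15.0, 12.0, 14.0, 16.0, 3.0, 7.0, 4.5, 17.0, 20.0]

n1, n2 = len(gville), len(county)
N = n1 + n2
S = sum(gville)                  # rank sum for Gville
expected = n1 * (N + 1) / 2      # mean of S under H0

# Tie correction: sum of (t^3 - t) over each group of t tied ranks.
all_ranks = gville + county
tie_term = sum(all_ranks.count(r) ** 3 - all_ranks.count(r) for r in set(all_ranks))
variance = n1 * n2 / 12 * ((N + 1) - tie_term / (N * (N - 1)))
std_dev = math.sqrt(variance)

# Continuity correction of 0.5 moves S toward its expected value.
if S < expected:
    correction = 0.5
elif S > expected:
    correction = -0.5
else:
    correction = 0.0
z = (S + correction - expected) / std_dev

# With two groups, the Kruskal-Wallis chi-square is the squared, uncorrected z.
kw_chisq = ((S - expected) / std_dev) ** 2

print(S, expected, round(std_dev, 7), round(z, 6), round(kw_chisq, 5))
```

The small z reflects how close the Gville rank sum is to its expected value of 105, despite the very expensive County home.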


Other Rank Tests

When there are two factors with no interaction, as in a randomized complete block design, Friedman's chi-square test is a non-parametric test that can be used to examine treatment differences. Friedman's test is available in SAS under PROC FREQ, but it is fairly complicated to perform. PROC FREQ also offers versions of rank sum tests for ordinal data. See SAS documentation for details.


The LIFETEST procedure can be used with data that may be right-censored to compute nonparametric estimates of the survival distribution and to compute rank tests for association of the response variable with other variables. The survival estimates are computed within defined strata levels, and the rank tests are pooled over the strata and are therefore adjusted for strata differences.


Non-Parametric Correlation

The Pearson Product Moment correlation coefficient measures the strength of the tendency for two variables X and Y to follow a straight line.

Suppose we are more interested in measuring the tendency for X to increase or decrease with Y, without necessarily assuming a strictly linear relationship.

• Spearman correlation coefficient - Pearson correlation between ranks of X and ranks of Y.

• Kendall correlation coefficient - based on the probability of observing Y2 > Y1 when X2 > X1 (a concordant pair) compared with the probability of observing Y2 < Y1 (a discordant pair).

• Twelve students take a test, and we want to see if the students who finished the test early had higher scores than those who finished later.

• We don't know the exact time that each student spent on the test, but we do know the order in which the tests were turned in to be graded.

• With only rank data for the time variable, the Pearson linear correlation coefficient would not be appropriate. (The last person to turn in the test could have taken 30 minutes, an hour, or two hours, and the guess of the exact time would greatly influence the resulting Pearson correlation.)

• Both the Spearman and Kendall correlation coefficients could legitimately be used.

data students;
  input order grade @@;
  datalines;
1 90 2 74 3 76 4 60 5 68 6 86 7 92 8 60 9 78 10 70 11 78 12 64
;
run;

proc plot data=students;
  plot grade*order;
run;

proc corr data=students spearman kendall;
  var grade;
  with order;
run;

[Plot of GRADE*ORDER: GRADE (60 to 100) against ORDER (1 to 12). Legend: A = 1 obs, B = 2 obs, etc.]

Correlation Analysis

Spearman Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

            GRADE
ORDER    -0.17544
           0.5855

Kendall Tau b Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

            GRADE
ORDER    -0.12309
           0.5815
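Both coefficients can be reproduced from the twelve (order, grade) pairs. The sketch below is in Python rather than SAS: Spearman is computed as the Pearson correlation of average ranks, and Kendall's tau-b as (concordant - discordant) pairs divided by a tie-adjusted count of pairs. The helper names are hypothetical, and the p-values are not recomputed.

```python
import math

order = list(range(1, 13))
grade = [90, 74, 76, 60, 68, 86, 92, 60, 78, 70, 78, 64]

def average_ranks(values):
    """Ranks 1..n, with tied values sharing the mean of their ranks."""
    sorted_vals = sorted(values)
    # A value first found at 0-based position p with c copies gets rank p + (c + 1) / 2.
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def kendall_tau_b(x, y):
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[j] - x[i], y[j] - y[i]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n0 = n * (n - 1) / 2  # total number of pairs
    return (concordant - discordant) / math.sqrt((n0 - ties_x) * (n0 - ties_y))

spearman = pearson(average_ranks(order), average_ranks(grade))
tau_b = kendall_tau_b(order, grade)
print(round(spearman, 5), round(tau_b, 5))
```

Both coefficients are small and negative, which matches the listing: there is little evidence that grades depend on the order in which tests were turned in.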
