Statistical Techniques Notes and Formula Sheet, Exam 2
Contents:
a) Formula Sheet
b) ANOVA table
c) Z-Score table
d) Lab Notes
Lab 08: Chi-Square Analysis, Hypothesis Tests and Confidence Intervals for Proportions, Hypothesis Testing using T-Tests, Pooled and Normal
Lab 09: Column Manipulation, ANOVA, Tukey Test, LSD Test, PROC GLM, Levene Test
Lab 10: Scatterplot with Regression, Linear Regression, Further Analysis, Output Saving
Lab 11: Correlation between Variables, Data Manipulation, Model Fitting with Regression, Model Comparison
Formula Sheet
SUMMARY STATISTICS
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance: $V = s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
Standard deviation: $s = \sqrt{V}$
Median: the middle number (or the average of the two middle numbers) in an ordered list
Lower Quartile: $Q_1$; 25% of the data lies below it
Upper Quartile: $Q_3$; 75% of the data lies below it
Interquartile Range: $IQR = Q_3 - Q_1$
Min: the smallest value in the dataset
Max: the largest value in the dataset
Range: $\text{Range} = \text{Max} - \text{Min}$
Upper Whisker: $UW = Q_3 + 1.5 \cdot IQR$
Lower Whisker: $LW = Q_1 - 1.5 \cdot IQR$
Mode: the most frequently occurring value
Geometric mean: the $n$th root of the product of the values of the $n$ observations
Midrange: the mean of the smallest and largest observed values
Coefficient of Variation: $CV = \frac{s}{\bar{x}} \cdot 100\%$
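As a quick sanity check of these formulas, here is a short sketch in Python; the dataset is made up for illustration and is not part of the labs.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative sample
n = len(data)

mean = sum(data) / n                                  # sample mean
var = sum((x - mean) ** 2 for x in data) / (n - 1)    # sample variance, n-1 denominator
sd = var ** 0.5                                       # standard deviation

# cross-check the hand-rolled formulas against the standard library
assert abs(mean - statistics.mean(data)) < 1e-12
assert abs(var - statistics.variance(data)) < 1e-12

cv = sd / mean * 100                                  # coefficient of variation, in percent
print(mean, var, round(cv, 2))
```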
PROBABILITY
Frequentist probability: $P(A) = \frac{\#\text{times event } A \text{ occurs}}{\#\text{experiments conducted}}$
Probability of union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Probability of complement: $P(A^C) = 1 - P(A)$
Probability of intersection (independent): $P(A \cap B) = P(A) \cdot P(B)$
Probability of intersection (general): $P(A \cap B) = P(A|B) \cdot P(B)$
Disjoint events: $P(A \cap B) = 0$
Conditional probability: $P(A|B) = \frac{P(A \cap B)}{P(B)}$
DISTRIBUTIONS-RANDOM VARIABLES
Expected Value (Discrete): $\mu = E(X) = \sum_{i=1}^{n} x_i \cdot P(X = x_i)$
Variability (Discrete): $\sigma^2 = V(X) = \sum_{i=1}^{n}(x_i - E(X))^2 \cdot P(X = x_i)$
Standard Deviation: $\sigma = \sqrt{V(X)} = SD(X)$
Uniform Distribution, U (Discrete): $U = \{1, 2, 3, \dots, k\}$, $P(U = i) = \frac{1}{k}$, $\mu_U = \frac{k+1}{2}$, $\sigma_U^2 = \frac{k^2-1}{12}$
Bernoulli Distribution, Ber (Discrete): $Ber = \{S, F\}$, $P(S) = p$, $P(F) = 1 - p$
Binomial Distribution, B (Discrete): $P(k \text{ successes in } n \text{ trials}) = \binom{n}{k} p^k (1-p)^{n-k}$, $\mu_B = np$, $\sigma_B^2 = np(1-p)$
Poisson Distribution, P (Discrete): $P(k \text{ occurrences}) = \frac{\lambda^k e^{-\lambda}}{k!}$, $\mu_P = \lambda$, $\sigma_P^2 = \lambda$
Continuous Distribution: a random variable $X$ together with a function $f : X \to \mathbb{R}^+$ such that $\int_X f(x)\,dx = 1$ is a continuous probability distribution.
Normal Distribution (Continuous): $N(\mu, \sigma)$
Standard Normal Distribution (Continuous): $Z = N(0, 1)$
T-distribution (Continuous): $T$, $df = n - 1$
Chi-square distribution (Continuous): $\chi^2_k$, $k = df$
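The binomial and Poisson mass functions above can be written directly in Python; a small sketch with hypothetical parameters, for illustration only:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    # P(k successes in n trials) = C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(k occurrences) = lambda^k * e^(-lambda) / k!
    return lam**k * exp(-lam) / factorial(k)

# sanity checks: the binomial pmf sums to 1 over k = 0..n,
# and its mean matches mu_B = n*p
n, p = 10, 0.3
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
print(round(mean, 6))
```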
HYPOTHESIS TESTING-CONFIDENCE INTERVALS
Central Limit Theorem: $\bar{X} \sim N\!\left(\mu,\; SE = \frac{\sigma}{\sqrt{n}}\right)$
Standard Error: $SE = \frac{\sigma}{\sqrt{n}}$
Confidence Interval (Z): point estimate $\pm\, z^{*} \times SE$
Confidence Interval (T): point estimate $\pm\, t^{*} \times SE$
Maximum Error: $ME = z^{*} \times SE$ or $ME = t^{*} \times SE$
One-sided tests: if the alternative is $>$, use the probability "above" the value; if the alternative is $<$, use the probability "below" the value; if the alternative is $\neq$, use twice the probability above the value.
Alpha: $\alpha = 0.05$ unless specified otherwise
Z-score: $Z = \frac{\text{point estimate} - \text{null value}}{SE}$
T-score: $T_{df} = \frac{\text{point estimate} - \text{null value}}{SE}$, $df = n - 1$
Chi-square score: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
ERRORS-POWER
Type 1 error: reject $H_0$ when $H_0$ is true; probability $\alpha$
Type 2 error: fail to reject $H_0$ when $H_0$ is false; probability $\beta$
Power: probability of correctly rejecting $H_0$; $\text{Power} = 1 - \beta$
Basic Z-scores (2-sided):
90%: $z^* = 1.645$
95%: $z^* = 1.96$
98%: $z^* = 2.326$
99%: $z^* = 2.576$
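The CI recipe above is mechanical; here is a minimal Python sketch, assuming hypothetical summary statistics (mean 3.7, sd 0.76, n = 82, chosen only for illustration):

```python
from math import sqrt

# hypothetical sample summary (illustrative numbers only)
xbar, s, n = 3.7, 0.76, 82

se = s / sqrt(n)          # standard error
z_star = 1.96             # two-sided 95% critical value
me = z_star * se          # maximum error
ci = (xbar - me, xbar + me)
print(round(se, 4), tuple(round(v, 4) for v in ci))
```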
Difference in two means (independent)
T statistic: $T_{df} = \frac{\text{point estimate} - \text{null value}}{SE}$
Standard Error: $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Degrees of freedom: $df = \min(n_1 - 1,\, n_2 - 1)$
CI: point estimate $\pm\, ME$, $ME = t^{*} \times SE$
Difference in two means (pooled)
Pooled Variance Estimate: $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$
Pooled SE: $SE = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
Degrees of freedom (pooled): $df_p = df_1 + df_2 = n_1 + n_2 - 2$
CI: point estimate $\pm\, ME$, $ME = t^{*} \times SE$
Difference in two proportions
Z statistic: $Z = \frac{\text{point estimate} - \text{null value}}{SE}$
Standard Error: $SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
CI: point estimate $\pm\, ME$, $ME = z^{*} \times SE$
Inferences on Variances
F distribution: $F(d_1, d_2)$, with density $f(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1}\, d_2^{d_2}}{(d_1 x + d_2)^{d_1+d_2}}}}{x \, B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$
F as a ratio of chi-squares: $s_1^2 \sim \chi^2(d_1)$, $s_2^2 \sim \chi^2(d_2) \;\Rightarrow\; F = \frac{s_1^2}{s_2^2} \sim F(d_1, d_2)$
ANOVA
Sum of Squares: $SS = \sum_i (x_i - \bar{x})^2$
Pooled Variance: $s_p^2 = \frac{SS_p}{df_p} = \frac{\sum_{i=1}^{k} SS_i}{\sum_{i=1}^{k} df_i}$
Bonferroni’s Alpha
M correction: $M = \binom{m}{2} = \frac{m!}{2!\,(m-2)!}$
Bonferroni alpha: $\alpha_{Bon} = \frac{\alpha}{M}$
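The Bonferroni correction is quick to illustrate in Python; the choice of m = 4 groups here is arbitrary:

```python
from math import comb

# Bonferroni correction for all pairwise comparisons among m groups
m = 4                     # number of groups (illustrative choice)
M = comb(m, 2)            # number of pairwise comparisons, C(m, 2)
alpha = 0.05
alpha_bon = alpha / M     # per-comparison significance level
print(M, alpha_bon)
```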
Linear Regression
Average Formula: $\mu(y|x) = \beta_0 + \beta_1 x$
Prediction Formula: $y = \beta_0 + \beta_1 x + \varepsilon$
SSE (Residuals): $SSE = \sum (y - \mu(y|x))^2$
Root Mean Square Error (RMSE): $RMSE = \sqrt{\frac{SSE}{df_{reg}}}$
Slope: $\beta_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
Intercept: $\beta_0 = \bar{y} - \beta_1 \bar{x}$
Correlation Coefficient
r formula: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \cdot \sum (y - \bar{y})^2}}$
r with SS: $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$
r and F: $F = \frac{(n-2)\, r^2}{1 - r^2}$
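The regression and correlation formulas above can be tied together on a tiny dataset; a Python sketch (the x and y values are made up for illustration):

```python
# least-squares slope/intercept and correlation on a small made-up dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
ss_xx = sum((x - xbar) ** 2 for x in xs)
ss_yy = sum((y - ybar) ** 2 for y in ys)

b1 = ss_xy / ss_xx                    # slope beta_1
b0 = ybar - b1 * xbar                 # intercept beta_0
r = ss_xy / (ss_xx * ss_yy) ** 0.5    # correlation coefficient
F = (n - 2) * r**2 / (1 - r**2)       # F statistic from r

print(round(b1, 2), round(b0, 2), round(r, 4))
```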
1
Source  | DF                       | Sum of Squares                              | Mean Square      | F Value       | Pr > F
Between | $DF_B = \#groups - 1$    | $SSB = \sum_i n_i (\bar{x}_i - \bar{x})^2$  | $MSB = SSB/DF_B$ | $F = MSB/MSW$ | p value
Within  | $DF_W = n - \#groups$    | $SSW = SST - SSB$                           | $MSW = SSW/DF_W$ |               |
Total   | $DF_T = n - 1$           | $SST = \sum_{i} (x_i - \bar{x})^2$          |                  |               |
Standard Normal Probabilities
Table entry for z is the area under the standard normal curve to the left of z.
[Z-score table omitted: left-tail areas for z from -3.4 to 3.4, with rows in steps of 0.1 and columns .00 through .09.]
Lab #8: Chi-Square-Confidence Intervals-Hypothesis Testing (3)
Objectives:
1. Chi-Square Analysis examples on Contingency Tables
2. Hypothesis Test and Confidence Intervals for Proportions
3. Hypothesis testing using T-Tests, Pooled and Normal
Chi-Square Analysis
Suppose that we want to analyze the following contingency table:
In order to import a contingency table in SAS we need to create a new dataset with the two categorical
variables and all the possible cases. We also need to define a dummy variable to hold all the numerical
entries. We call that variable counts. To create the dataset we use the code:
DATA Vaccines;
INPUT Autism $ Vaccinated $ count;
datalines;
Yes Yes 440
Yes No 184
No Yes 1240
No No 584
;
Notice the use of $ after the variables Autism and Vaccinated. Those are there to remind SAS that these
are categorical variables. Count, being a numerical variable, does not need a $ after it.
This has created and stored our dataset. In order for us to turn it into a contingency table and do the chi-square analysis we need to use the procedure FREQ as follows:
TITLE ‘Contingency table and chi square analysis’;
PROC FREQ data=Vaccines ORDER=DATA;
tables Autism*Vaccinated/ chisq expected norow nocol nopercent;
weight count;
RUN;
The command: tables Autism*Vaccinated instructs SAS to create a contingency table using the two
variables Autism and Vaccinated. The rest of this command works as follows:
chisq : do the chi square analysis
expected: compute and display the expected values for each cell
norow: do not show the row marginal probabilities
nocol: do not show the column marginal probabilities
nopercent: do not show general probabilities
The command ORDER=DATA tells the procedure freq to present the data in the order we put them in (so
no flipping the YES and NO anymore!).
The output is something like this:
Frequency
Expected
Table of Autism by Vaccinated

Autism    Vaccinated
          no        yes       Total
no        584       1240      1824
          572.24    1251.8
yes       184       440       624
          195.76    428.24
Total     768       1680      2448
And the statistics table is:
Statistic DF Value Prob
Chi-Square 1 1.3827 0.2396
Likelihood Ratio Chi-Square 1 1.3931 0.2379
Continuity Adj. Chi-Square 1 1.2676 0.2602
Mantel-Haenszel Chi-Square 1 1.3821 0.2397
Statistic DF Value Prob
Phi Coefficient 0.0238
Contingency Coefficient 0.0238
Cramer's V 0.0238
Once again, out of all the tests we focus on the first one. Thus we say that the p value is 0.2396, which is much larger than 0.05, so we fail to reject the null hypothesis. This dataset does not provide enough evidence to support the claim that Autism and Vaccines are dependent.
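The chi-square statistic SAS reports can be reproduced by hand from the observed counts; here is a quick Python sketch of that computation (an illustration, not part of the lab code):

```python
# observed 2x2 counts from the Vaccines dataset
# rows: Autism no/yes, columns: Vaccinated no/yes
obs = [[584, 1240],
       [184, 440]]

row_tot = [sum(row) for row in obs]
col_tot = [sum(col) for col in zip(*obs)]
n = sum(row_tot)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n   # E = (row total * column total) / n
        chi2 += (obs[i][j] - expected) ** 2 / expected

print(round(chi2, 4))   # matches the SAS output: 1.3827
```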
Hypothesis Testing and confidence intervals for proportions
In the following section we will explore with sas the idea of a hypothesis testing with Z-scores for
proportions.
It is estimated that 8% of the world population is colorblind. Pingelap is a South Pacific island with a strange nickname: it is called "the colorblind island", since 26.4% of its population, 66 people, is colorblind and almost everybody has some sort of color deficiency. Let's test whether the two proportions are truly different, given that the island has 250 people.
First we need to create a dataset in sas. Let’s call it Color and have two variables one for location and the
other for states of colorblindness.
DATA Color;
INPUT Location $ Colorblindness $ count;
datalines;
World Yes 8
World No 92
Pingelap Yes 26.4
Pingelap No 73.6
;
This has created our dataset. Now to view it and do the z-test comparison of the ratios we need to use
the following code:
TITLE "Z score for proportions test and 95% confidence ";
PROC FREQ data=Color ORDER=DATA;
tables Location*Colorblindness/ riskdiff(equal var=null cl=wald) norow nocol
nopercent alpha=0.05;
weight count;
RUN;
Once again the procedure frequency will output the table in a nice 2 by 2 format and once more we have
norow: do not show the row marginal probabilities
nocol: do not show the column marginal probabilities
nopercent: do not show general probabilities
The z-test output is produced by the option riskdiff. In the SAS documentation for this option we learn that binomial proportions are called "risks," so a "risk difference" is a difference in proportions.
The option equal var=null requests the test of equality of the two proportions with the standard error computed under the null hypothesis, and cl=wald instructs SAS to use the Wald method to compute the confidence limits, which uses the standard-error formula we learned in class, i.e.
$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
You can learn more about the procedure freq here:
http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_freq_s
yntax08.htm
As you can see, the 95% confidence interval for the difference of the two proportions is computed as well. Its level is controlled by the option alpha=0.05.
Let’s change that to 0.1 and see what happens with the code:
TITLE "Z score for proportions test and 90% confidence ";
PROC FREQ data=Color ORDER=DATA;
tables Location*Colorblindness/ riskdiff(equal var=null cl=wald) norow nocol
nopercent alpha=0.1;
weight count;
RUN;
Obviously there is no change in the proportions difference test and we get:
Proportion (Risk) Difference Test
H0: P1 - P2 = 0 Wald Method
Proportion Difference -0.1840
ASE (H0) 0.0534
Z -3.4477
One-sided Pr < Z 0.0003
Two-sided Pr > |Z| 0.0006
Column 1 (Colorblindness = Yes)
What changes is the confidence limits, which now give us:
Confidence Limits for the
Proportion (Risk) Difference
Proportion Difference = -0.1840
Type 90% Confidence Limits
Wald -0.2691 -0.0989
Column 1 (Colorblindness = Yes)
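The test statistic and Wald limits can be checked by hand from the formulas. As entered, the percentages act as counts out of 100, which is what SAS's computation sees; a Python sketch (small rounding differences from the SAS output are possible):

```python
from math import sqrt

# proportions from the Color dataset; the percentages were entered as
# counts, so each group effectively has n = 100
p1, n1 = 0.08, 100     # world
p2, n2 = 0.264, 100    # Pingelap

diff = p1 - p2

# standard error under H0 (equal var=null): pooled proportion
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se_null = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = diff / se_null

# Wald standard error for the confidence limits
se_wald = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci90 = (diff - 1.645 * se_wald, diff + 1.645 * se_wald)

# z is close to SAS's -3.4477; the 90% limits match (-0.2691, -0.0989)
print(round(z, 4), tuple(round(v, 4) for v in ci90))
```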
Hypothesis testing using T-Tests, Pooled and Normal
As you recall, the PROC IMPORT statement is the best way to bring in external datasets. The CSV file we will be using is called “Qs3small.csv”. Download and save it in your favorite folder and note the complete path to it. Then use the following code to import it, making sure you put the correct path in the DATAFILE argument. Save the dataset as Qs3.
PROC IMPORT OUT= Qs3
DATAFILE= "put the complete path here…\Qs3small.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
This dataset contains some of the results from questionnaire 3. Let’s try to see if there is any difference in the reported understanding of the lectures based on gender. To keep things tidy, let’s first sort our dataset by gender using the code:
PROC SORT Data=Qs3;
by Gender;
RUN;
And then let’s print our dataset with proc print.
PROC PRINT Data=Qs3;
RUN;
Let’s also do some basic summary statistics on the average understanding of the whole population using
proc univariate as follows:
TITLE "Summary statistics for Lectures, all students";
PROC UNIVARIATE Data=Qs3 plot;
var Lectures;
RUN;
We notice that the total mean is 3.6951 and the standard deviation is 0.757. We proceed to compute the
summary statistics based on different genders with the slightly modified proc univariate
TITLE "Summary statistics for Lectures, by gender ";
PROC UNIVARIATE Data=Qs3 plot;
var Lectures;
by Gender;
RUN;
Notice that instead of getting 2 categories (M, F) we got 3! Fear not: one corresponds to a single observation that was missing the gender entry, so it became a group of its own.
Gender=F
Moments
N 18 Sum Weights 18
Mean 3.61111111 Sum Observations 65
Std Deviation 0.63142126 Variance 0.39869281
Skewness 0.51682132 Kurtosis -0.6481221
Uncorrected SS 241.5 Corrected SS 6.77777778
Coeff Variation 17.4855119 Std Error Mean 0.14882742
Gender=M
Moments
N 22 Sum Weights 22
Mean 3.75 Sum Observations 82.5
Std Deviation 0.86945522 Variance 0.75595238
Skewness -0.5080564 Kurtosis 0.6958176
Uncorrected SS 325.25 Corrected SS 15.875
Coeff Variation 23.1854726 Std Error Mean 0.18536848
As we notice, we need to compare the two means, but we have different numbers of observations and different standard deviations. Fear not! SAS will do the computation for us immediately.
But first we need to take care of that missing value, otherwise SAS will not know which groups to compare.
To do that we will create a new set called Qs3clean, by removing that observation without a gender entry
as follows:
DATA Qs3Clean;
SET Qs3;
if Gender = ' ' then delete;
RUN;
As you can see, the new dataset Qs3Clean is created from the original set Qs3 using the if statement, which says: if the variable Gender is empty, then delete that observation (very logical!)
Let’s see what this new dataset looks like with proc print.
TITLE "Questionnaire 3 without missing Gender";
PROC PRINT Data=Qs3Clean;
RUN;
So now we can continue with the t-test analysis, which is basically the following simple code:
TITLE "T-test comparison between Lectures for different genders";
ods graphics on;
PROC TTEST Data=Qs3Clean cochran ci=equal umpu;
class Gender;
var Lectures;
RUN;
ods graphics off;
Notice that we need to define the grouping variable for the comparison as class and the numerical variable as var (as usual).
Also, I turned the graphics on, in case I had turned them off at some point, to get all the graphs produced.
Finally, the options cochran ci=equal umpu instruct SAS to include the Cochran approximation of the t-test (another, more “data-driven” method of computation); ci=equal tells SAS to create an equal-tailed confidence interval (the default), and umpu specifies an interval based on the uniformly most powerful unbiased test, i.e. using the “best” variance.
There are various different outputs popping up. The following table contains the summary statistics for
the two genders and their computed difference
Gender N Mean Std Dev Std Err Minimum Maximum
F 18 3.6111 0.6314 0.1488 3.0000 5.0000
M 22 3.7500 0.8695 0.1854 1.5000 5.0000
Diff (1-2) -0.1389 0.7721 0.2454
Then we get 95% confidence intervals for F, M and the difference, whether we use pooled standard errors or not (Satterthwaite method):
Gender Method Mean 95% CL Mean Std Dev 95% CL Std Dev 95% UMPU CL Std Dev
F 3.6111 3.2971 3.9251 0.6314 0.4738 0.9466 0.4654 0.9257
M 3.7500 3.3645 4.1355 0.8695 0.6689 1.2425 0.6593 1.2206
Diff (1-2) Pooled -0.1389 -0.6356 0.3579 0.7721 0.6310 0.9951 0.6258 0.9856
Diff (1-2) Satterthwaite -0.1389 -0.6203 0.3425
Finally we get the corresponding p values for a two sided ttest comparison and again we get all cases
including a new one:
Method Variances DF t Value Pr > |t|
Pooled Equal 38 -0.57 0.5747
Satterthwaite Unequal 37.534 -0.58 0.5625
Cochran Unequal . -0.58 0.5658
The pooled one is the one we did in class, and Satterthwaite is the standard one for unequal variances. Cochran is another method of computing p-values, using a completely different approximation that estimates variances and standard errors from the data.
We will focus on the first two depending on what the problem calls for. In this case though any of the
three is ok. And even if we did a one-sided hypothesis (halving all of them) the p values would be much
bigger than 0.05, so we fail to reject the null hypothesis. In other words the dataset does not provide
enough evidence to show a difference in understanding of the Lecture between genders.
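The pooled results can be reproduced from the group summary statistics in the tables above; a quick Python sketch of the formulas from the sheet:

```python
from math import sqrt

# summary statistics from the TTEST output above
n1, m1, s1 = 18, 3.61111111, 0.63142126   # F
n2, m2, s2 = 22, 3.75,       0.86945522   # M

# pooled variance, pooled standard error, and t statistic
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
t = (m1 - m2) / se
df = n1 + n2 - 2

# df = 38, s_p = 0.7721, SE = 0.2454, t = -0.57, matching the SAS tables
print(df, round(sqrt(sp2), 4), round(se, 4), round(t, 2))
```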
Finally the table
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 21 17 1.90 0.1847
contains the F value and the corresponding p value for the comparison of the two variances. Notice that
we also get the histograms for free with the QQ-plots in case we are wondering if the results follow a
normal distribution.
To learn more about the ttest procedure follow this link:
https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_ttest_
sect002.htm
Lab #9: ANOVA and TUKEY tests
Objectives:
1. Column manipulation in SAS
2. Analysis of variance
3. Tukey test
4. Least Significant Difference test
5. Analysis of variance with PROC GLM
6. Levene test for Homoscedasticity
Column Manipulation
Analysis of variance is used when 3 or more different groups of observations need to be compared with
each other. The best scenario is for example comparing 3 or more different treatments on a population,
or 2 or more treatments and a control, in an experiment, where the number of samples is the same in
each group and independence within and between is controlled.
One such example is the dataset Diets.csv which you can find on our MOODLE page and contains the
results of weight loss for 75 individuals that followed 3 different programs. The question for the researcher
is first, are the outputs of the programs different? If yes which one seems to be better?
First let’s import our dataset by using the by now well-known proc import command as follows:
PROC IMPORT OUT= Diets
DATAFILE= "put the complete path here…\Diets.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
Using proc print we can view the dataset and notice that it contains the observation number, the gender (0 female, 1 male), the age, the height and the diet number (1, 2, 3). It also contains 2 columns, pre_weight and post_weight, measured in kilos.
PROC PRINT DATA = Diets;
RUN;
What we would like to have is the difference in weight before and after the diet, so we can compare those averages among the three diets. Thus we will create a new dataset, called DietsClean, that has the difference in weight and just the type of diet the person followed, disregarding all other information. To do that we will use the same subsetting tools we learned in the past, but this time we will create a new column by doing arithmetic on variables, namely:
DATA DietsClean;
SET Diets;
WeightDif = pre_weight-post_weight;
KEEP Diet WeightDif;
RUN;
Notice that we created a new set by invoking the command DATA followed by the name of the new
dataset, in this case DietsClean. The command SET Diets instructs SAS to use the dataset Diets to create
the new columns. The line WeightDif = pre_weight-post_weight creates a new variable (column) called
WeightDif by subtracting the variable post_weight from the variable pre_weight. It is neat that SAS
understands column-wise operations! Finally with the command KEEP Diet WeightDif we instruct SAS to
keep only those two variables (columns) in the new dataset we created and discard the rest.
Go ahead and use proc print to see what the new dataset looks like.
PROC PRINT DATA = DietsClean;
RUN;
Before we do ANOVA, let’s explore our dataset, and do summary statistics for the different diets, including
the boxplots. PROC UNIVARIATE will do the trick once again. Always in these types of analyses start with
some exploratory statistics. It helps justify the need for ANOVA, regression or other “heavy machinery”!
ods graphics on;
TITLE "Statistics By Diet";
PROC UNIVARIATE Data=DietsClean plots;
var WeightDif;
by Diet;
RUN;
As you recall, the command by Diet will present the summary statistics, of the variable of choice (var
WeightDif) for the three different diets. Adding the command plots creates all the histograms and
boxplots, including the side by side boxplot at the end. Already from that plot we can be sure that there
is a difference between the diets and probably the last one is the best. But we will analyze it further to be
absolutely sure. Just to be on the safe side I added the global command ods graphics on, in case you
turned it off earlier on since we do need to see the output plots.
ANALYSIS OF VARIANCE
Once again, remember that our null and alternative hypotheses are
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_A:$ at least one of these equalities is not true
Notice that all three groups have the same amount of observations (25). This is called a “balanced”
experiment, and it is preferable. ANOVA WILL work on “unbalanced groups” just fine but it is easier to
showcase the validity of it if the groups have the same amount of observations. As a matter of fact, the
ANOVA procedure that we will use works better if you have the same amount of observations, and the
SAS programmers suggest using a different procedure, PROC GLM, when dealing with unbalanced
experiments. We‘ll get to that soon. For now the following code will do the trick:
PROC ANOVA Data=DietsClean;
class Diet;
model WeightDif=Diet;
RUN;
As you can see a very easy code, and it is also easy to explain. First of all we are using the procedure
ANOVA on the dataset DATA=DietsClean. Then we need to use the command class which provides the
variable (column) that contains the different groups. In this case we have the 3 diets (1, 2, 3) found in the
column Diet, hence class Diet is used. Then SAS wants to know what variable to analyze based on what
class, basically how to create the analysis model. The syntax for that is model “the variable of interest” =
“the variable of different classes”.
A couple of tables pop up after the code is implemented. We have the general description:
Statistics By Diet
The ANOVA Procedure
Class Level Information
Class Levels Values
Diet 3 1 2 3
Number of Observations Read 75
Number of Observations Used 75
SAS informs us that it used all the observations (75 out of 75) and that it sees 3 levels (groups) with values
1,2,3 respectively in the column that contains the groups (in this case Diet).
This is followed by the table we analyzed in class, which is the main table you should report when doing
an analysis of variance, namely:
Statistics By Diet
The ANOVA Procedure
Dependent Variable: WeightDif
Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 52.6344000 26.3172000 4.60 0.0131
Error 72 411.6624000 5.7175333
Corrected Total 74 464.2968000
If we focus on the top right cell, we see that our p-value is 0.0131 which is smaller than the default alpha
0.05. So we reject the null hypothesis and claim that at least one of the diets has a different average
weight loss than the rest.
We also get the boxplot of the distributions with the F value and the p value as a nice plot to put in our
posters!
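The mean-square and F entries of the ANOVA table follow directly from the DF and Sum of Squares columns; a quick Python check of the table above:

```python
# reconstruct the F statistic from the ANOVA table above
ssb, dfb = 52.6344, 2       # Model (between groups)
ssw, dfw = 411.6624, 72     # Error (within groups)

msb = ssb / dfb             # mean square between: SSB / DFB
msw = ssw / dfw             # mean square within:  SSW / DFW
f = msb / msw               # F = MSB / MSW, about 4.60 as in the table

print(round(msb, 4), round(msw, 7), round(f, 2))
```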
Tukey Test
Let’s proceed now with the post-ANOVA analysis and start with pairwise “adjusted” comparisons, namely
the Tukey test. This can be achieved by extending the ANOVA code and adding the means computation,
namely:
PROC ANOVA Data=DietsClean;
class Diet;
model WeightDif=Diet;
means Diet/ tukey alpha=0.05;
RUN;
The line means Diet/ tukey instructs SAS to compare the means of the groups found in the variable Diet
using the tukey pairwise comparison test we learned in lecture. The confidence level is indicated by the
parameter alpha=0.05.
What we obtain is a “grouping” of means that look the same, vs groupings of means that look different.
In the form of the following table:
Means with the same letter
are not significantly different.
Tukey Grouping Mean N Diet
A 4.9080 25 3
B 3.1680 25 1
B
B 3.0960 25 2
This implies that the average weight-loss for diets 1 and 2 is similar, therefore we group them together
under the arbitrary “B” group symbol. On the other hand, Diet 3 seems different from the other two so it
creates its own grouping, again arbitrarily given the letter “A”.
Least Significant Difference test
Another interesting pairwise comparison we can do is that of the Least Significant Difference (LSD), or the
Fisher procedure as it is known. Remember that this method again performs a T-test comparison, between
the differences of all possible pairs of means using the following formula for the standard error:
$SE = \sqrt{\frac{2 \cdot MSW}{m}}$
where MSW is the mean square error within groups and m is the number of observations in each group.
Then instead of outputting the p-values it outputs a symbol to indicate significant differences between
the groups and a confidence interval around the pairwise difference between the means. Once again we
need to augment the ANOVA code a bit to get:
PROC ANOVA Data=DietsClean;
class Diet;
model WeightDif=Diet;
means Diet/ lsd cldiff alpha=0.05;
RUN;
Once again, the line means Diet /lsd cldiff alpha=0.05 instructs SAS to pairwise compare the means for
each group found in the variable Diet by using the LSD test. Cldiff will output the confidence interval
differences and alpha=0.05 is the user defined alpha level.
The output table looks like this:
Comparisons significant at the 0.05 level
are indicated by ***.
Diet Comparison | Difference Between Means | Simultaneous 95% Confidence Limits
3 - 1 1.7400 0.1215 3.3585 ***
3 - 2 1.8120 0.1935 3.4305 ***
1 - 3 -1.7400 -3.3585 -0.1215 ***
1 - 2 0.0720 -1.5465 1.6905
2 - 3 -1.8120 -3.4305 -0.1935 ***
2 - 1 -0.0720 -1.6905 1.5465
As you see all comparisons happen twice, but it is good to have both, depending on which Diet we want
to put first or second in our pairwise comparisons. Again we can see that the comparison between 1 and
2 does not yield a significant difference, so once again diets 1 and 2 are almost the same. Diet 3 on the
other hand creates significant differences from both 1 and 2, as we can see in the highlighted part of the
table.
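The LSD standard error for this example can be computed from the MSW of the earlier ANOVA table; a quick Python sketch (note the confidence limits in the table additionally involve a t critical value on top of this SE):

```python
from math import sqrt

# LSD standard error for the Diets example:
# MSW from the ANOVA table, m = 25 observations per group (balanced design)
msw = 5.7175333
m = 25
se = sqrt(2 * msw / m)   # SE = sqrt(2 * MSW / m)
print(round(se, 4))
```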
Analysis of Variance with PROC GLM
So far we have been working with a balanced experiment, where all the subgroups have the same number of samples. Suppose though that we want to use unbalanced groups. For example, let’s explore the dataset Tomato.csv found on MOODLE. In this dataset the researchers have catalogued 3 versions of the same tomato breed, M82. Besides the regular one, they have created two strains, A and B, by selective breeding with other varieties, and they have catalogued various characteristics of the roots and plants after a few weeks of cultivation.
Our question is to check if there is a difference in the average length of the roots among the three varieties
which we will answer by doing an unbalanced one way ANOVA on the last column titled “length”.
Let’s import that dataset
PROC IMPORT OUT= Tomato
DATAFILE= "put the complete path here…\Tomato.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
Using proc print we can see that the dataset contains 42 (!?!) observations. The main variables of interest
are Genotype and Length. We could create a smaller dataset that contains only these 2 but we will
proceed with all of them for now.
PROC PRINT DATA = Tomato;
RUN;
Before we analyze our data, we should sort it by genotype using the code:
PROC SORT DATA = Tomato;
BY Genotype;
RUN;
Let us try some summary statistics first by the different genotypes using the code:
ods graphics on;
TITLE "Statistics By Genotype";
PROC UNIVARIATE Data=Tomato plots;
var Length;
by Genotype;
RUN;
One of the things we learn is that the three groups M82, A and B have different numbers of observations (13, 14 and 15 respectively). This could be due to errors in data entry, failures in crops, failures in saving the data, and so on. In this case the experimenter could either
a) Make all groups equal by removing the appropriate amount of observations at random from the
groups with extra observations.
b) Utilize an unbalanced one way ANOVA through PROC GLM.
Here we will present how the second procedure is implemented, including the LSD test. As a matter of fact,
the code is VERY similar to PROC ANOVA; the main difference is that the procedure used is PROC GLM, as follows.
Title "Unbalanced ANOVA for Tomato varieties";
PROC GLM Data=Tomato;
class Genotype;
model Length=Genotype;
means Genotype/ lsd cldiff alpha=0.05;
RUN;
We basically use PROC GLM (general linear model) instead of PROC ANOVA. The GLM procedure is very
useful and can be thought of as a model-exploration tool which includes ANOVA as a special case. More
on that later.
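Outside SAS, the same unbalanced one-way ANOVA can be sketched in Python: scipy's `f_oneway` accepts groups of different sizes, just like PROC GLM handles unbalanced designs. The numbers below are hypothetical, not the real Tomato lengths:

```python
from scipy import stats

# Hypothetical root lengths; the groups deliberately have DIFFERENT sizes.
m82    = [120.0, 135.0, 128.0]
line_a = [150.0, 142.0, 155.0, 149.0]
line_b = [110.0, 98.0, 105.0, 101.0, 99.0]

f_val, p_val = stats.f_oneway(m82, line_a, line_b)
print(f"F = {f_val:.2f}, p = {p_val:.4f}")
# A large p (as in the lab's Tomato analysis) would mean: fail to reject H0.
```

Here the made-up groups differ a lot, so the p value comes out small; with the Tomato data the lab finds the opposite.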
Let’s explore the tables now
Class Level Information
Class Levels Values
Genotype 3 Line A Line B M82
Number of Observations Read 42
Number of Observations Used 42
This table gives us general information about our model. We have three different groups, Line A, Line B and
M82, and we used all 42 observations in the analysis. The familiar table:
Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 25918.388 12959.194 0.45 0.6425
Error 39 1129472.088 28960.823
Corrected Total 41 1155390.476
follows, which contains the p value. Since that is large, we fail to reject the null hypothesis and conclude
that there does not seem to be a difference between the average root lengths in the three varieties.
The pairwise comparisons using Fisher's LSD test also give no significant pairs (no *** next to any
comparison), as we can see here:
Comparisons significant at the 0.05 level are indicated by ***.
Genotype Comparison    Difference Between Means    95% Confidence Limits
Line A - M82            25.54                      -104.90   155.97
Line A - Line B         59.71                       -68.20   187.63
M82 - Line A           -25.54                      -155.97   104.90
M82 - Line B            34.18                       -98.41   166.76
Line B - Line A        -59.71                      -187.63    68.20
Line B - M82           -34.18                      -166.76    98.41
Levene test for Homoscedasticity:
To test if the variances for the different groups are the same or statistically different, we can again use
proc glm to produce the Levene test. The code is similar to before, namely:
Title "Anova with Levene test for Tomato varieties";
PROC GLM Data=Tomato;
class Genotype;
model Length=Genotype;
means Genotype/ HOVTEST alpha=0.05;
RUN;
What was added is the option HOVTEST after the means statement, which instructs SAS to run a Levene test
after finishing the ANOVA computations.
After getting again all the information about the ANOVA tables as before we also get the table:
Levene's Test for Homogeneity of Length Variance
ANOVA of Squared Deviations from Group Means
Source DF Sum of Squares Mean Square F Value Pr > F
Genotype 2 1.198E10 5.9899E9 3.02 0.0605
Error 39 7.745E10 1.9859E9
As we can see the p value is slightly larger than 0.05, so we fail to reject the null hypothesis and
conclude (marginally) that the variances amongst the three groups are probably the same.
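Levene's test can also be sketched in Python with scipy (an assumption of these notes, not part of the lab). One caveat: scipy's test uses absolute deviations from the group centers, while SAS's default Levene test uses squared deviations, so the statistics will not match SAS exactly. The data below are made up, with the third group given a visibly larger spread:

```python
from scipy import stats

# Hypothetical groups; g3 has a much larger spread than g1 and g2.
g1 = [5.0, 6.1, 5.5, 6.4, 5.8]
g2 = [5.2, 6.0, 5.6, 6.2, 5.9]
g3 = [2.0, 9.5, 4.1, 8.8, 3.0]

# center="mean" is closest in spirit to SAS's HOVTEST=LEVENE (which deviates
# from group means); scipy's default center="median" is the Brown-Forsythe variant.
w, p = stats.levene(g1, g2, g3, center="mean")
print(f"W = {w:.3f}, p = {p:.4f}")
# p < 0.05 rejects equal variances; p > 0.05 (as in the lab) does not.
```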
Lab # 10: Regression Analysis
Objectives:
1. Scatterplot with Regression Line
2. Linear Regression
3. Saving outputs and further analysis
Scatterplot
As mentioned in the lecture, the first thing one should try when
analyzing two numerical variables, in order to identify some correlation
between them, is a simple scatterplot. We mentioned this process in Lab
2, but now is a good time to review and expand on the ways of creating
scatterplots.
The example we will use today is the dataset Iris.csv which you can find
on our MOODLE page and contains the measurements of the sepal
length and width of the beautiful wildflower Iris Setosa which you can
see on the left. This is a circum-arctic wildflower found in Alaska,
Northern USA, Russia, Northeastern Asia, China, Korea and Japan and it
is considered an endangered species (critically threatened).
First let’s import our dataset by using the by now well-known proc import command as follows:
PROC IMPORT OUT= Iris
DATAFILE= "put the complete path here…\Iris.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
Using proc print we can view the dataset and notice that it contains the length and the width of the sepals
measured in centimeters.
PROC PRINT DATA = Iris;
RUN;
Recall, that creating a scatterplot can be done with the PROC GPLOT which we learned in lab 2 as follows:
SYMBOL V=circle C=Blue;
Title "Scatterplot Width vs Length";
PROC GPLOT Data=Iris;
Plot Width*Length;
RUN;
The SYMBOL command instructs SAS to use blue circles to represent the points. The line Plot
Width*Length says that the Width variable will be represented on the y axis and the variable Length on
the x axis.
So Gplot creates beautiful graphs and it is very versatile, but the best way to create scatterplots is using
the procedure SGSCATTER. This procedure also fits and outputs various lines to our scatterplot including
a linear regression. The code is as follows:
Title "Scatterplot with regression line Width vs Length";
PROC SGSCATTER Data=Iris;
Plot Width*Length/ Reg= (nogroup degree=1) grid;
RUN;
The command responsible for creating the
regression is the: Reg = (nogroup degree=1).
This instructs SAS to add a regression line of
degree 1 (a straight line) and to not group the
variables using some other variable. If for
example you had two groups of observations (2
different species) SAS would fit a regression line
to each group and present all of them in a nice
graph. The command grid creates a grid of lines
on the graph so that you can easily see the
values of the points.
Linear Regression (Simple)
We will now try to create a simple linear regression model and analyze all the results. The procedure here
is PROC REG used as follows:
Title "Simple Linear Regression";
PROC REG Data=Iris;
Model Width=Length;
RUN;
Just like in PROC GLM, the connection between the variables is denoted by the line
Model Width=Length; If no other variables are present, SAS will automatically try a linear regression
between the two variables presented in the model. Let's now try to explain all the outputs, starting with the
analysis of variance table:
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   3.34457          3.34457       44.57     <.0001
Error             46   3.45210          0.07505
Corrected Total   47   6.79667
Notice that just like in class, the first table presented is an F statistic with the same set up as ANOVA. The
hypotheses here are:
𝑯𝟎: 𝜷𝟏 = 𝟎 , 𝑯𝑨: 𝜷𝟏 ≠ 𝟎
In other words we assume that the slope 𝜷𝟏 in the formula
𝑾𝒊𝒅𝒕𝒉 = 𝜷𝟎 + 𝜷𝟏 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺
is zero, which would imply that Length does not help predict the Width. Since our p value is extremely
small we reject the null hypothesis and confirm that length CAN be used to predict width and that the
regression model might be a good predictor model for the relationship between the two.
The second table:
Root MSE 0.27394 R-Square 0.4921
Dependent Mean 3.44167 Adj R-Sq 0.4810
Coeff Var 7.95965
contains the Root MSE which is the square root of the Mean Square error we talked about in class. The
parameter estimates can be found in the table:
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -0.28624             0.55981          -0.51     0.6116
Length       1   0.74434              0.11150          6.68      <.0001
Based on those results our model becomes:
𝑾𝒊𝒅𝒕𝒉 = −0.28624 + 0.74434 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺
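The same simple linear regression can be sketched in Python with scipy's `linregress` (our addition, not part of the lab). Here we generate synthetic data from a known line, Width = -0.3 + 0.75·Length plus noise, so the fitted intercept and slope should land near the true values:

```python
import numpy as np
from scipy import stats

# Synthetic sepal-like data drawn from a known line (not the real Iris values).
rng = np.random.default_rng(0)
length = rng.uniform(4.0, 6.0, size=50)
width = -0.3 + 0.75 * length + rng.normal(0.0, 0.05, size=50)

fit = stats.linregress(length, width)
print(f"intercept = {fit.intercept:.3f}, slope = {fit.slope:.3f}")
print(f"R-square  = {fit.rvalue ** 2:.3f}, p = {fit.pvalue:.2e}")
```

The p value reported for the slope answers the same H0: β1 = 0 question as the ANOVA table above.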
The program also gives us the residuals for each observation pair in a scatterplot.
This basically shows how far from zero the residuals for each point are. As we can see, they are
somewhat evenly distributed around zero, with a moderate spread. So it seems like our model is a good
fit.
In the fit diagnostics panel we can get a lot more information about the residuals and if we focus on their
histogram (bottom left corner) we can safely say that the residuals DO follow a normal distribution around
zero. (One of the requirements of regression). Further evidence for that is the QQ plot of the residuals
(middle left graph). On the bottom right corner we see a summary of the regression analysis. Another
interesting graph is the middle one. To create it, SAS is using the model and the Length values and predicts
the Width values. Then it plots the predicted values vs the real ones. It would be ideal if all the plotted
points fell on the line y = x, but still our result is not that bad.
Finally we get a much better scatterplot with the 95% confidence interval lines and the 95% Prediction
limits (which we will talk about next week). This is the plot that I prefer putting in my results since it also
has a summary on the side.
Saving outputs and Further Analysis:
So far we have been interested in statistical outputs and graphs coming from SAS. But what if we wanted
to get the output of a tedious computation for further analysis? For example, in the computation of
linear regression we discussed that computing the residuals is one of the most important aspects.
Furthermore, testing the normality of their distribution is crucial. Earlier in the semester we learned a test
for normality, the Kolmogorov-Smirnov test.
The following code will allow us to save the extracted residuals in a new dataset and then apply any
analyses from the ones that we have already learned. The command is output and it can be used in other
SAS procedures not just PROC REG.
Title "Saving the residuals";
PROC REG Data=Iris;
Model Width=Length;
Output out=Results r=res;
RUN;
The extra line Output out=Results r=res; creates a new dataset utilizing the output of the process it is
attached to. The name of the dataset is Results and it is given by the command out=Results. Since we
want to keep the residuals in the output dataset we need to give the residuals a name in the output
dataset. The way to signify residuals is simply r and so I gave them the name res with the command r=res.
Let’s go ahead and PROC PRINT the dataset results and see what it looks like:
PROC PRINT DATA = Results;
RUN;
The dataset contains the columns of width and length but also the column of the residuals as a new
variable res. We can now apply all the tools we know and especially the normality test we learned in lab
7 namely:
TITLE "How Normal are the residuals?";
Ods select Histogram ParameterEstimates GoodnessOfFit FitQuantiles Bins;
PROC UNIVARIATE DATA = Results;
HISTOGRAM res/ normal(percents=20 40 60 80 midpercents);
INSET n normal(ksdpval) / pos = ne format =6.3;
RUN;
Since the p value of the KS test is greater than 0.05, we can safely say that our residuals probably follow a
normal distribution and everything is fine.
Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.07706671 Pr > D >0.150
Cramer-von Mises W-Sq 0.03377812 Pr > W-Sq >0.250
Anderson-Darling A-Sq 0.22661831 Pr > A-Sq >0.250
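The whole pipeline — fit, extract residuals, test their normality — can be sketched in Python (our illustration, on synthetic data, using scipy rather than SAS's procedures):

```python
import numpy as np
from scipy import stats

# Synthetic regression data (not the Iris values).
rng = np.random.default_rng(1)
length = rng.uniform(4.0, 6.0, size=100)
width = -0.3 + 0.75 * length + rng.normal(0.0, 0.05, size=100)

fit = stats.linregress(length, width)
res = width - (fit.intercept + fit.slope * length)  # the residuals ("r=res")

# KS test against a normal with the residuals' own mean and sd. Estimating
# the parameters from the data makes this only a rough check, like the lab's.
d, p = stats.kstest(res, "norm", args=(res.mean(), res.std(ddof=1)))
print(f"D = {d:.4f}, p = {p:.3f}")
```

Note that least-squares residuals always average to zero when the model has an intercept, so the interesting question is only their shape.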
For more information on the output functionality for regression in SAS check here:
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect015.htm
Lab # 11: Correlation and Model Fitting
Objectives:
1. Correlations between variables
2. Data Manipulation, creation of squares
3. Model fitting with regression
4. Comparison of models
Correlations between variables
Throughout the semester we have learned various methods of trying to identify relationships between
variables. When they were categorical, our only option was to check the corresponding contingency table
and try the chi-square analysis. When we were trying to compare a numerical variable with a categorical
one, we used the T-Test and ANOVA. In this section we will explore the idea of comparing two numerical
variables whose values are in one-to-one correspondence in our dataset. We will do
that with the use of the correlation coefficient (Pearson) mathematically and a simple scatterplot visually.
We will then use those results to build a model that will predict our output variable using the input
variables as best as possible.
The example we will use today is the dataset Larvae.csv which you can find on our
MOODLE page and contains the measurements of the Headwidth, Length and Mass,
including the gender for the Sirex Noctilio Larvae that we have seen in the past. Our
output variable here will be Mass and we will try to correlate that to Headwidth and
Length respectively. Apparently, Mass is the hardest measurement of the three, since
both Headwidth and Length are measured automatically by a modified photometer-
microscope, so a model to infer Mass based on the other two would be useful.
First let’s import our dataset by using the by now well-known proc import command as
follows:
PROC IMPORT OUT= Larvae
DATAFILE= "put the complete path here…\Larvae.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
The dataset is quite large, so we will modify the proc print procedure a bit to output only the first 10
observations
PROC PRINT DATA = Larvae (obs=10);
RUN;
What we would like to do then is compute the correlation coefficients for the pairs Mass with Length and
Mass with Headwidth respectively:
The procedure for that in SAS is PROC CORR. The following code will compute those correlations and output
them in a nice matrix form
Title "Correlations Mass vs Length and Mass vs Headwidth";
PROC CORR Data=Larvae;
Var Mass;
With Length Headwidth;
RUN;
We declare the output variable by using "VAR Mass" and the variables we want to compare it with (inputs)
by using "With Length Headwidth". After a table with simple statistics for each variable, we get the
Pearson correlations between Mass and the other two in a neat-looking table.
Pearson Correlation Coefficients, N = 4895
Prob > |r| under H0: Rho=0

              Mass (r)   Prob > |r|
Length        0.83863    <.0001
Headwidth     0.80712    <.0001
As we see, both correlation coefficients are quite high (close to 1) and thus we can assume that Mass is
positively correlated with Length and it is also positively correlated with Headwidth.
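For comparison, here is how a single Pearson correlation is computed in Python with scipy (our sketch, on fabricated larva-like data, not the Larvae.csv values):

```python
import numpy as np
from scipy import stats

# Fabricated data: mass grows with length, plus noise.
rng = np.random.default_rng(2)
length = rng.uniform(10.0, 30.0, size=200)
mass = 0.01 * length + rng.normal(0.0, 0.02, size=200)

# r is the Pearson coefficient; p tests H0: Rho = 0, as in the SAS table.
r, p = stats.pearsonr(length, mass)
print(f"r = {r:.3f}, p = {p:.2e}")
```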
Now if one wants to just compute all possible correlations amongst variables, the code is actually just:
Title "Correlations amongst all variables";
PROC CORR Data=Larvae;
RUN;
As you can see, if you don't declare which variable is your input and which is your output, SAS is going to
compute all pairwise correlations.
Pearson Correlation Coefficients, N = 4895
Prob > |r| under H0: Rho=0

             Length     Headwidth   Mass
Length       1.00000    0.85269     0.83863
                        <.0001      <.0001
Headwidth    0.85269    1.00000     0.80712
             <.0001                 <.0001
Mass         0.83863    0.80712     1.00000
             <.0001     <.0001
Armed with the knowledge that Mass is correlated with the other two variables, let’s create the
corresponding scatterplots that will showcase that connection using SGSCATTER. First we will see a
relationship between Mass and Length by using the code which will also output the best fit regression
line:
Title "Scatterplot with regression line Mass vs Length";
PROC SGSCATTER Data=Larvae;
Plot Mass*Length/ Reg= (nogroup degree=1) grid;
RUN;
Then let’s explore the relationship between Mass and Headwidth by the following code
Title "Scatterplot with regression line Mass vs Headwidth";
PROC SGSCATTER Data=Larvae;
Plot Mass*Headwidth/ Reg= (nogroup degree=1) grid;
RUN;
Since we have information about the gender of the larvae, we could reflect that in our analysis by creating
colored scatterplots and different fits. You can do that by extending the codes with the group=Gender
option and removing the "nogroup" option from the regression.
Title "Scatterplot with regression line Mass vs Length by genders";
PROC SGSCATTER Data=Larvae;
Plot Mass*Length/ group=Gender Reg= (degree=1) grid;
RUN;
Similarly for Mass compared to Headwidth:
Title "Scatterplot with regression line Mass vs Headwidth by genders";
PROC SGSCATTER Data=Larvae;
Plot Mass*Headwidth/ group=Gender Reg= (degree=1) grid;
RUN;
Unfortunately, the multitude of red points overshadows the blue, making this harder to read. But still, this
is a much better-looking picture.
Before we move forward, just by looking at the shape of the scatterplots, it is almost obvious that a higher
degree polynomial would fit a little bit better. Let’s try that with a second degree polynomial for both
independent of genders.
Title "Scatterplot with square regression line Mass vs Length";
PROC SGSCATTER Data=Larvae;
Plot Mass*Length/ Reg= (nogroup degree=2) grid;
RUN;
Title "Scatterplot with square regression line Mass vs Headwidth";
PROC SGSCATTER Data=Larvae;
Plot Mass*Headwidth/ Reg= (nogroup degree=2) grid;
RUN;
Besides being IMMENSELY more aesthetically pleasing, this also gives us an idea as to what the right
regression model might be. Namely, perhaps it is the squares of Length and Headwidth that we should be
using to predict the mass, not Length and Headwidth themselves.
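The line-versus-curve comparison behind those degree=2 scatterplots can be sketched numerically in Python with numpy (our illustration; the data below are synthetic, built with a genuine squared term so the quadratic should explain more variance):

```python
import numpy as np

# Synthetic data echoing the curved shape of the Mass scatterplots.
rng = np.random.default_rng(3)
x = rng.uniform(1.0, 5.0, size=200)
y = 0.03 - 0.07 * x + 0.07 * x ** 2 + rng.normal(0.0, 0.05, size=200)

def r_square(x, y, degree):
    coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
    resid = y - np.polyval(coeffs, x)
    return 1.0 - resid.var() / y.var()

r2_line = r_square(x, y, 1)   # degree=1, a straight line
r2_quad = r_square(x, y, 2)   # degree=2, a parabola
print(f"degree 1 R^2 = {r2_line:.4f}, degree 2 R^2 = {r2_quad:.4f}")
```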
Data Manipulations, creation of Squares of columns
In the previous section we noticed that having the square of Length and Headwidth as input might be a
good idea. Thus we will instruct SAS to expand our dataset by adding two new columns, called Length2
and Headwidth2 corresponding to the squares of Length and Headwidth respectively.
Basically we will create a new dataset called Larvae2 extending the old one as follows:
Data Larvae2;
SET Larvae;
Length2= Length*Length;
Headwidth2=Headwidth*Headwidth;
RUN;
Again, we can view the first 10 observations with
PROC PRINT DATA = Larvae2 (obs=10);
RUN;
The new columns of the squares of Length and Headwidth are now added. Using similar methods we can
create any algebraic expression for our variables or even between variables and save them as new
columns.
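The same column manipulation can be sketched in Python with pandas (our analogue of the Data step above, on a few made-up rows rather than the real Larvae.csv):

```python
import pandas as pd

# A few made-up rows standing in for Larvae.csv.
larvae = pd.DataFrame({
    "Length":    [10.0, 12.5, 15.0],
    "Headwidth": [1.2, 1.4, 1.6],
    "Mass":      [0.05, 0.08, 0.12],
})

# "Data Larvae2; SET Larvae; ..." analogue: copy, then add squared columns.
larvae2 = larvae.copy()
larvae2["Length2"] = larvae2["Length"] ** 2
larvae2["Headwidth2"] = larvae2["Headwidth"] ** 2
print(larvae2.head())
```

Any algebraic expression of the existing columns can be added the same way.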
Model fitting with regression
Suppose now that we want to find the best regression model that fits our dataset, and use that to predict
the mass given the other information. We will be using the extended dataset Larvae2. We start with a simple
regression between Mass and Length, followed by one between Mass and Headwidth.
Title "Linear Regression Mass vs Length";
PROC REG Data=Larvae2;
Model Mass=Length;
RUN;
Title "Linear Regression Mass vs Headwidth";
PROC REG Data=Larvae2;
Model Mass=Headwidth;
RUN;
These give us the corresponding regression outputs and plots.
Looking at the R-square, the regression using Length performs better than the one using Headwidth, since
0.7033 is larger than 0.6514 (the closer to 1 the better).
Looking at the MSE, the regression using Length again performs better, since 0.0035 is smaller than
0.0041 (the closer to 0 the better).
Thus if we were to use one linear model it should be the one with Length. By looking at both ANOVAs
we can also say that both Length and Headwidth are connected to Mass, although our previous analysis
with the correlations is a stronger indicator.
Finally the corresponding formulas for the regressions are:
𝑴𝒂𝒔𝒔 = −𝟎.𝟏𝟒𝟔𝟎𝟏+ 𝟎. 𝟏𝟔𝟐𝟓 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺
And:
𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟑𝟖𝟒𝟐+ 𝟏. 𝟒𝟖𝟐𝟓𝟏 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺
Comparison of Models:
The rest of this section is a preview of higher level classes. It also serves to point out that the statistical
analysis that we did in regression translates to other models not just straight lines. We will attempt various
models to predict Mass from the other variables and output a measure to compare them by.
First let us run various models by using different combinations in the line Model Mass= …
a) Mass as a linear function of Length and Headwidth.
This is an example of a multilinear model thus we are aiming for a relationship of the form:
𝑴𝒂𝒔𝒔 = 𝜷𝟎 + 𝜷𝟏 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜷𝟐 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺
The appropriate code is
Title "Linear Regression Mass vs Length and Headwidth";
PROC REG Data=Larvae2;
Model Mass=Length Headwidth;
RUN;
The result is very similar to the one we get from classic regression. The parameter estimates table contains
the coefficients we want, which are read as follows:
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -0.20214             0.00322          -62.74    <.0001
Length       1   0.10679              0.00273          39.07     <.0001
Headwidth    1   0.61936              0.02591          23.90     <.0001
𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟎𝟐𝟏𝟒 + 𝟎. 𝟏𝟎𝟔𝟕𝟗 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎. 𝟔𝟏𝟗𝟑𝟔 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺
What we also need to note here is the R-square value found in the table:
Root MSE 0.05613 R-Square 0.7343
Dependent Mean 0.08785 Adj R-Sq 0.7342
Coeff Var 63.89541
This will be our way of comparing different models. The one with the highest R-square is the one we need.
b) Mass as a function of Length and Length squared.
The formula then is
𝑴𝒂𝒔𝒔 = 𝜷𝟎 + 𝜷𝟏 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜷𝟐 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉𝟐 + 𝜺
The corresponding code is:
Title "Mass vs Length and Length2";
PROC REG Data=Larvae2;
Model Mass=Length Length2;
RUN;
Which yields:
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   0.03243              0.00376          8.61      <.0001
Length       1   -0.07522             0.00453          -16.59    <.0001
Length2      1   0.06857              0.00126          54.34     <.0001
And thus:
𝑴𝒂𝒔𝒔 = 𝟎. 𝟎𝟑𝟐𝟒𝟑 − 𝟎. 𝟎𝟕𝟓𝟐𝟐 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎. 𝟎𝟔𝟖𝟓𝟕 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉𝟐 + 𝜺
The R-square value is then:
Root MSE 0.04684 R-Square 0.8150
Dependent Mean 0.08785 Adj R-Sq 0.8149
Coeff Var 53.32208
Let us compare these two new models and add the linear regressions from before:
Model   Formula                                                              R-square
1       𝑴𝒂𝒔𝒔 = −𝟎.𝟏𝟒𝟔𝟎𝟏 + 𝟎.𝟏𝟔𝟐𝟓 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺                                 0.7033
2       𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟑𝟖𝟒𝟐 + 𝟏.𝟒𝟖𝟐𝟓𝟏 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺                             0.6514
3       𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟎𝟐𝟏𝟒 + 𝟎.𝟏𝟎𝟔𝟕𝟗 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎.𝟔𝟏𝟗𝟑𝟔 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺        0.7343
4       𝑴𝒂𝒔𝒔 = 𝟎.𝟎𝟑𝟐𝟒𝟑 − 𝟎.𝟎𝟕𝟓𝟐𝟐 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎.𝟎𝟔𝟖𝟓𝟕 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉𝟐 + 𝜺            0.8150
As we mentioned earlier the one with the highest R-square value is the one we should be focusing on. So
we say that the last model is the one that predicts Mass the best.
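The whole fit-several-models-and-compare-R-squares workflow can be sketched in Python with plain numpy least squares (our illustration; the data are synthetic, built so that the true relationship is quadratic in Length, which is why the Length + Length2 model should come out on top, echoing the lab's table):

```python
import numpy as np

# Synthetic larva-like data; mass is truly quadratic in length.
rng = np.random.default_rng(4)
n = 500
length = rng.uniform(1.0, 5.0, size=n)
headwidth = 0.3 * length + rng.normal(0.0, 0.1, size=n)
mass = 0.03 - 0.07 * length + 0.07 * length ** 2 + rng.normal(0.0, 0.05, size=n)

def r_square(predictors, y):
    # Append an intercept column and solve ordinary least squares.
    A = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

models = {
    "Length":           [length],
    "Headwidth":        [headwidth],
    "Length+Headwidth": [length, headwidth],
    "Length+Length2":   [length, length ** 2],
}
r2 = {name: r_square(X, mass) for name, X in models.items()}
for name, val in r2.items():
    print(f"{name:17s} R^2 = {val:.4f}")
```

As in the lab, the model whose inputs match the true relationship earns the highest R-square.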