Statistical Techniques Notes and Formula Sheet, Exam 2
Contents:
a) Formula Sheet
b) ANOVA table
c) Z-Score table
d) Lab Notes
Lab 08: Chi-Square Analysis, Hypothesis Tests and Confidence Intervals for Proportions, Hypothesis Testing using T-Tests, Pooled and Normal
Lab 09: Column Manipulation, ANOVA, Tukey Test, LSD Test, PROC GLM, Levene Test
Lab 10: Scatterplot with Regression, Linear Regression, Further Analysis, Output Saving
Lab 11: Correlation between Variables, Data Manipulation, Model Fitting with Regression, Model Comparison
Formula Sheet
SUMMARY STATISTICS
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance: $V = s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
Standard deviation: $s = \sqrt{V}$
Median: the middle number (or the average of the two middle numbers) in an ordered list
Lower Quartile: $Q_1$; 25% of the data lies below it
Upper Quartile: $Q_3$; 75% of the data lies below it
Interquartile Range: $IQR = Q_3 - Q_1$
Min: the smallest value in the dataset
Max: the largest value in the dataset
Range: $\text{Range} = \text{Max} - \text{Min}$
Upper Whisker: $UW = Q_3 + 1.5 \cdot IQR$
Lower Whisker: $LW = Q_1 - 1.5 \cdot IQR$
Mode: the most frequently occurring value
Geometric mean: the $n$th root of the product of the values of the $n$ observations
Midrange: the mean of the smallest and largest observed values
Coefficient of Variation: $CV = \frac{s}{\bar{x}} \cdot 100\%$
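As a quick sanity check of these formulas, here is a short sketch in Python; the dataset is made up for illustration and is not part of the labs.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative sample
n = len(data)

mean = sum(data) / n                                  # sample mean
var = sum((x - mean) ** 2 for x in data) / (n - 1)    # sample variance, n-1 denominator
sd = var ** 0.5                                       # standard deviation

# cross-check the hand-rolled formulas against the standard library
assert abs(mean - statistics.mean(data)) < 1e-12
assert abs(var - statistics.variance(data)) < 1e-12

cv = sd / mean * 100                                  # coefficient of variation, in percent
print(mean, var, round(cv, 2))
```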
PROBABILITY
Frequentist probability: $P(A) = \frac{\#\text{times event } A \text{ occurs}}{\#\text{experiments conducted}}$
Probability of union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Probability of complement: $P(A^C) = 1 - P(A)$
Probability of intersection (independent): $P(A \cap B) = P(A) \cdot P(B)$
Probability of intersection (general): $P(A \cap B) = P(A|B) \cdot P(B)$
Disjoint events: $P(A \cap B) = 0$
Conditional probability: $P(A|B) = \frac{P(A \cap B)}{P(B)}$
DISTRIBUTIONS-RANDOM VARIABLES
Expected Value (Discrete): $\mu = E(X) = \sum_{i=1}^{n} x_i \cdot P(X = x_i)$
Variability (Discrete): $\sigma^2 = V(X) = \sum_{i=1}^{n}(x_i - E(X))^2 \cdot P(X = x_i)$
Standard Deviation: $\sigma = \sqrt{V(X)} = SD(X)$
Uniform Distribution, U (Discrete): $U = \{1, 2, 3, \dots, k\}$, $P(U = i) = \frac{1}{k}$, $\mu_U = \frac{k+1}{2}$, $\sigma_U^2 = \frac{k^2-1}{12}$
Bernoulli Distribution, Ber (Discrete): $Ber = \{S, F\}$, $P(S) = p$, $P(F) = 1 - p$
Binomial Distribution, B (Discrete): $P(k \text{ successes in } n \text{ trials}) = \binom{n}{k} p^k (1-p)^{n-k}$, $\mu_B = np$, $\sigma_B^2 = np(1-p)$
Poisson Distribution, P (Discrete): $P(k \text{ occurrences}) = \frac{\lambda^k e^{-\lambda}}{k!}$, $\mu_P = \lambda$, $\sigma_P^2 = \lambda$
Continuous Distribution: a random variable $X$ together with a function $f : X \to \mathbb{R}^+$ such that $\int_X f(x)\,dx = 1$ is a continuous probability distribution.
Normal Distribution (Continuous): $N(\mu, \sigma)$
Standard Normal Distribution (Continuous): $Z = N(0, 1)$
T-distribution (Continuous): $T$, $df = n - 1$
Chi-square distribution (Continuous): $\chi^2_k$, $k = df$
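The binomial and Poisson mass functions above can be written directly in Python; a small sketch with hypothetical parameters, for illustration only:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    # P(k successes in n trials) = C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(k occurrences) = lambda^k * e^(-lambda) / k!
    return lam**k * exp(-lam) / factorial(k)

# sanity checks: the binomial pmf sums to 1 over k = 0..n,
# and its mean matches mu_B = n*p
n, p = 10, 0.3
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
print(round(mean, 6))
```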
HYPOTHESIS TESTING-CONFIDENCE INTERVALS
Central Limit Theorem: $\bar{X} \sim N\!\left(\mu,\; SE = \frac{\sigma}{\sqrt{n}}\right)$
Standard Error: $SE = \frac{\sigma}{\sqrt{n}}$
Confidence Interval (Z): point estimate $\pm\, z^{*} \times SE$
Confidence Interval (T): point estimate $\pm\, t^{*} \times SE$
Maximum Error: $ME = z^{*} \times SE$ or $ME = t^{*} \times SE$
One-sided tests: if the alternative is $>$, use the probability "above" the value; if the alternative is $<$, use the probability "below" the value; if the alternative is $\neq$, use twice the probability above the value.
Alpha: $\alpha = 0.05$ unless specified otherwise
Z-score: $Z = \frac{\text{point estimate} - \text{null value}}{SE}$
T-score: $T_{df} = \frac{\text{point estimate} - \text{null value}}{SE}$, $df = n - 1$
Chi-square score: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
ERRORS-POWER
Type 1 error: reject $H_0$ when $H_0$ is true; probability $\alpha$
Type 2 error: fail to reject $H_0$ when $H_0$ is false; probability $\beta$
Power: probability of correctly rejecting $H_0$; $\text{Power} = 1 - \beta$
Basic Z-scores (2-sided):
90%: $z^* = 1.645$
95%: $z^* = 1.96$
98%: $z^* = 2.326$
99%: $z^* = 2.576$
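The CI recipe above is mechanical; here is a minimal Python sketch, assuming hypothetical summary statistics (mean 3.7, sd 0.76, n = 82, chosen only for illustration):

```python
from math import sqrt

# hypothetical sample summary (illustrative numbers only)
xbar, s, n = 3.7, 0.76, 82

se = s / sqrt(n)          # standard error
z_star = 1.96             # two-sided 95% critical value
me = z_star * se          # maximum error
ci = (xbar - me, xbar + me)
print(round(se, 4), tuple(round(v, 4) for v in ci))
```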
Difference in two means (independent)
T statistic: $T_{df} = \frac{\text{point estimate} - \text{null value}}{SE}$
Standard Error: $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Degrees of freedom: $df = \min(n_1 - 1,\, n_2 - 1)$
CI: point estimate $\pm\, ME$, $ME = t^{*} \times SE$
Difference in two means (pooled)
Pooled Variance Estimate: $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$
Pooled SE: $SE = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
Degrees of freedom (pooled): $df_p = df_1 + df_2 = n_1 + n_2 - 2$
CI: point estimate $\pm\, ME$, $ME = t^{*} \times SE$
Difference in two proportions
Z statistic: $Z = \frac{\text{point estimate} - \text{null value}}{SE}$
Standard Error: $SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
CI: point estimate $\pm\, ME$, $ME = z^{*} \times SE$
Inferences on Variances
F distribution: $F(d_1, d_2)$, with density $f(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1}\, d_2^{d_2}}{(d_1 x + d_2)^{d_1+d_2}}}}{x \, B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$
F as a ratio of chi-squares: $s_1^2 \sim \chi^2(d_1)$, $s_2^2 \sim \chi^2(d_2) \;\Rightarrow\; F = \frac{s_1^2}{s_2^2} \sim F(d_1, d_2)$
ANOVA
Sum of Squares: $SS = \sum_i (x_i - \bar{x})^2$
Pooled Variance: $s_p^2 = \frac{SS_p}{df_p} = \frac{\sum_{i=1}^{k} SS_i}{\sum_{i=1}^{k} df_i}$
Bonferroni’s Alpha
M correction: $M = \binom{m}{2} = \frac{m!}{2!\,(m-2)!}$
Bonferroni alpha: $\alpha_{Bon} = \frac{\alpha}{M}$
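The Bonferroni correction is quick to illustrate in Python; the choice of m = 4 groups here is arbitrary:

```python
from math import comb

# Bonferroni correction for all pairwise comparisons among m groups
m = 4                     # number of groups (illustrative choice)
M = comb(m, 2)            # number of pairwise comparisons, C(m, 2)
alpha = 0.05
alpha_bon = alpha / M     # per-comparison significance level
print(M, alpha_bon)
```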
Linear Regression
Average Formula: $\mu(y|x) = \beta_0 + \beta_1 x$
Prediction Formula: $y = \beta_0 + \beta_1 x + \varepsilon$
SSE (Residuals): $SSE = \sum (y - \mu(y|x))^2$
Root Mean Square Error (RMSE): $RMSE = \sqrt{\frac{SSE}{df_{reg}}}$
Slope: $\beta_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
Intercept: $\beta_0 = \bar{y} - \beta_1 \bar{x}$
Correlation Coefficient
r formula: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \cdot \sum (y - \bar{y})^2}}$
r with SS: $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$
r and F: $F = \frac{(n-2)\, r^2}{1 - r^2}$
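The regression and correlation formulas above can be tied together on a tiny dataset; a Python sketch (the x and y values are made up for illustration):

```python
# least-squares slope/intercept and correlation on a small made-up dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
ss_xx = sum((x - xbar) ** 2 for x in xs)
ss_yy = sum((y - ybar) ** 2 for y in ys)

b1 = ss_xy / ss_xx                    # slope beta_1
b0 = ybar - b1 * xbar                 # intercept beta_0
r = ss_xy / (ss_xx * ss_yy) ** 0.5    # correlation coefficient
F = (n - 2) * r**2 / (1 - r**2)       # F statistic from r

print(round(b1, 2), round(b0, 2), round(r, 4))
```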
1
Source  | DF                       | Sum of Squares                              | Mean Square      | F Value       | Pr > F
Between | $DF_B = \#groups - 1$    | $SSB = \sum_i n_i (\bar{x}_i - \bar{x})^2$  | $MSB = SSB/DF_B$ | $F = MSB/MSW$ | p value
Within  | $DF_W = n - \#groups$    | $SSW = SST - SSB$                           | $MSW = SSW/DF_W$ |               |
Total   | $DF_T = n - 1$           | $SST = \sum_{i} (x_i - \bar{x})^2$          |                  |               |
Standard Normal Probabilities
Table entry for z is the area under the standard normal curve to the left of z.
[Z-score table omitted: left-tail areas for z from -3.4 to 3.4, with rows in steps of 0.1 and columns .00 through .09.]
Lab #8: Chi-Square-Confidence Intervals-Hypothesis Testing (3)
Objectives:
1. Chi-Square Analysis examples on Contingency Tables
2. Hypothesis Test and Confidence Intervals for Proportions
3. Hypothesis testing using T-Tests, Pooled and Normal
Chi-Square Analysis
Suppose that we want to analyze the following contingency table:
In order to import a contingency table in SAS we need to create a new dataset with the two categorical
variables and all the possible cases. We also need to define a dummy variable to hold all the numerical
entries. We call that variable counts. To create the dataset we use the code:
DATA Vaccines;
INPUT Autism $ Vaccinated $ count;
datalines;
Yes Yes 440
Yes No 184
No Yes 1240
No No 584
;
Notice the use of $ after the variables Autism and Vaccinated. Those are there to remind SAS that these
are categorical variables. Count, being a numerical variable, does not need a $ after it.
This has created and stored our dataset. In order for us to turn it into a contingency table and do the chi-square analysis we need to use the procedure FREQ as follows:
TITLE ‘Contingency table and chi square analysis’;
PROC FREQ data=Vaccines ORDER=DATA;
tables Autism*Vaccinated/ chisq expected norow nocol nopercent;
weight count;
RUN;
The command: tables Autism*Vaccinated instructs SAS to create a contingency table using the two
variables Autism and Vaccinated. The rest of this command works as follows:
chisq : do the chi square analysis
expected: compute and display the expected values for each cell
norow: do not show the row marginal probabilities
nocol: do not show the column marginal probabilities
nopercent: do not show general probabilities
The command ORDER=DATA tells the procedure freq to present the data in the order we put them in (so
no flipping the YES and NO anymore!).
The output is something like this:
Frequency
Expected
Table of Autism by Vaccinated

Autism    Vaccinated
          no        yes       Total
no        584       1240      1824
          572.24    1251.8
yes       184       440       624
          195.76    428.24
Total     768       1680      2448
And the statistics table is:
Statistic DF Value Prob
Chi-Square 1 1.3827 0.2396
Likelihood Ratio Chi-Square 1 1.3931 0.2379
Continuity Adj. Chi-Square 1 1.2676 0.2602
Mantel-Haenszel Chi-Square 1 1.3821 0.2397
Statistic DF Value Prob
Phi Coefficient 0.0238
Contingency Coefficient 0.0238
Cramer's V 0.0238
Once again, out of all the tests we focus on the first one. Thus we say that the p value is 0.2396, which is much larger than 0.05, so we fail to reject the null hypothesis. This dataset does not provide enough evidence to support the claim that Autism and Vaccines are dependent.
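The chi-square statistic SAS reports can be reproduced by hand from the observed counts; here is a quick Python sketch of that computation (an illustration, not part of the lab code):

```python
# observed 2x2 counts from the Vaccines dataset
# rows: Autism no/yes, columns: Vaccinated no/yes
obs = [[584, 1240],
       [184, 440]]

row_tot = [sum(row) for row in obs]
col_tot = [sum(col) for col in zip(*obs)]
n = sum(row_tot)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n   # E = (row total * column total) / n
        chi2 += (obs[i][j] - expected) ** 2 / expected

print(round(chi2, 4))   # matches the SAS output: 1.3827
```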
Hypothesis Testing and confidence intervals for proportions
In the following section we will explore with sas the idea of a hypothesis testing with Z-scores for
proportions.
It is estimated that 8% of the world population is colorblind. Pingelap is a South Pacific island with a strange nickname: it is called "the colorblind island", since 26.4% of its population, 66 people, is colorblind and almost everybody has some sort of color deficiency. Let's test whether the two proportions are truly different, given that the island has 250 people.
First we need to create a dataset in sas. Let’s call it Color and have two variables one for location and the
other for states of colorblindness.
DATA Color;
INPUT Location $ Colorblindness $ count;
datalines;
World Yes 8
World No 92
Pingelap Yes 26.4
Pingelap No 73.6
;
This has created our dataset. Now to view it and do the z-test comparison of the ratios we need to use
the following code:
TITLE "Z score for proportions test and 95% confidence ";
PROC FREQ data=Color ORDER=DATA;
tables Location*Colorblindness/ riskdiff(equal var=null cl=wald) norow nocol
nopercent alpha=0.05;
weight count;
RUN;
Once again the procedure frequency will output the table in a nice 2 by 2 format and once more we have
norow: do not show the row marginal probabilities
nocol: do not show the column marginal probabilities
nopercent: do not show general probabilities
The z-test output is produced by the option riskdiff. In the SAS documentation for this option we learn that binomial proportions are called "risks," so a "risk difference" is a difference in proportions.
The option equal var=null requests the test of equality of the two proportions with the standard error computed under the null hypothesis, and cl=wald instructs SAS to use the Wald method to compute the confidence limits, which uses the standard-error formula we learned in class, i.e.
$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
You can learn more about the procedure freq here:
http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_freq_s
yntax08.htm
As you can see, the 95% confidence interval for the difference of the two proportions is computed as well. Its level is controlled by the option alpha=0.05.
Let’s change that to 0.1 and see what happens with the code:
TITLE "Z score for proportions test and 90% confidence ";
PROC FREQ data=Color ORDER=DATA;
tables Location*Colorblindness/ riskdiff(equal var=null cl=wald) norow nocol
nopercent alpha=0.1;
weight count;
RUN;
Obviously there is no change in the proportions difference test and we get:
Proportion (Risk) Difference Test
H0: P1 - P2 = 0 Wald Method
Proportion Difference -0.1840
ASE (H0) 0.0534
Z -3.4477
One-sided Pr < Z 0.0003
Two-sided Pr > |Z| 0.0006
Column 1 (Colorblindness = Yes)
What changes is the confidence limits, which now give us:
Confidence Limits for the
Proportion (Risk) Difference
Proportion Difference = -0.1840
Type 90% Confidence Limits
Wald -0.2691 -0.0989
Column 1 (Colorblindness = Yes)
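The test statistic and Wald limits can be checked by hand from the formulas. As entered, the percentages act as counts out of 100, which is what SAS's computation sees; a Python sketch (small rounding differences from the SAS output are possible):

```python
from math import sqrt

# proportions from the Color dataset; the percentages were entered as
# counts, so each group effectively has n = 100
p1, n1 = 0.08, 100     # world
p2, n2 = 0.264, 100    # Pingelap

diff = p1 - p2

# standard error under H0 (equal var=null): pooled proportion
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se_null = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = diff / se_null

# Wald standard error for the confidence limits
se_wald = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci90 = (diff - 1.645 * se_wald, diff + 1.645 * se_wald)

# z is close to SAS's -3.4477; the 90% limits match (-0.2691, -0.0989)
print(round(z, 4), tuple(round(v, 4) for v in ci90))
```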
Hypothesis testing using T-Tests, Pooled and Normal
As you recall, the PROC IMPORT statement is the best way to bring in external datasets. The CSV file we will be using is called “Qs3small.csv”. Download and save it in your favorite folder and note the complete path to it. Then use the following code to import it, making sure you put the correct path in the DATAFILE argument. Save the dataset as Qs3.
PROC IMPORT OUT= Qs3
DATAFILE= "put the complete path here…\Qs3small.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
This dataset contains some of the results from questionnaire 3. Let’s try to see if there is any difference in the reported understanding of the lectures based on gender. To keep things tidy, let’s first sort our dataset by gender using the code:
PROC SORT Data=Qs3;
by Gender;
RUN;
And then let’s print our dataset with proc print.
PROC PRINT Data=Qs3;
RUN;
Let’s also do some basic summary statistics on the average understanding of the whole population using
proc univariate as follows:
TITLE "Summary statistics for Lectures, all students";
PROC UNIVARIATE Data=Qs3 plot;
var Lectures;
RUN;
We notice that the total mean is 3.6951 and the standard deviation is 0.757. We proceed to compute the
summary statistics based on different genders with the slightly modified proc univariate
TITLE "Summary statistics for Lectures, by gender ";
PROC UNIVARIATE Data=Qs3 plot;
var Lectures;
by Gender;
RUN;
Notice that instead of getting 2 categories (M, F) we got 3! Fear not: one corresponds to a single observation that was missing the gender entry, so it became a group of its own.
Gender=F
Moments
N 18 Sum Weights 18
Mean 3.61111111 Sum Observations 65
Std Deviation 0.63142126 Variance 0.39869281
Skewness 0.51682132 Kurtosis -0.6481221
Uncorrected SS 241.5 Corrected SS 6.77777778
Coeff Variation 17.4855119 Std Error Mean 0.14882742
Gender=M
Moments
N 22 Sum Weights 22
Mean 3.75 Sum Observations 82.5
Std Deviation 0.86945522 Variance 0.75595238
Skewness -0.5080564 Kurtosis 0.6958176
Uncorrected SS 325.25 Corrected SS 15.875
Coeff Variation 23.1854726 Std Error Mean 0.18536848
As we notice, we need to compare the two means, but we have different numbers of observations and different standard deviations. Fear not! SAS will do the computation for us immediately.
But first we need to take care of that missing value, otherwise SAS will not know which groups to compare.
To do that we will create a new set called Qs3clean, by removing that observation without a gender entry
as follows:
DATA Qs3Clean;
SET Qs3;
if Gender = ' ' then delete;
RUN;
As you can see, the new dataset Qs3Clean is created from the original set Qs3 using the if statement, which says: if the variable Gender is empty, then delete that observation (very logical!)
Let’s see what this new dataset looks like with proc print.
TITLE "Questionnaire 3 without missing Gender";
PROC PRINT Data=Qs3Clean;
RUN;
So now we can continue with the t-test analysis, which is basically the following simple code:
TITLE "T-test comparison between Lectures for different genders";
ods graphics on;
PROC TTEST Data=Qs3Clean cochran ci=equal umpu;
class Gender;
var Lectures;
RUN;
ods graphics off;
Notice that we need to define the grouping variable for the comparison as class and the numerical variable as var (as usual).
Also, I turned the graphics on, in case I had turned them off at some point, to get all the graphs produced.
Finally, the options cochran ci=equal umpu instruct SAS to include the Cochran approximation of the t-test (another, more “data-driven” method of computation); ci=equal tells SAS to create an equal-tailed confidence interval (the default), and umpu specifies an interval based on the uniformly most powerful unbiased test, i.e. using the “best” variance.
There are various different outputs popping up. The following table contains the summary statistics for
the two genders and their computed difference
Gender N Mean Std Dev Std Err Minimum Maximum
F 18 3.6111 0.6314 0.1488 3.0000 5.0000
M 22 3.7500 0.8695 0.1854 1.5000 5.0000
Diff (1-2) -0.1389 0.7721 0.2454
Then we get 95% confidence intervals for F, M and the difference, whether we use pooled standard errors or not (Satterthwaite method):
Gender Method Mean 95% CL Mean Std Dev 95% CL Std Dev 95% UMPU CL Std Dev
F 3.6111 3.2971 3.9251 0.6314 0.4738 0.9466 0.4654 0.9257
M 3.7500 3.3645 4.1355 0.8695 0.6689 1.2425 0.6593 1.2206
Diff (1-2) Pooled -0.1389 -0.6356 0.3579 0.7721 0.6310 0.9951 0.6258 0.9856
Diff (1-2) Satterthwaite -0.1389 -0.6203 0.3425
Finally we get the corresponding p values for a two sided ttest comparison and again we get all cases
including a new one:
Method Variances DF t Value Pr > |t|
Pooled Equal 38 -0.57 0.5747
Satterthwaite Unequal 37.534 -0.58 0.5625
Cochran Unequal . -0.58 0.5658
The pooled one is the one we did in class, and Satterthwaite is the standard one for unequal variances. Cochran is another method of computing p-values, using a completely different approximation that estimates variances and standard errors from the data.
We will focus on the first two depending on what the problem calls for. In this case though any of the
three is ok. And even if we did a one-sided hypothesis (halving all of them) the p values would be much
bigger than 0.05, so we fail to reject the null hypothesis. In other words the dataset does not provide
enough evidence to show a difference in understanding of the Lecture between genders.
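The pooled results can be reproduced from the group summary statistics in the tables above; a quick Python sketch of the formulas from the sheet:

```python
from math import sqrt

# summary statistics from the TTEST output above
n1, m1, s1 = 18, 3.61111111, 0.63142126   # F
n2, m2, s2 = 22, 3.75,       0.86945522   # M

# pooled variance, pooled standard error, and t statistic
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
t = (m1 - m2) / se
df = n1 + n2 - 2

# df = 38, s_p = 0.7721, SE = 0.2454, t = -0.57, matching the SAS tables
print(df, round(sqrt(sp2), 4), round(se, 4), round(t, 2))
```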
Finally the table
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 21 17 1.90 0.1847
contains the F value and the corresponding p value for the comparison of the two variances. Notice that
we also get the histograms for free with the QQ-plots in case we are wondering if the results follow a
normal distribution.
To learn more about the ttest procedure follow this link:
https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_ttest_
sect002.htm
Lab #9: ANOVA and TUKEY tests
Objectives:
1. Column manipulation in SAS
2. Analysis of variance
3. Tukey test
4. Least Significant Difference test
5. Analysis of variance with PROC GLM
6. Levene test for Homoscedasticity
Column Manipulation
Analysis of variance is used when 3 or more different groups of observations need to be compared with
each other. The best scenario is for example comparing 3 or more different treatments on a population,
or 2 or more treatments and a control, in an experiment, where the number of samples is the same in
each group and independence within and between is controlled.
One such example is the dataset Diets.csv which you can find on our MOODLE page and contains the
results of weight loss for 75 individuals that followed 3 different programs. The question for the researcher
is first, are the outputs of the programs different? If yes which one seems to be better?
First let’s import our dataset by using the by now well-known proc import command as follows:
PROC IMPORT OUT= Diets
DATAFILE= "put the complete path here…\Diets.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
Using proc print we can view the dataset and notice that it contains the observation number, the gender (0 female, 1 male), the age, the height and the diet number (1, 2, 3). It also contains 2 columns, pre_weight and post_weight, measured in kilos.
PROC PRINT DATA = Diets;
RUN;
What we would like to have is the difference in weight before and after the diet, so we can compare those averages among the three diets. Thus we will create a new dataset, called DietsClean, that has the difference in weight and just the type of diet the person followed, disregarding all other information. To do that we will use the same subsetting tools we learned in the past, but this time we will create a new column by doing arithmetic on variables, namely:
DATA DietsClean;
SET Diets;
WeightDif = pre_weight-post_weight;
KEEP Diet WeightDif;
RUN;
Notice that we created a new set by invoking the command DATA followed by the name of the new
dataset, in this case DietsClean. The command SET Diets instructs SAS to use the dataset Diets to create
the new columns. The line WeightDif = pre_weight-post_weight creates a new variable (column) called
WeightDif by subtracting the variable post_weight from the variable pre_weight. It is neat that SAS
understands column-wise operations! Finally with the command KEEP Diet WeightDif we instruct SAS to
keep only those two variables (columns) in the new dataset we created and discard the rest.
Go ahead and use proc print to see what the new dataset looks like.
PROC PRINT DATA = DietsClean;
RUN;
Before we do ANOVA, let’s explore our dataset, and do summary statistics for the different diets, including
the boxplots. PROC UNIVARIATE will do the trick once again. Always in these types of analyses start with
some exploratory statistics. It helps justify the need for ANOVA, regression or other “heavy machinery”!
ods graphics on;
TITLE "Statistics By Diet";
PROC UNIVARIATE Data=DietsClean plots;
var WeightDif;
by Diet;
RUN;
As you recall, the command by Diet will present the summary statistics, of the variable of choice (var
WeightDif) for the three different diets. Adding the command plots creates all the histograms and
boxplots, including the side by side boxplot at the end. Already from that plot we can be sure that there
is a difference between the diets and probably the last one is the best. But we will analyze it further to be
absolutely sure. Just to be on the safe side I added the global command ods graphics on, in case you
turned it off earlier on since we do need to see the output plots.
ANALYSIS OF VARIANCE
Once again, remember that our null and alternative hypotheses are
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_A:$ at least one of these equalities is not true
Notice that all three groups have the same amount of observations (25). This is called a “balanced”
experiment, and it is preferable. ANOVA WILL work on “unbalanced groups” just fine but it is easier to
showcase the validity of it if the groups have the same amount of observations. As a matter of fact, the
ANOVA procedure that we will use works better if you have the same amount of observations, and the
SAS programmers suggest using a different procedure, PROC GLM, when dealing with unbalanced
experiments. We‘ll get to that soon. For now the following code will do the trick:
PROC ANOVA Data=DietsClean;
class Diet;
model WeightDif=Diet;
RUN;
As you can see a very easy code, and it is also easy to explain. First of all we are using the procedure
ANOVA on the dataset DATA=DietsClean. Then we need to use the command class which provides the
variable (column) that contains the different groups. In this case we have the 3 diets (1, 2, 3) found in the
column Diet, hence class Diet is used. Then SAS wants to know what variable to analyze based on what
class, basically how to create the analysis model. The syntax for that is model “the variable of interest” =
“the variable of different classes”.
A couple of tables pop up after the code is implemented. We have the general description:
Statistics By Diet
The ANOVA Procedure
Class Level Information
Class Levels Values
Diet 3 1 2 3
Number of Observations Read 75
Number of Observations Used 75
SAS informs us that it used all the observations (75 out of 75) and that it sees 3 levels (groups) with values
1,2,3 respectively in the column that contains the groups (in this case Diet).
This is followed by the table we analyzed in class, which is the main table you should report when doing
an analysis of variance, namely:
Statistics By Diet
The ANOVA Procedure
Dependent Variable: WeightDif
Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 52.6344000 26.3172000 4.60 0.0131
Error 72 411.6624000 5.7175333
Corrected Total 74 464.2968000
If we focus on the top right cell, we see that our p-value is 0.0131 which is smaller than the default alpha
0.05. So we reject the null hypothesis and claim that at least one of the diets has a different average
weight loss than the rest.
We also get the boxplot of the distributions with the F value and the p value as a nice plot to put in our
posters!
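The mean-square and F entries of the ANOVA table follow directly from the DF and Sum of Squares columns; a quick Python check of the table above:

```python
# reconstruct the F statistic from the ANOVA table above
ssb, dfb = 52.6344, 2       # Model (between groups)
ssw, dfw = 411.6624, 72     # Error (within groups)

msb = ssb / dfb             # mean square between: SSB / DFB
msw = ssw / dfw             # mean square within:  SSW / DFW
f = msb / msw               # F = MSB / MSW, about 4.60 as in the table

print(round(msb, 4), round(msw, 7), round(f, 2))
```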
Tukey Test
Let’s proceed now with the post-ANOVA analysis and start with pairwise “adjusted” comparisons, namely
the Tukey test. This can be achieved by extending the ANOVA code and adding the means computation,
namely:
PROC ANOVA Data=DietsClean;
class Diet;
model WeightDif=Diet;
means Diet/ tukey alpha=0.05;
RUN;
The line means Diet/ tukey instructs SAS to compare the means of the groups found in the variable Diet
using the tukey pairwise comparison test we learned in lecture. The confidence level is indicated by the
parameter alpha=0.05.
What we obtain is a “grouping” of means that look the same, vs groupings of means that look different.
In the form of the following table:
Means with the same letter
are not significantly different.
Tukey Grouping Mean N Diet
A 4.9080 25 3
B 3.1680 25 1
B
B 3.0960 25 2
This implies that the average weight-loss for diets 1 and 2 is similar, therefore we group them together
under the arbitrary “B” group symbol. On the other hand, Diet 3 seems different from the other two so it
creates its own grouping, again arbitrarily given the letter “A”.
Least Significant Difference test
Another interesting pairwise comparison we can do is that of the Least Significant Difference (LSD), or the
Fisher procedure as it is known. Remember that this method again performs a T-test comparison, between
the differences of all possible pairs of means using the following formula for the standard error:
$SE = \sqrt{\frac{2 \cdot MSW}{m}}$
where MSW is the mean square error within groups and m is the number of observations in each group.
Then instead of outputting the p-values it outputs a symbol to indicate significant differences between
the groups and a confidence interval around the pairwise difference between the means. Once again we
need to augment the ANOVA code a bit to get:
PROC ANOVA Data=DietsClean;
class Diet;
model WeightDif=Diet;
means Diet/ lsd cldiff alpha=0.05;
RUN;
Once again, the line means Diet /lsd cldiff alpha=0.05 instructs SAS to pairwise compare the means for
each group found in the variable Diet by using the LSD test. Cldiff will output the confidence interval
differences and alpha=0.05 is the user defined alpha level.
The output table looks like this:
Comparisons significant at the 0.05 level
are indicated by ***.
Diet Comparison | Difference Between Means | Simultaneous 95% Confidence Limits
3 - 1 1.7400 0.1215 3.3585 ***
3 - 2 1.8120 0.1935 3.4305 ***
1 - 3 -1.7400 -3.3585 -0.1215 ***
1 - 2 0.0720 -1.5465 1.6905
2 - 3 -1.8120 -3.4305 -0.1935 ***
2 - 1 -0.0720 -1.6905 1.5465
As you see all comparisons happen twice, but it is good to have both, depending on which Diet we want
to put first or second in our pairwise comparisons. Again we can see that the comparison between 1 and
2 does not yield a significant difference, so once again diets 1 and 2 are almost the same. Diet 3 on the
other hand creates significant differences from both 1 and 2, as we can see in the highlighted part of the
table.
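The LSD standard error for this example can be computed from the MSW of the earlier ANOVA table; a quick Python sketch (note the confidence limits in the table additionally involve a t critical value on top of this SE):

```python
from math import sqrt

# LSD standard error for the Diets example:
# MSW from the ANOVA table, m = 25 observations per group (balanced design)
msw = 5.7175333
m = 25
se = sqrt(2 * msw / m)   # SE = sqrt(2 * MSW / m)
print(round(se, 4))
```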
Analysis of Variance with PROC GLM
So far we have been working with a balanced experiment, where all the subgroups have the same number of samples. Suppose though that we want to use unbalanced groups. For example, let’s explore the dataset Tomato.csv found on MOODLE. In this dataset the researchers have catalogued 3 versions of the same tomato breed, M82. Besides the regular one, they have created two strains, A and B, by selective breeding with other varieties, and they have catalogued various characteristics of the roots and plants after a few weeks of cultivation.
Our question is to check if there is a difference in the average length of the roots among the three varieties
which we will answer by doing an unbalanced one way ANOVA on the last column titled “length”.
Let’s import that dataset
PROC IMPORT OUT= Tomato
DATAFILE= "put the complete path here…\Tomato.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
Using proc print we can see that the dataset contains 42 (!?!) observations. The main variables of interest
are Genotype and Length. We could create a smaller dataset that contains only these 2 but we will
proceed with all of them for now.
PROC PRINT DATA = Tomato;
RUN;
Before we analyze our data, we should sort it by genotype using the code:
PROC SORT DATA = Tomato;
BY Genotype;
RUN;
Let us try some summary statistics first by the different genotypes using the code:
ods graphics on;
TITLE "Statistics By Genotype";
PROC UNIVARIATE Data=Tomato plots;
var Length;
by Genotype;
RUN;
One of the things we learn is that the three groups M82, A and B have different numbers of observations (13, 14 and 15 respectively). This could be due to errors in data entry, failures in crops, failures in saving the data, and so on. In this case the experimenter could either
a) Make all groups equal by removing the appropriate amount of observations at random from the
groups with extra observations.
b) Utilize an unbalanced one way ANOVA through PROC GLM.
Here we will present how the second procedure is implemented, including the LSD test. As a matter of fact,
the code is VERY similar to PROC ANOVA; the main difference is that the procedure used is PROC GLM, as follows.
Title "Unbalanced ANOVA for Tomato varieties";
PROC GLM Data=Tomato;
class Genotype;
model Length=Genotype;
means Genotype/ lsd cldiff alpha=0.05;
RUN;
We basically use PROC GLM (general linear model) instead of PROC ANOVA. The GLM procedure is very
useful and can be thought of as a model-exploration tool which includes ANOVA as a special case. More
on that later.
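Outside SAS, the same unbalanced one-way ANOVA can be sketched in Python: scipy's `f_oneway` accepts groups of different sizes, just like PROC GLM handles unbalanced designs. The numbers below are hypothetical, not the real Tomato lengths:

```python
from scipy import stats

# Hypothetical root lengths; the groups deliberately have DIFFERENT sizes.
m82    = [120.0, 135.0, 128.0]
line_a = [150.0, 142.0, 155.0, 149.0]
line_b = [110.0, 98.0, 105.0, 101.0, 99.0]

f_val, p_val = stats.f_oneway(m82, line_a, line_b)
print(f"F = {f_val:.2f}, p = {p_val:.4f}")
# A large p (as in the lab's Tomato analysis) would mean: fail to reject H0.
```

Here the made-up groups differ a lot, so the p value comes out small; with the Tomato data the lab finds the opposite.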
Let’s explore the tables now
Class Level Information
Class Levels Values
Genotype 3 Line A Line B M82
Number of Observations Read 42
Number of Observations Used 42
This table gives us general information about our model. We have three different groups, Line A, Line B and
M82, and we used all 42 observations in the analysis. The familiar table:
Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 25918.388 12959.194 0.45 0.6425
Error 39 1129472.088 28960.823
Corrected Total 41 1155390.476
follows, which contains the p value. Since that is large, we fail to reject the null hypothesis and conclude
that there does not seem to be a difference between the average root lengths in the three varieties.
The pairwise comparisons using Fisher's LSD test also give no significant pairs (no *** next to any
comparison), as we can see here:
Comparisons significant at the 0.05 level are indicated by ***.
Genotype Comparison    Difference Between Means    95% Confidence Limits
Line A - M82            25.54                      -104.90   155.97
Line A - Line B         59.71                       -68.20   187.63
M82 - Line A           -25.54                      -155.97   104.90
M82 - Line B            34.18                       -98.41   166.76
Line B - Line A        -59.71                      -187.63    68.20
Line B - M82           -34.18                      -166.76    98.41
Levene test for Homoscedasticity:
To test if the variances for the different groups are the same or statistically different, we can again use
proc glm to produce the Levene test. The code is similar to before, namely:
Title "Anova with Levene test for Tomato varieties";
PROC GLM Data=Tomato;
class Genotype;
model Length=Genotype;
means Genotype/ HOVTEST alpha=0.05;
RUN;
What was added is the option HOVTEST after the means statement, which instructs SAS to run a Levene test
after finishing the ANOVA computations.
After getting again all the information about the ANOVA tables as before we also get the table:
Levene's Test for Homogeneity of Length Variance
ANOVA of Squared Deviations from Group Means
Source DF Sum of Squares Mean Square F Value Pr > F
Genotype 2 1.198E10 5.9899E9 3.02 0.0605
Error 39 7.745E10 1.9859E9
As we can see the p value is slightly larger than 0.05, so we fail to reject the null hypothesis and
conclude (marginally) that the variances amongst the three groups are probably the same.
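Levene's test can also be sketched in Python with scipy (an assumption of these notes, not part of the lab). One caveat: scipy's test uses absolute deviations from the group centers, while SAS's default Levene test uses squared deviations, so the statistics will not match SAS exactly. The data below are made up, with the third group given a visibly larger spread:

```python
from scipy import stats

# Hypothetical groups; g3 has a much larger spread than g1 and g2.
g1 = [5.0, 6.1, 5.5, 6.4, 5.8]
g2 = [5.2, 6.0, 5.6, 6.2, 5.9]
g3 = [2.0, 9.5, 4.1, 8.8, 3.0]

# center="mean" is closest in spirit to SAS's HOVTEST=LEVENE (which deviates
# from group means); scipy's default center="median" is the Brown-Forsythe variant.
w, p = stats.levene(g1, g2, g3, center="mean")
print(f"W = {w:.3f}, p = {p:.4f}")
# p < 0.05 rejects equal variances; p > 0.05 (as in the lab) does not.
```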
Lab # 10: Regression Analysis
Objectives:
1. Scatterplot with Regression Line
2. Linear Regression
3. Saving outputs and further analysis
Scatterplot
As mentioned in the lecture, the first thing one should try when
analyzing two numerical variables, in order to identify some correlation
between them, is a simple scatterplot. We mentioned this process in Lab
2, but now is a good time to review and expand on the ways of creating
scatterplots.
The example we will use today is the dataset Iris.csv which you can find
on our MOODLE page and contains the measurements of the sepal
length and width of the beautiful wildflower Iris Setosa which you can
see on the left. This is a circum-arctic wildflower found in Alaska,
Northern USA, Russia, Northeastern Asia, China, Korea and Japan and it
is considered an endangered species (critically threatened).
First let’s import our dataset by using the by now well-known proc import command as follows:
PROC IMPORT OUT= Iris
DATAFILE= "put the complete path here…\Iris.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
Using proc print we can view the dataset and notice that it contains the length and the width of the sepals
measured in centimeters.
PROC PRINT DATA = Iris;
RUN;
Recall, that creating a scatterplot can be done with the PROC GPLOT which we learned in lab 2 as follows:
SYMBOL V=circle C=Blue;
Title "Scatterplot Width vs Length";
PROC GPLOT Data=Iris;
Plot Width*Length;
RUN;
The SYMBOL command instructs SAS to use blue circles to represent the points. The line Plot
Width*Length says that the Width variable will be represented on the y axis and the variable Length on
the x axis.
So Gplot creates beautiful graphs and it is very versatile, but the best way to create scatterplots is using
the procedure SGSCATTER. This procedure also fits and outputs various lines to our scatterplot including
a linear regression. The code is as follows:
Title "Scatterplot with regression line Width vs Length";
PROC SGSCATTER Data=Iris;
Plot Width*Length/ Reg= (nogroup degree=1) grid;
RUN;
The command responsible for creating the
regression is the: Reg = (nogroup degree=1).
This instructs SAS to add a regression line of
degree 1 (a straight line) and to not group the
variables using some other variable. If for
example you had two groups of observations (2
different species) SAS would fit a regression line
to each group and present all of them in a nice
graph. The command grid creates a grid of lines
on the graph so that you can easily see the
values of the points.
Linear Regression (Simple)
We will now try to create a simple linear regression model and analyze all the results. The procedure here
is PROC REG used as follows:
Title "Simple Linear Regression";
PROC REG Data=Iris;
Model Width=Length;
RUN;
Just like in PROC GLM, the connection between the variables is denoted by the line
Model Width=Length; If no other variables are present, SAS will automatically try a linear regression
between the two variables presented in the model. Let's now try to explain all the outputs, starting with the
analysis of variance table:
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   3.34457          3.34457       44.57     <.0001
Error             46   3.45210          0.07505
Corrected Total   47   6.79667
Notice that just like in class, the first table presented is an F statistic with the same set up as ANOVA. The
hypotheses here are:
𝑯𝟎: 𝜷𝟏 = 𝟎 , 𝑯𝑨: 𝜷𝟏 ≠ 𝟎
In other words we assume that the slope 𝜷𝟏 in the formula
𝑾𝒊𝒅𝒕𝒉 = 𝜷𝟎 + 𝜷𝟏 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺
is zero, which would imply that Length does not help predict the Width. Since our p value is extremely
small we reject the null hypothesis and confirm that length CAN be used to predict width and that the
regression model might be a good predictor model for the relationship between the two.
The second table:
Root MSE 0.27394 R-Square 0.4921
Dependent Mean 3.44167 Adj R-Sq 0.4810
Coeff Var 7.95965
contains the Root MSE which is the square root of the Mean Square error we talked about in class. The
parameter estimates can be found in the table:
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -0.28624             0.55981          -0.51     0.6116
Length       1   0.74434              0.11150          6.68      <.0001
Based on those results our model becomes:
𝑾𝒊𝒅𝒕𝒉 = −0.28624 + 0.74434 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺
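The same simple linear regression can be sketched in Python with scipy's `linregress` (our addition, not part of the lab). Here we generate synthetic data from a known line, Width = -0.3 + 0.75·Length plus noise, so the fitted intercept and slope should land near the true values:

```python
import numpy as np
from scipy import stats

# Synthetic sepal-like data drawn from a known line (not the real Iris values).
rng = np.random.default_rng(0)
length = rng.uniform(4.0, 6.0, size=50)
width = -0.3 + 0.75 * length + rng.normal(0.0, 0.05, size=50)

fit = stats.linregress(length, width)
print(f"intercept = {fit.intercept:.3f}, slope = {fit.slope:.3f}")
print(f"R-square  = {fit.rvalue ** 2:.3f}, p = {fit.pvalue:.2e}")
```

The p value reported for the slope answers the same H0: β1 = 0 question as the ANOVA table above.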
The program also gives us the residuals for each observation pair in a scatterplot.
This basically shows how far from zero the residuals for each point are. As we can see, they are
somewhat evenly distributed around zero, with a moderate spread. So it seems like our model is a good
fit.
In the fit diagnostics panel we can get a lot more information about the residuals and if we focus on their
histogram (bottom left corner) we can safely say that the residuals DO follow a normal distribution around
zero. (One of the requirements of regression). Further evidence for that is the QQ plot of the residuals
(middle left graph). On the bottom right corner we see a summary of the regression analysis. Another
interesting graph is the middle one. To create it, SAS is using the model and the Length values and predicts
the Width values. Then it plots the predicted values vs the real ones. It would be ideal if all the plotted
points fell on the line y = x, but still our result is not that bad.
Finally we get a much better scatterplot with the 95% confidence interval lines and the 95% Prediction
limits (which we will talk about next week). This is the plot that I prefer putting in my results since it also
has a summary on the side.
Saving outputs and Further Analysis:
So far we have been interested in statistical outputs and graphs coming from SAS. But what if we wanted
to get the output of a tedious computation for further analysis? For example, in the computation of
linear regression we discussed that computing the residuals is one of the most important aspects.
Furthermore, testing the normality of their distribution is crucial. Earlier in the semester we learned a test
for normality, the Kolmogorov-Smirnov test.
The following code will allow us to save the extracted residuals in a new dataset and then apply any
analyses from the ones that we have already learned. The command is output and it can be used in other
SAS procedures not just PROC REG.
Title "Saving the residuals";
PROC REG Data=Iris;
Model Width=Length;
Output out=Results r=res;
RUN;
The extra line Output out=Results r=res; creates a new dataset utilizing the output of the process it is
attached to. The name of the dataset is Results and it is given by the command out=Results. Since we
want to keep the residuals in the output dataset we need to give the residuals a name in the output
dataset. The way to signify residuals is simply r and so I gave them the name res with the command r=res.
Let’s go ahead and PROC PRINT the dataset results and see what it looks like:
PROC PRINT DATA = Results;
RUN;
The dataset contains the columns of width and length but also the column of the residuals as a new
variable res. We can now apply all the tools we know and especially the normality test we learned in lab
7 namely:
TITLE "How Normal are the residuals?";
Ods select Histogram ParameterEstimates GoodnessOfFit FitQuantiles Bins;
PROC UNIVARIATE DATA = Results;
HISTOGRAM res/ normal(percents=20 40 60 80 midpercents);
INSET n normal(ksdpval) / pos = ne format =6.3;
RUN;
Since the p value of the KS test is greater than 0.05, we can safely say that our residuals probably follow a
normal distribution and everything is fine.
Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.07706671 Pr > D >0.150
Cramer-von Mises W-Sq 0.03377812 Pr > W-Sq >0.250
Anderson-Darling A-Sq 0.22661831 Pr > A-Sq >0.250
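The whole pipeline — fit, extract residuals, test their normality — can be sketched in Python (our illustration, on synthetic data, using scipy rather than SAS's procedures):

```python
import numpy as np
from scipy import stats

# Synthetic regression data (not the Iris values).
rng = np.random.default_rng(1)
length = rng.uniform(4.0, 6.0, size=100)
width = -0.3 + 0.75 * length + rng.normal(0.0, 0.05, size=100)

fit = stats.linregress(length, width)
res = width - (fit.intercept + fit.slope * length)  # the residuals ("r=res")

# KS test against a normal with the residuals' own mean and sd. Estimating
# the parameters from the data makes this only a rough check, like the lab's.
d, p = stats.kstest(res, "norm", args=(res.mean(), res.std(ddof=1)))
print(f"D = {d:.4f}, p = {p:.3f}")
```

Note that least-squares residuals always average to zero when the model has an intercept, so the interesting question is only their shape.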
For more information on the output functionality for regression in SAS check here:
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect015.htm
Lab # 11: Correlation and Model Fitting
Objectives:
1. Correlations between variables
2. Data Manipulation, creation of squares
3. Model fitting with regression
4. Comparison of models
Correlations between variables
Throughout the semester we have learned various methods of trying to identify relationships between
variables. When they were categorical, our only option was to check the corresponding contingency table
and try the chi-square analysis. When we were trying to compare a numerical variable with a categorical
one, we used the T-Test and ANOVA. In this section we will explore the idea of comparing two numerical
variables whose values are in one-to-one correspondence in our dataset. We will do
that with the use of the correlation coefficient (Pearson) mathematically and a simple scatterplot visually.
We will then use those results to build a model that will predict our output variable using the input
variables as best as possible.
The example we will use today is the dataset Larvae.csv which you can find on our
MOODLE page and contains the measurements of the Headwidth, Length and Mass,
including the gender for the Sirex Noctilio Larvae that we have seen in the past. Our
output variable here will be Mass and we will try to correlate that to Headwidth and
Length respectively. Apparently, Mass is the hardest measurement of the three, since
both Headwidth and Length are measured automatically by a modified photometer-
microscope, so a model to infer Mass based on the other two would be useful.
First let’s import our dataset by using the by now well-known proc import command as
follows:
PROC IMPORT OUT= Larvae
DATAFILE= "put the complete path here…\Larvae.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
The dataset is quite large, so we will modify the proc print procedure a bit to output only the first 10
observations
PROC PRINT DATA = Larvae (obs=10);
RUN;
What we would like to do then is compute the correlation coefficients for the pairs Mass with Length and
Mass with Headwidth respectively:
The procedure for that in SAS is PROC CORR. The following code will compute those correlations and output
them in a nice matrix form
Title "Correlations Mass vs Length and Mass vs Headwidth";
PROC CORR Data=Larvae;
Var Mass;
With Length Headwidth;
RUN;
We declare the output variable by using "VAR Mass" and the variables we want to compare it with (inputs)
by using "With Length Headwidth". After a table with simple statistics for each variable, we get the
Pearson correlations between Mass and the other two in a neat-looking table.
Pearson Correlation Coefficients, N = 4895
Prob > |r| under H0: Rho=0

              Mass (r)   Prob > |r|
Length        0.83863    <.0001
Headwidth     0.80712    <.0001
As we see, both correlation coefficients are quite high (close to 1) and thus we can assume that Mass is
positively correlated with Length and it is also positively correlated with Headwidth.
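For comparison, here is how a single Pearson correlation is computed in Python with scipy (our sketch, on fabricated larva-like data, not the Larvae.csv values):

```python
import numpy as np
from scipy import stats

# Fabricated data: mass grows with length, plus noise.
rng = np.random.default_rng(2)
length = rng.uniform(10.0, 30.0, size=200)
mass = 0.01 * length + rng.normal(0.0, 0.02, size=200)

# r is the Pearson coefficient; p tests H0: Rho = 0, as in the SAS table.
r, p = stats.pearsonr(length, mass)
print(f"r = {r:.3f}, p = {p:.2e}")
```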
Now if one wants to just compute all possible correlations amongst variables, the code is actually just:
Title "Correlations amongst all variables";
PROC CORR Data=Larvae;
RUN;
As you can see, if you don't declare which variable is your input and which is your output, SAS is going to
compute all pairwise correlations.
Pearson Correlation Coefficients, N = 4895
Prob > |r| under H0: Rho=0

             Length     Headwidth   Mass
Length       1.00000    0.85269     0.83863
                        <.0001      <.0001
Headwidth    0.85269    1.00000     0.80712
             <.0001                 <.0001
Mass         0.83863    0.80712     1.00000
             <.0001     <.0001
Armed with the knowledge that Mass is correlated with the other two variables, let’s create the
corresponding scatterplots that will showcase that connection using SGSCATTER. First we will see a
relationship between Mass and Length by using the code which will also output the best fit regression
line:
Title "Scatterplot with regression line Mass vs Length";
PROC SGSCATTER Data=Larvae;
Plot Mass*Length/ Reg= (nogroup degree=1) grid;
RUN;
Then let’s explore the relationship between Mass and Headwidth by the following code
Title "Scatterplot with regression line Mass vs Headwidth";
PROC SGSCATTER Data=Larvae;
Plot Mass*Headwidth/ Reg= (nogroup degree=1) grid;
RUN;
Since we have information about the gender of the larvae, we could reflect that in our analysis by creating
colored scatterplots and different fits. You can do that by extending the codes with the group=Gender
option and removing the "nogroup" option from the regression.
Title "Scatterplot with regression line Mass vs Length by genders";
PROC SGSCATTER Data=Larvae;
Plot Mass*Length/ group=Gender Reg= (degree=1) grid;
RUN;
Similarly for Mass compared to Headwidth:
Title "Scatterplot with regression line Mass vs Headwidth by genders";
PROC SGSCATTER Data=Larvae;
Plot Mass*Headwidth/ group=Gender Reg= (degree=1) grid;
RUN;
Unfortunately, the multitude of red points overshadows the blue, making this harder to read. But still, this
is a much better-looking picture.
Before we move forward, just by looking at the shape of the scatterplots, it is almost obvious that a higher
degree polynomial would fit a little bit better. Let’s try that with a second degree polynomial for both
independent of genders.
Title "Scatterplot with square regression line Mass vs Length";
PROC SGSCATTER Data=Larvae;
Plot Mass*Length/ Reg= (nogroup degree=2) grid;
RUN;
Title "Scatterplot with square regression line Mass vs Headwidth";
PROC SGSCATTER Data=Larvae;
Plot Mass*Headwidth/ Reg= (nogroup degree=2) grid;
RUN;
Besides being IMMENSELY more aesthetically pleasing, this also gives us an idea as to what the right
regression model might be. Namely, perhaps it is the squares of Length and Headwidth that we should be
using to predict the mass, not Length and Headwidth themselves.
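The line-versus-curve comparison behind those degree=2 scatterplots can be sketched numerically in Python with numpy (our illustration; the data below are synthetic, built with a genuine squared term so the quadratic should explain more variance):

```python
import numpy as np

# Synthetic data echoing the curved shape of the Mass scatterplots.
rng = np.random.default_rng(3)
x = rng.uniform(1.0, 5.0, size=200)
y = 0.03 - 0.07 * x + 0.07 * x ** 2 + rng.normal(0.0, 0.05, size=200)

def r_square(x, y, degree):
    coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
    resid = y - np.polyval(coeffs, x)
    return 1.0 - resid.var() / y.var()

r2_line = r_square(x, y, 1)   # degree=1, a straight line
r2_quad = r_square(x, y, 2)   # degree=2, a parabola
print(f"degree 1 R^2 = {r2_line:.4f}, degree 2 R^2 = {r2_quad:.4f}")
```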
Data Manipulations, creation of Squares of columns
In the previous section we noticed that having the square of Length and Headwidth as input might be a
good idea. Thus we will instruct SAS to expand our dataset by adding two new columns, called Length2
and Headwidth2 corresponding to the squares of Length and Headwidth respectively.
Basically we will create a new dataset called Larvae2 extending the old one as follows:
Data Larvae2;
SET Larvae;
Length2= Length*Length;
Headwidth2=Headwidth*Headwidth;
RUN;
Again, we can view the first 10 observations with
PROC PRINT DATA = Larvae2 (obs=10);
RUN;
The new columns of the squares of Length and Headwidth are now added. Using similar methods we can
create any algebraic expression for our variables or even between variables and save them as new
columns.
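The same column manipulation can be sketched in Python with pandas (our analogue of the Data step above, on a few made-up rows rather than the real Larvae.csv):

```python
import pandas as pd

# A few made-up rows standing in for Larvae.csv.
larvae = pd.DataFrame({
    "Length":    [10.0, 12.5, 15.0],
    "Headwidth": [1.2, 1.4, 1.6],
    "Mass":      [0.05, 0.08, 0.12],
})

# "Data Larvae2; SET Larvae; ..." analogue: copy, then add squared columns.
larvae2 = larvae.copy()
larvae2["Length2"] = larvae2["Length"] ** 2
larvae2["Headwidth2"] = larvae2["Headwidth"] ** 2
print(larvae2.head())
```

Any algebraic expression of the existing columns can be added the same way.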
Model fitting with regression
Suppose now that we want to find the best regression model that fits our dataset, and use that to predict
the mass given the other information. We will be using the extended dataset Larvae2. We start with a simple
regression between Mass and Length, followed by one between Mass and Headwidth.
Title "Linear Regression Mass vs Length";
PROC REG Data=Larvae2;
Model Mass=Length;
RUN;
Title "Linear Regression Mass vs Headwidth";
PROC REG Data=Larvae2;
Model Mass=Headwidth;
RUN;
These give us the corresponding regression outputs and plots.
Looking at the R-square, the regression using Length performs better than the one using Headwidth, since
0.7033 is larger than 0.6514 (the closer to 1 the better).
Looking at the MSE, the regression using Length again performs better, since 0.0035 is smaller than
0.0041 (the closer to 0 the better).
Thus if we were to use one linear model it should be the one with Length. By looking at both ANOVAs
we can also say that both Length and Headwidth are connected to Mass, although our previous analysis
with the correlations is a stronger indicator.
Finally the corresponding formulas for the regressions are:
𝑴𝒂𝒔𝒔 = −𝟎.𝟏𝟒𝟔𝟎𝟏+ 𝟎. 𝟏𝟔𝟐𝟓 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺
And:
𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟑𝟖𝟒𝟐+ 𝟏. 𝟒𝟖𝟐𝟓𝟏 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺
Comparison of Models:
The rest of this section is a preview of higher level classes. It also serves to point out that the statistical
analysis that we did in regression translates to other models not just straight lines. We will attempt various
models to predict Mass from the other variables and output a measure to compare them by.
First let us run various models by using different combinations in the line Model Mass= …
a) Mass as a linear function of Length and Headwidth.
This is an example of a multilinear model thus we are aiming for a relationship of the form:
𝑴𝒂𝒔𝒔 = 𝜷𝟎 + 𝜷𝟏 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜷𝟐 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺
The appropriate code is
Title "Linear Regression Mass vs Length and Headwidth";
PROC REG Data=Larvae2;
Model Mass=Length Headwidth;
RUN;
The result is very similar to the one we get from classic regression. The parameter estimates table contains
the coefficients we want, which are read as follows:
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -0.20214             0.00322          -62.74    <.0001
Length       1   0.10679              0.00273          39.07     <.0001
Headwidth    1   0.61936              0.02591          23.90     <.0001
𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟎𝟐𝟏𝟒 + 𝟎. 𝟏𝟎𝟔𝟕𝟗 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎. 𝟔𝟏𝟗𝟑𝟔 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺
What we also need to note here is the R-square value found in the table:
Root MSE 0.05613 R-Square 0.7343
Dependent Mean 0.08785 Adj R-Sq 0.7342
Coeff Var 63.89541
This will be our way of comparing different models. The one with the highest R-square is the one we need.
b) Mass as a function of Length and Length squared.
The formula then is
𝑴𝒂𝒔𝒔 = 𝜷𝟎 + 𝜷𝟏 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜷𝟐 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉𝟐 + 𝜺
The corresponding code is:
Title "Mass vs Length and Length2";
PROC REG Data=Larvae2;
Model Mass=Length Length2;
RUN;
Which yields:
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   0.03243              0.00376          8.61      <.0001
Length       1   -0.07522             0.00453          -16.59    <.0001
Length2      1   0.06857              0.00126          54.34     <.0001
And thus:
𝑴𝒂𝒔𝒔 = 𝟎. 𝟎𝟑𝟐𝟒𝟑 − 𝟎. 𝟎𝟕𝟓𝟐𝟐 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎. 𝟎𝟔𝟖𝟓𝟕 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉𝟐 + 𝜺
The R-square value is then:
Root MSE 0.04684 R-Square 0.8150
Dependent Mean 0.08785 Adj R-Sq 0.8149
Coeff Var 53.32208
Let us compare these two new models and add the linear regressions from before:
Model   Formula                                                              R-square
1       𝑴𝒂𝒔𝒔 = −𝟎.𝟏𝟒𝟔𝟎𝟏 + 𝟎.𝟏𝟔𝟐𝟓 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝜺                                 0.7033
2       𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟑𝟖𝟒𝟐 + 𝟏.𝟒𝟖𝟐𝟓𝟏 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺                             0.6514
3       𝑴𝒂𝒔𝒔 = −𝟎.𝟐𝟎𝟐𝟏𝟒 + 𝟎.𝟏𝟎𝟔𝟕𝟗 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎.𝟔𝟏𝟗𝟑𝟔 ∗ 𝑯𝒆𝒂𝒅𝒘𝒊𝒅𝒕𝒉 + 𝜺        0.7343
4       𝑴𝒂𝒔𝒔 = 𝟎.𝟎𝟑𝟐𝟒𝟑 − 𝟎.𝟎𝟕𝟓𝟐𝟐 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉 + 𝟎.𝟎𝟔𝟖𝟓𝟕 ∗ 𝑳𝒆𝒏𝒈𝒕𝒉𝟐 + 𝜺            0.8150
As we mentioned earlier the one with the highest R-square value is the one we should be focusing on. So
we say that the last model is the one that predicts Mass the best.
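The whole fit-several-models-and-compare-R-squares workflow can be sketched in Python with plain numpy least squares (our illustration; the data are synthetic, built so that the true relationship is quadratic in Length, which is why the Length + Length2 model should come out on top, echoing the lab's table):

```python
import numpy as np

# Synthetic larva-like data; mass is truly quadratic in length.
rng = np.random.default_rng(4)
n = 500
length = rng.uniform(1.0, 5.0, size=n)
headwidth = 0.3 * length + rng.normal(0.0, 0.1, size=n)
mass = 0.03 - 0.07 * length + 0.07 * length ** 2 + rng.normal(0.0, 0.05, size=n)

def r_square(predictors, y):
    # Append an intercept column and solve ordinary least squares.
    A = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

models = {
    "Length":           [length],
    "Headwidth":        [headwidth],
    "Length+Headwidth": [length, headwidth],
    "Length+Length2":   [length, length ** 2],
}
r2 = {name: r_square(X, mass) for name, X in models.items()}
for name, val in r2.items():
    print(f"{name:17s} R^2 = {val:.4f}")
```

As in the lab, the model whose inputs match the true relationship earns the highest R-square.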