SAS Procedures Anil Kumar

Upload: sarathannapareddy

Post on 13-Nov-2014

DESCRIPTION

PROC TABULATE, PROC GPLOT, PROC GLIMMIX, PROC REG, PROC ANOVA, PROC MIXED, PROC CATMOD, PROC GENMOD

TRANSCRIPT

Page 1: SAS Procedures

SAS Procedures

Anil Kumar

Page 2: SAS Procedures

PROC TABULATE

• Summarizes data in a well-organized table

• Syntax:

PROC TABULATE DATA=dataname;

CLASS class-variables;

VAR analysis-variables;

TABLE page-expression, row-expression, column-expression / options;

RUN;

Page 3: SAS Procedures

PROC tabulate – example (1)

proc tabulate data=sashelp.class;
  class sex;
  var height weight;
  table sex, height weight;
run;

Result:

Page 4: SAS Procedures

PROC tabulate – example (2)

proc tabulate data=sashelp.class;
  class sex;
  var height weight age;
  table sex all, (age height weight)*(std mean sum);
run;

Result:

Page 5: SAS Procedures

Gplot – A simple example

• The SAS/GRAPH module features the flexible PROC GPLOT

• A simple example:

proc gplot data=sashelp.class;
  symbol i=none v=star;
  plot height*weight;
run;
quit;

Page 6: SAS Procedures

Resulting graph

Page 7: SAS Procedures

Gplot – further example

• The following example shows more of the procedure's flexibility

goptions reset=all;
proc gplot data=sashelp.class;
  symbol1 color=green i=join v=diamond line=1 w=2 h=2;
  symbol2 color=red i=join v=star line=2 w=2 h=2;
  plot height*weight=sex / hminor=0 legend=legend1;
  legend1 down=1 position=(top center inside) cshadow=blue frame
          value=(f=duplex) across=1 label=(font=duplex h=1.5);
  title f=zapf color=blue h=5pct 'Testing the graph';
run;
quit;

Page 8: SAS Procedures

General Outline of Model Choices

Model     Output Variable                        Types of Inputs                            Assumptions
--------  -------------------------------------  -----------------------------------------  ------------------
ANOVA     Interval                               Categorical, Fixed Effects Only            Normality
REG       Interval                               Interval, Fixed Effects Only               Normality
LOGISTIC  Binary                                 Categorical, Interval, Fixed Effects Only  Log-Normal
GLM       Interval                               Categorical, Interval, Fixed Effects Only  Normality
GENMOD    Categorical, Interval                  Categorical, Interval, Fixed Effects Only  Exponential Family
MIXED     Interval                               Categorical, Interval, Random Effects      Normality
GLIMMIX   Categorical, Interval, Random Effects  Categorical, Interval, Random Effects      Exponential Family

Source: Cerrito (2005)

Page 9: SAS Procedures

PROC REG

• Inputs and output are interval

• Ordinal data may be included

• Assumptions on the error term ε:
  – Is normally distributed
  – Has mean zero and constant variance
  – Is independent

• Residual analysis should be a routine part of the analysis

Page 10: SAS Procedures

Residuals

• The studentized residual, the RSTUDENT statistic, is similar to the standardized residual except that the mean square error is calculated omitting the observation.

• Observations whose studentized residuals exceed 2 in absolute value are potential outliers.
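
The regression code from the original slides is not part of this transcript, so the following is only a minimal sketch of the residual check described above. It assumes the sashelp.class data set used earlier in the deck; the output data set name work.reg_out and the variable names pred, resid, and rstud are illustrative choices.

```sas
/* Fit a simple linear regression and save residual diagnostics.      */
/* work.reg_out, pred, resid, and rstud are illustrative names that   */
/* do not come from the original presentation.                        */
proc reg data=sashelp.class;
  model weight = height;
  output out=work.reg_out p=pred r=resid rstudent=rstud;
run;
quit;

/* List observations whose |RSTUDENT| exceeds 2 (potential outliers) */
proc print data=work.reg_out;
  where abs(rstud) > 2;
  var name height weight pred resid rstud;
run;
```

Flagged observations are candidates for closer inspection, not automatic deletion; the cutoff of 2 is the rule of thumb stated on the slide.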

Page 11: SAS Procedures

Regression Example

Page 12: SAS Procedures

Output

Page 13: SAS Procedures

Scatterplot With Regression Line

Page 14: SAS Procedures

Residuals

Page 15: SAS Procedures
Page 16: SAS Procedures

PROC ANOVA

• Each treatment should have exactly the same number of observations; that is, every level of the categorical input has the same number of observations (balanced data).

• Caution: If you use PROC ANOVA for analysis of unbalanced data, you must assume responsibility for the validity of the results.

• Use PROC GLM instead.
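
A minimal sketch of that advice, assuming sashelp.class as the example data: PROC FREQ checks whether the groups are the same size, and PROC GLM fits the one-way model without requiring balance. The choice of height as the response and sex as the treatment is illustrative only.

```sas
/* Check whether the design is balanced: compare the group counts */
proc freq data=sashelp.class;
  tables sex;
run;

/* One-way analysis of variance with PROC GLM, which does not     */
/* require equal group sizes                                      */
proc glm data=sashelp.class;
  class sex;
  model height = sex;
  means sex;
run;
quit;
```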

Page 17: SAS Procedures

Categorical Procedures

Model     Output Variable                        Types of Inputs                            Assumptions
--------  -------------------------------------  -----------------------------------------  ------------------
LOGISTIC  Binary                                 Categorical, Interval, Fixed Effects Only  Log-Normal
CATMOD    Analyzes data that can be represented by a two-dimensional contingency table.
          Input can be raw data, cell counts, or direct input of a covariance matrix.
GENMOD    Categorical, Interval                  Categorical, Interval, Fixed Effects Only  Exponential Family
GLIMMIX   Categorical, Interval, Random Effects  Categorical, Interval, Random Effects      Exponential Family

Page 18: SAS Procedures

PROC CATMOD

• PROC CATMOD provides a wide variety of categorical data analyses.

• Now that PROC LOGISTIC handles classification variables, there is less need to use PROC CATMOD for regression.

• PROC CATMOD should not be used when a continuous input variable has many distinct values.

Page 19: SAS Procedures
Page 20: SAS Procedures

Output

Page 21: SAS Procedures
Page 22: SAS Procedures

Logistic Regression

• Binary outcomes

• Allows for any combination of nominal, ordinal, or continuous explanatory variables

• Computes predicted values, the receiver operating characteristic (ROC) curve and an approximation to the area beneath the curve (c), and a number of regression diagnostics

• If the occurrence is rare, use the Poisson distribution in PROC GENMOD.
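
The features listed above can be sketched with PROC LOGISTIC as follows. This is an illustration under assumptions, not the presenter's code: it treats sex in sashelp.class as the binary outcome, and the data set names work.roc and work.pred and the variable name phat are made up for the example.

```sas
/* Binary logistic regression: model the probability that sex = 'F' */
proc logistic data=sashelp.class;
  model sex(event='F') = height weight / outroc=work.roc;  /* save ROC curve points  */
  output out=work.pred p=phat;                             /* predicted probabilities */
run;
```

The c statistic (the approximate area under the ROC curve) is reported in the default output table "Association of Predicted Probabilities and Observed Responses". Nominal inputs would be named in a CLASS statement before the MODEL statement.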

Page 23: SAS Procedures

Generalized Linear Models

In generalized linear models the response is assumed to possess a probability distribution of exponential form. That is, the probability density of the response Y for continuous response variables, or the probability function for discrete responses, can be expressed as

$$f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$

for some functions a, b, and c that determine the specific distribution (omitting some requirements for these functions), where θ is the natural parameter and φ is the dispersion parameter. Expressions for the mean and variance are

$$E(Y) = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\, a(\phi)$$

Note that the exponential family (or form) of distributions constitutes a broad class of probability density functions. Don’t confuse this broad family with the exponential pdf.

Page 24: SAS Procedures

Distributions and Associated Default Link Functions Available in PROC GENMOD
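
The table from this slide is not included in the transcript. As a hedged illustration of how a distribution and link function are chosen in PROC GENMOD, the sketch below fits a gamma model to sashelp.class with the DIST= and LINK= options of the MODEL statement. The choice of variables is an assumption made only for the example.

```sas
/* Generalized linear model with a gamma response and a log link.   */
/* The default link for the gamma distribution in PROC GENMOD is    */
/* the reciprocal; LINK=LOG overrides that default.                 */
proc genmod data=sashelp.class;
  class sex;
  model weight = height sex / dist=gamma link=log;
run;
```

Omitting LINK= makes the procedure use the default link associated with the distribution named in DIST=.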

Page 25: SAS Procedures
Page 26: SAS Procedures
Page 27: SAS Procedures
Page 28: SAS Procedures

Interval (Quantitative) Procedures

Model     Output Variable                        Types of Inputs                            Assumptions
--------  -------------------------------------  -----------------------------------------  ------------------
ANOVA     Interval                               Categorical, Fixed Effects Only            Normality
REG       Interval                               Interval, Fixed Effects Only               Normality
GLM       Interval                               Categorical, Interval, Fixed Effects Only  Normality
GENMOD    Categorical, Interval                  Categorical, Interval, Fixed Effects Only  Exponential Family
MIXED     Interval                               Categorical, Interval, Random Effects      Normality
GLIMMIX   Categorical, Interval, Random Effects  Categorical, Interval, Random Effects      Exponential Family

Page 29: SAS Procedures

Assessing Goodness of Fit: Akaike’s Information Criterion (AIC)

• An information criterion uses the covariance matrix and the number of parameters in a model to calculate a statistic that summarizes the information represented by the model, balancing a lack-of-fit term against a penalty term.

• SAS calculates Akaike’s Information Criterion (AIC) for each of the 2^p possible models for p ≤ 10 independent variables.

• AIC estimates a measure of the difference between a given model and the “true” model. The model with the smallest AIC among all competing models is deemed the best model.

• Beal’s example provides SAS code that can be used to simultaneously evaluate up to 1024 (= 2^10) models to determine the subset of variables that minimizes the information criterion among all possible subsets.
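
Beal's code itself is not reproduced in this transcript. The following is a small sketch of the same idea, assuming PROC REG's all-subsets selection with the AIC option and an OUTEST= data set; sashelp.class, the two candidate regressors, and the name work.est are illustrative assumptions.

```sas
/* Evaluate every subset of the candidate regressors and record the */
/* AIC and SSE of each fitted model in work.est (illustrative name) */
proc reg data=sashelp.class outest=work.est;
  model weight = height age / selection=rsquare aic sse;
run;
quit;

/* Sort so that the model with the smallest AIC comes first */
proc sort data=work.est;
  by _aic_;
run;

proc print data=work.est;
run;
```

With p candidate regressors the number of subsets grows as 2^p, which is why the approach is limited to a small number of variables.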

Page 30: SAS Procedures

Minimum AIC

• The AIC statistic is widely used to select the best model among alternative parametric models.

• AIC = −2(maximum log-likelihood) + 2(number of free parameters)

• The absolute value of AIC is not meaningful; only differences between competing models are.

• The difference between two AIC values is considered insignificant if it is far less than 1.
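
The formula above can be applied by hand in a DATA step. The log-likelihoods and parameter counts below are made-up numbers used only to show the arithmetic; they do not come from the presentation.

```sas
/* Compute AIC for two hypothetical models and compare them.        */
/* All numeric inputs are invented for illustration.                */
data _null_;
  loglik_a = -120.5;  k_a = 3;   /* model A: max log-likelihood, free parameters */
  loglik_b = -119.8;  k_b = 4;   /* model B */
  aic_a = -2*loglik_a + 2*k_a;
  aic_b = -2*loglik_b + 2*k_b;
  diff  = aic_b - aic_a;         /* positive value favors model A */
  put aic_a= aic_b= diff=;
run;
```

The values are written to the SAS log. Only the difference is interpreted, in line with the bullets above.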

Page 31: SAS Procedures

Beal’s Simulation

• Implements five common statistical techniques to determine the best linear model
  – Minimizing the RMSE
  – Maximizing R²
  – Forward selection
  – Backward elimination
  – Stepwise regression

• The RMSE is a function of the sum of squared errors (SSE), the number of observations n, and the number of parameters p:

RMSE = sqrt(SSE/(n − p))
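
The three stepwise-type techniques named above are available through the SELECTION= option of the MODEL statement in PROC REG. This sketch is not Beal's simulation code; the data set, the regressors, and the model labels fwd, bwd, and step are assumptions chosen for illustration.

```sas
/* Run forward, backward, and stepwise selection on the same        */
/* candidate regressors; each MODEL statement carries its own label */
proc reg data=sashelp.class;
  fwd:  model weight = height age / selection=forward;
  bwd:  model weight = height age / selection=backward;
  step: model weight = height age / selection=stepwise;
run;
quit;
```

PROC REG reports the RMSE of each fitted model as "Root MSE" in the fit statistics.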

Page 32: SAS Procedures

Generate the Data

Page 33: SAS Procedures

Partial Code for Regressions

Page 34: SAS Procedures

Simulation Results: n=1000

Page 35: SAS Procedures

Simulation Result: n=10000

Page 36: SAS Procedures

AIC Selected Coefficients for Five Runs

Page 37: SAS Procedures

Generalized Linear Mixed Models

Page 38: SAS Procedures

PROC MIXED

• The mixed model generalizes the standard linear model: y = Xβ + Zγ + ε

• β is an unknown vector of fixed-effects parameters with known design matrix X, γ is an unknown vector of random-effects parameters with known design matrix Z, and ε is an unknown random error vector whose elements are no longer required to be independent and homogeneous.

• PROC MIXED is a generalization of the GLM procedure in the sense that PROC GLM fits standard linear models, and PROC MIXED fits the wider class of mixed linear models.

• Both procedures have similar CLASS, MODEL, CONTRAST, ESTIMATE, and LSMEANS statements.

• But their RANDOM and REPEATED statements differ.

Page 39: SAS Procedures

RANDOM and REPEATED Statements in PROC GLM and PROC MIXED

• The RANDOM statement in PROC MIXED incorporates random effects constituting the γ vector in the mixed model.

• However, in PROC GLM, effects specified in the RANDOM statement are still treated as fixed as far as the model fit is concerned, and they serve only to produce corresponding expected mean squares.

• The REPEATED statement in PROC MIXED is used to specify covariance structures for repeated measurements on subjects.

• The REPEATED statement in PROC GLM is used to specify various transformations with which to conduct the traditional univariate or multivariate tests.

• In repeated measures situations, the mixed model approach used in PROC MIXED is more flexible and more widely applicable than either the univariate or multivariate approaches.
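
The two PROC MIXED statements can be sketched as follows. Because the presentation's data are not available, the example first simulates a small repeated-measures data set; every name and numeric value in work.growth is hypothetical, and the compound-symmetry structure (TYPE=CS) is just one possible choice.

```sas
/* Simulate a small repeated-measures data set (hypothetical values) */
data work.growth;
  length group $1;
  call streaminit(2005);
  do subject = 1 to 20;
    group = ifc(mod(subject, 2) = 0, 'A', 'B');
    u = rand('normal', 0, 2);              /* subject-level random intercept */
    do time = 1 to 4;
      y = 20 + 1.5*time + u + rand('normal', 0, 1);
      output;
    end;
  end;
  drop u;
run;

/* RANDOM: a random intercept for each subject (the gamma vector) */
proc mixed data=work.growth;
  class subject group;
  model y = group time / solution;
  random intercept / subject=subject;
run;

/* REPEATED: a covariance structure for the within-subject errors */
proc mixed data=work.growth;
  class subject group;
  model y = group time / solution;
  repeated / subject=subject type=cs;
run;
```

The fixed-effects part of the model is the same in both calls; only the way the within-subject correlation is specified differs.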

Page 40: SAS Procedures

PROC GLIMMIX

• The GLIMMIX procedure fits statistical models to data with correlations or nonconstant variability and where the response is not necessarily normally distributed.

• These models are known as generalized linear mixed models (GLMM).

• November 2005: Production level version can now be downloaded from http://support.sas.com/rnd/app/da/glimmix.html

Page 41: SAS Procedures

PROC GLIMMIX (continued)

• The GLMMs, like linear mixed models, assume normal (Gaussian) random effects.

• Conditional on these random effects, data can have any distribution in the exponential family.

• The binary, binomial, Poisson, and negative binomial distributions, for example, are discrete members of this family.

• The normal, beta, gamma, and chi-square distributions are representatives of the continuous distributions in this family.
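
A minimal sketch of a generalized linear mixed model in PROC GLIMMIX, assuming a binary response with a normally distributed random intercept per cluster. The data set work.trial, its variables, and all numeric values are simulated and hypothetical.

```sas
/* Simulate clustered binary data (hypothetical values) */
data work.trial;
  call streaminit(2005);
  do clinic = 1 to 15;
    u = rand('normal', 0, 0.8);            /* clinic-level random effect */
    do patient = 1 to 30;
      treat = mod(patient, 2);
      eta = -0.5 + 0.7*treat + u;          /* linear predictor on the logit scale */
      p = 1 / (1 + exp(-eta));
      response = rand('bernoulli', p);
      output;
    end;
  end;
  keep clinic treat response;
run;

/* Binary response, logit link, normal random intercept per clinic */
proc glimmix data=work.trial;
  class clinic;
  model response(event='1') = treat / dist=binary link=logit solution;
  random intercept / subject=clinic;
run;
```

Replacing DIST=BINARY with another member of the exponential family, for example DIST=POISSON for counts, changes the conditional distribution while the random effects remain normal.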

Page 42: SAS Procedures

Summary

• Know what your assumptions are and check them.

• Theory, methods and techniques evolve.

• Consider using
  – PROC GLIMMIX
  – Enterprise Guide

• Fit the model to the data!

Page 43: SAS Procedures

References

• Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle,” in Petrov and Csaki, eds., Proceedings of the Second International Symposium on Information Theory, 267-281.

• Beal, Dennis J. (2005), “SAS Code to Select the Best Multiple Linear Regression Model for Multivariate Data Using Information Criteria,” Proceedings, Southeast SAS Users Group Conference.

• Bickel, Peter J. and Doksum, Kjell A. (2001), Mathematical Statistics, Prentice-Hall, Inc., Upper Saddle River, NJ.

• Cerrito, Patricia B. (2005), “From GLM to GLIMMIX-Which Model to Choose?” Workshop, Southeast SAS Users Group Conference.

• Long, J. Scott (1997), Regression Models for Categorical and Limited Dependent Variables, Thousand Oaks, CA: Sage Publications, Inc.

• McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, Second Edition, London: Chapman and Hall.

• Seber, G.A.F. (1984), Multivariate Observations, John Wiley & Sons, New York.

• Stokes, M.E., Davis, C.S., and Koch, G.G. (2000), Categorical Data Analysis Using the SAS System, Second Edition, Cary, NC: SAS Institute Inc.

• SAS Online Documentation, http://www.sas.com

• GLIMMIX Procedure Documentation, “The GLIMMIX Procedure, Nov. 2005”, SAS Institute.

Page 44: SAS Procedures

Note: EG project files, programs and other SAS source used in the original presentation are available by request, but they are not contained in this online version - kmg