course module document · web viewnote that the output is in a non-proportional font and when you...

Information Systems

SAS EMLinear

Regression

Sam M. Walton College of Business – Enterprise Systems and Data Analytics

Information Systems

Linear Regression Example

SAS Enterprise Miner 14.1SAS EM has a single regress node that it uses for both linear and logistic regression. The Regression node properties panel provides entries to select whether the model will be linear or logistic; however, SAS EM will automatically use linear if the target is interval level numeric and logistic if the target is categorical.

The purpose of this document is to demonstrate using SAS EM for estimation (prediction) using linear regression. Tuition data will be used for the illustration. The data has one categorical predictor which is set to 0/1 in one dataset and to public/private in another dataset. Note that SAS EM automatically codes categorical variables and the purpose of the two different datasets is to illustrate how to build models using either numeric coding or categorical names. The two datasets are tables in the public_datasets_dm database on Teradata and are named Tuition and TuitionPub_Priv. This document assumes you have create a diagram and have created the two data sources. Recall that you have to run Project Start code that provides access to Teradata.

Dataset using 0/1 for pub_priv predictor

Drag the data source onto a diagram—right click it and select Edit Variables—notice the predictor variables are sorted in alpha ascending order.

Right-click the pub_priv predictor and then click the Explore button and notice all the values are 0/1.


Information Systems

Close the explore window and change the Level for pub_priv to Binary and set tuition as the target variable in the Role column and then click the OK button.

With the Tuition node selected, review the Property panel to the left; notice the Role in the Train section is set to Raw.

To the right of the Tuition node, add a Partition node and set it to be 60% for Training and 40% for validation and connect the Tuition node to the Partition node. Rename the Partition node to Tution 0/1.

Add a Regression node to the right of the Partition node and connect the two nodes; rename the Regression node to Regression 0/1. Ensure the Regression node is selected and review the Property panel. Review the options for the different sections. For our purposes, only main effects will be considered. Note the Model Selection section has a Selection Model drop down box with which includes Backwards, Forward, Stepwise in addition to the default of None. None will include all the predictors.


Information Systems

In the Output Options section, change all the entries to Yes except the Suppress Output option. Note the Design Matix option needs to be selected for the output to include how SAS EM coded categorical variables.

Right-click and run the Regression Node and review the results. Focus on the Effects Plot, the Fit Statistics and the Output windows. Each of these can be maximized and then restored. Move your mouse over the red column, Effect Number 3, to get the popup box containing the name of the predictor, the sign and the coefficient.

SAS outputs a number of Fit Statistics. It makes sense for prediction purposes to compare those that can be computed for both the training sample and validation sample—such as ASE and RMSE.


Information Systems

The Output window contains all the details and the model equation. Note that the output is in a non-proportional font and when you copy results to your Word document, you will also need to use a non-proportional font such as courier new.

The output starts by providing a variable summary and the names it uses for model building.

Variable Summary Measurement Frequency Role Level Count INPUT BINARY 1INPUT INTERVAL 9TARGET INTERVAL 1

Predicted and decision variables Type Variable Label TARGET tuition tuitionPREDICTED P_tuition Predicted: tuitionRESIDUAL R_tuition Residual: tuition

This is followed by quite a lot of information—note the coding it uses for the pub_priv 0/1 predictor. This is followed by frequency information and descriptive statistics for numeric predictors.

Class Level Information DesignClass Value Variablespub_priv 0 1 1 -1 Frequency Distribution of Input Class Variables Class Value Total pub_priv 0 151 1 330 Descriptive Statistics for Continuous Variables StandardVariable Mean Deviation Minimum Maximum accrate 0.743222 0.154333 0.150000 1.000000alumni 23.004158 12.869375 0 63.000000fac_comp 54477 12654 26500 98900fulltime 81.250624 15.999327 24.590000 99.930000graduat 66.191268 17.671446 15.000000 118.000000num_enrl 836.756757 999.335355 35.000000 6180.000000pct_phd 73.629938 15.958157 10.000000 103.000000pcttop25 56.311850 20.594661 11.000000 100.000000sf_ratio 14.032432 4.044285 2.500000 39.800000


Information Systems

The analysis of variance table is following by model fit statistics with the Adjusted R-Square being .751

Analysis of Variance Sum ofSource DF Squares Mean Square F Value Pr > F Model 10 6146940202 614694020 145.87 <.0001Error 470 1980596295 4214035Corrected Total 480 8127536497 Model Fit Statistics R-Square 0.7563 Adj R-Sq 0.7511AIC 7348.0131 BIC 7350.5269SBC 7393.9477 C(p) 11.0000 Type 3 Analysis of Effects Sum ofEffect DF Squares F Value Pr > F accrate 1 2628890.58 0.62 0.4300alumni 1 87037480.5 20.65 <.0001fac_comp 1 400330332 95.00 <.0001fulltime 1 7587235.41 1.80 0.1803graduat 1 51325076.9 12.18 0.0005num_enrl 1 22305582.4 5.29 0.0218pct_phd 1 66873559.4 15.87 <.0001pcttop25 1 2137240.65 0.51 0.4767pub_priv 1 756151880 179.44 <.0001sf_ratio 1 120190281 28.52 <.0001 The following table provides the coefficients, the Pr > |t| which can be used to determine significant variables and the confidence levels. Note the pub_priv predictor has a zero to its right. Referring back to the coding used, a 0 uses a +1 so this coefficient would be used in the model equation when pub_priv is 0.

Also, note that not all predictors are significant—see highlighted values below. Thus, accurate, fulltime and pcttop25 need to be addressed but will not be done here.

Analysis of Maximum Likelihood Estimates Standard 95% ConfidenceParameter DF Estimate Error t Value Pr > |t| Limits Intercept 1 -1053.1 1233.1 -0.85 0.3935 -3470.0 1363.8accrate 1 604.5 765.3 0.79 0.4300 -895.5 2104.5alumni 1 43.6674 9.6084 4.54 <.0001 24.8352 62.4995fac_comp 1 0.1214 0.0125 9.75 <.0001 0.0970 0.1458fulltime 1 9.0533 6.7471 1.34 0.1803 -4.1707 22.2774graduat 1 24.0941 6.9039 3.49 0.0005 10.5627 37.6255num_enrl 1 -0.3051 0.1326 -2.30 0.0218 -0.5650 -0.0452pct_phd 1 34.9672 8.7777 3.98 <.0001 17.7631 52.1712


Information Systems

pcttop25 1 5.0169 7.0446 0.71 0.4767 -8.7903 18.8241pub_priv 0 1 -2063.8 154.1 -13.40 <.0001 -2365.7 -1761.8sf_ratio 1 -155.2 29.0660 -5.34 <.0001 -212.2 -98.2599

Dataset using public/private for pub_priv predictor

Create a model flow as above but using the dataset that uses the public/private for the pub_priv predictor variable.

Edit the data source and not that the pub_priv predictor is now nominal. Rename nodes to appropriate names. Set the Partition node to 60/40 and change properties appropriately for the Regression node. Run the Regression node.

In the results, note that the Class Level section is missing because there is no coding needed as the pub_priv predictor is nominal.

Compare the fit-statistics—they should match exactly.

The Analysis of Maximum Likelihood Estimates now shows pub_priv with Private to its right. Note that the equation is the same except with 0/1 data, 2063.8 is subtracted when public, and with public/private, it is added.

Analysis of Maximum Likelihood Estimates Standard 95% ConfidenceParameter DF Estimate Error t Value Pr > |t| Limits Intercept 1 -1053.1 1233.1 -0.85 0.3935 -3470.0 1363.8accrate 1 604.5 765.3 0.79 0.4300 -895.5 2104.5alumni 1 43.6674 9.6084 4.54 <.0001 24.8352 62.4995fac_comp 1 0.1214 0.0125 9.75 <.0001 0.0970 0.1458fulltime 1 9.0533 6.7471 1.34 0.1803 -4.1707 22.2774gradut 1 24.0941 6.9039 3.49 0.0005 10.5627 37.6255


Information Systems

num_enrl 1 -0.3051 0.1326 -2.30 0.0218 -0.5650 -0.0452pct_phd 1 34.9672 8.7777 3.98 <.0001 17.7631 52.1712pcttop25 1 5.0169 7.0446 0.71 0.4767 -8.7903 18.8241pub_priv Private 1 2063.8 154.1 13.40 <.0001 1761.8 2365.7sf_ratio 1 -155.2 29.0660 -5.34 <.0001 -212.2 -98.2599

For each node in SAS EM, it is easy to view what is coming into the node and what is being exported from the node—perhaps to the next node. Ensure the Regression node is selected and click on the Exported Data ellipsis button.

The Exported Data window opens and you will need to select the appropriate data to browse or explore. Note the right column which indicates what rows of data are available. In this case, the TRAIN, VALIDATE and ESTIMATE datasets exists. In this example, select the Validate row and click the Browse button to view the VALIDATE data.

A table appears with each record and predictor values plus two additional columns on the right – Predicted tuition and Residual Tuition.


Information Systems

You can use this table to verify you know how to plug values into the equations. For example, try record 4 which is Private and record 6 which is Public.One may wish to plot tuition versus predicted tuition on a scatter plot. To do this, add a Graph node to the right of the Regression node and run. The above table will be available in the Graph node.

Further, you can create this scatter plot.


course module document · web viewnote that the output is in a non-proportional font and when you...

Documents