Variable Selection Techniques Implemented in Procedures of the SAS Software
Olaf Gefeller1 and Rainer Muche2
1Department of Medical Statistics, University of Göttingen, Humboldtallee 32,
W-3400 Göttingen, Germany
2Clinical Documentation, University of Ulm, Schwabstraße 13,
W-7900 Ulm, Germany
ABSTRACT
Different variable selection techniques are often employed in practical data analysis as
part of the statistical modelling process to reduce the number of variables to be included
in a 'final' analysis of the relationship under study. For example, in regression-type applications of these techniques the goal is to obtain a final model fitting the observed data appropriately, which should (i) lead to stable estimates of the model parameters,
(ii) predict accurately future values of the dependent variable based on the knowledge of
all explanatory variables in the model, and (iii) allow a comprehensible interpretation of
the relationship under investigation. The paper briefly describes several popular variable selection techniques in different statistical contexts and examines their implementation in
procedures of the SAS software. Complete coverage of all opportunities to use these techniques within the SAS software, in particular in the SAS/STAT, SAS/ETS, SAS/OR, and SAS/QC components, is given. Selection criteria used by the different procedures, statistical details of the involved algorithms, and syntactic specifications are compared. Finally, a
critical discussion of the present status of implementation of variable selection techniques
in the SAS software from a statistical point of view is provided.
1. INTRODUCTION
In a variety of different contexts empirical studies are often devoted to the analysis of
the relationship between some response variable and a multitude of explanatory variables
potentially influencing the dependent variable. Mostly, information on numerous explanatory variables is gathered, and during the phase of data analysis these are 'statistically screened' to find the most important ones for an appropriate description of the relationship. In this situation variable selection techniques are routinely employed as part of
statistical modelling of the observed data to reduce the number of explanatory variables and to find the best-fitting subsets of variables.
The application of variable selection procedures can be viewed either as part of exploratory
data analysis or as part of the inferential process providing the basis for decision making.
In the first case the aim is to seek out the most important explanatory variables as
components of the final model providing an adequate description of the relationship under
study. No additional statistical conclusions based on the final model are drawn in this
situation. In the second case variable selection procedures are commonly used to yield
a final model as a basis for further statistical inference. Estimation and prediction are
conducted in the recommended model as if this subset of explanatory variables had been
chosen a priori (without reference to the data). Both cases have to be distinguished in
the discussion of advantages and disadvantages of variable selection techniques as the
special framework for the application of these procedures leads to different statistical
problems. In the first case the analysis is of a purely descriptive nature and hence need not account for the multitude of statistical significance tests conducted, while the second
approach needs to pay attention to the inflation of the type I error for the whole selection
procedure.
A problem frequently encountered in practical data analysis arises from multicollinearity, i.e. a nearly exact linear relation between the potential explanatory variables. It may
result in parameter estimates with a high variance which consequently are unreliable and can be far from the true values. Some variable selection procedures fail to recognize important variables or even important factor combinations in the presence of multicollinearity. Variables that seem important when analyzed alone may appear non-influential in the presence of others, and conversely, the effect of some variables is only observed in the presence of others. Therefore, parameter estimates obtained in each model considered during the selection process may not be reliable, either in magnitude or in direction. Usually, in multicollinear data, there is no unique order of importance of the single variables and hence variable selection or elimination based on considering only one variable at a time may be misleading.
In general, variable selection techniques as part of statistical modelling are applied to
obtain an appropriate model fitting the data at hand which should (i) lead to stable
parameter estimates, (ii) predict accurately future values, and (iii) allow a comprehensible
interpretation since only a few important explanatory variables are selected. The choice of a particular variable selection algorithm to be employed in practice usually depends not only on statistical properties but - more importantly - on the availability of the
corresponding software. Since SAS is the leading software package for statistical analyses,
it is important to investigate which types of variable selection techniques are implemented
in different components of the SAS software as this will have a major impact on the current
practice of variable selection in applications. This paper serves as a guide to the different
realizations of variable selection techniques in various procedures.
2. STATISTICAL BACKGROUND
2.1. Stepwise Methods
A group of popular and commonly employed approaches to select variables has been
termed stepwise methods; these are described in detail in most textbooks on regression
methods (e.g. Draper and Smith, 1981). Three main techniques can be identified: forward
selection, backward elimination, and stepwise regression. These procedures add and/or
delete variables according to some specific criterion, for example, a statistical test of the
estimated model parameter corresponding to the variable under study.
The forward selection procedure starts with an 'empty' regression equation and successively adds one variable at a time until all explanatory variables are selected or until a
stopping rule is satisfied. The criterion for variable selection relates to the value of the test
statistic for the single parameter in the regression model including all variables already
entered and the variable under consideration. At any step the variable with the largest
value of the test statistic is added to the current regression equation if its value exceeds
a prespecified critical bound depending on the chosen significance level. Otherwise, the
procedure stops, and the current regression model is referred to as the final model.
In contrast, the backward elimination procedure begins with the full model including all explanatory variables, which are then eliminated one at a time. A variable is excluded from
the regression equation at any step if its test statistic derived from a regression model
consisting of all non-eliminated variables has the smallest value and does not exceed a
specified critical bound. If there is no further variable fulfilling the elimination criterion, the current model is called the final model.
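In the linear regression setting, the test statistic underlying both directions is the partial F statistic. As a sketch in standard notation (assuming n observations and a current model M containing p explanatory variables plus an intercept), the statistic for a candidate variable x_j is

\[
  F_j \;=\; \frac{\mathrm{SSE}(M) \,-\, \mathrm{SSE}(M \cup \{x_j\})}{\mathrm{SSE}(M \cup \{x_j\}) \,/\, (n - p - 2)},
\]

where SSE denotes the residual sum of squares of the respective model. Forward selection enters the candidate with the largest F_j provided this value exceeds the critical value of the F distribution with 1 and n - p - 2 degrees of freedom at the chosen significance level; backward elimination removes the variable with the smallest analogous statistic when it falls below the corresponding bound.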
Stepwise regression is a procedure combining both techniques. It is a forward-oriented approach that partly incorporates the backward idea. The procedure starts like forward selection, including one variable at a time, but after each selection step an additional elimination
step is inserted. Thus, it is possible to exclude variables added previously if these variables
are less influential in the current model. The procedure yields the final model when no
further inclusions or exclusions of explanatory variables are necessary and possible.
The stepwise methods attempt to obtain a final model by employing similar statistical
techniques, but they approach the problem from different directions. Although they select
variables by applying similar criteria, they need not yield identical final models. Illustrative examples provided in the literature show that forward selection and backward elimination can lead to drastically different results (McGee et al., 1984). It is even possible that the first variable entered in forward selection is the first variable deleted in
backward elimination (Gunst and Mason, 1980). Difficulties may arise in model building when a variable selected at the beginning later turns out to be unnecessary or when an effect of a variable is only observed in the presence of others. Obviously, under such circumstances stepwise methods cannot lead to equal and reliable final models (Hocking, 1976). In addition, important models may be overlooked by all stepwise techniques because of the restriction of considering only one explanatory variable at a time (Mantel, 1970).
The dependence of variable selection and elimination on the prespecified significance level leads to another criticism of stepwise methods. Strategies that use a significance level for entering variables ignore the number of multiple comparisons actually made (Harrell et al., 1984). Employing statistical tests with a prespecified significance level protects, at each selection or elimination step, against erroneously including an in fact non-influential
variable, but there is no protection against the overall error of the inclusion of at least
one variable in the final model which actually has no effect on the response variable (see
e.g. Aitkin, 1971).
2.2. Other Methods
Although stepwise methods represent the most popular group of variable selection procedures (at least in regression-type applications), the repertory of computational tools attempting to find the important variables for a final analysis of the observed data is by no means limited to this group. A variety of other variable selection procedures have been proposed in the literature; for an overview of regression-based methods such as best subset selection or all possible regressions see Draper & Smith (1981).
For medical applications Harrell et al. (1984) have recommended incomplete principal components regression to improve the predictive quality of the final model. Principal components regression poses special restrictions on the parameters, leading to a reduction in the number of parameters to be estimated.
Another variable selection technique, the CART method (classification and regression trees), originally introduced by Breiman et al. (1984), has recently gained popularity in data analysis. Implementation of CART leads to the identification of hierarchically ordered homogeneous subgroups defined by important explanatory variables.
The tree structure testing (TST) strategy (Commenges et al., 1989) is a further variable
selection algorithm based on significance tests on groups of explanatory variables ordered
in a tree structure. The procedure resembles Fisher's least significant difference test introduced in the context of the analysis of variance (Fisher, 1935). Its principal advantage lies in the opportunity to control the type I error of the whole selection procedure, i.e. the probability of selecting at least one variable which actually has no influence on the response. Many multiple comparison procedures can be applied to the tree of hypotheses to construct such a strategy maintaining the multiple level of significance; for details see Gefeller & Kron (1992).
In epidemiologic applications another procedure for variable selection, termed the change-in-estimate method, is extensively used for the purpose of reducing the number of variables to be included in the final model (Greenland, 1989). This method is based on the subjective evaluation of changes in the parameter estimate of the explanatory variable of
primary interest from model to model. Thus, this technique is only applicable in special
situations when the interest lies in analyzing the relationship between the response and
one independent variable controlled for the confounding influence of other variables to be
identified by the change-in-estimate method.
3. IMPLEMENTATION IN PROCEDURES OF THE SAS SOFTWARE
This section contains a brief overview of all realizations of variable selection techniques
in procedures of the SAS software. Although we screened all different SAS components
which could, at least theoretically, cover procedures incorporating some form of variable
selection (SAS/STAT, SAS/ETS, SAS/OR, SAS/QC), we found such procedures only in the SAS/STAT component (SAS Institute Inc., 1990). Table 1 presents a list of all
these procedures and further indicates which type of selection technique is offered by the
procedures. Afterwards the different SAS procedures are briefly explained with respect to
their capability of performing automatic variable selection.
Table 1: Procedures of the SAS/STAT software offering some form of variable selection

Name of              Type of variable selection
SAS procedure        stepwise methods   best subset selection   other techniques
--------------------------------------------------------------------------------
PROC REG                    x                    x                      x
PROC LOGISTIC               x                    x                      -
PROC PHREG                  x                    x                      -
PROC STEPDISC               x                    -                      -
PROC VARCLUS                -                    -                      x
PROC REG: a general procedure for linear regression modelling. Several different techniques for automatic variable selection are implemented in PROC REG. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, the selection criterion relates to the p-value of the F-statistic for an explanatory variable which reflects the variable's contribution to the linear regression model. The user can specify bounds for the p-value (SLENTRY, SLSTAY) to control the variable selection process. Best subset selection methods are implemented employing R2, adjusted R2 and Mallows' Cp as measures to judge the model fit, leading to the selection of the best subset of explanatory variables of a given size. Two additional selection procedures (MAXR, MINR) are included which imitate a best subset selection based on R2; however, these procedures are computationally faster at the expense that they may overlook the best subset.
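As an illustration, the following is a minimal sketch of both selection modes in PROC REG; the data set STUDY, the response Y, and the regressors X1-X10 are hypothetical names, and the significance bounds are merely illustrative:

    proc reg data=study;
       /* stepwise selection; SLENTRY/SLSTAY are the bounds for entry and stay */
       model y = x1-x10 / selection=stepwise slentry=0.15 slstay=0.05;
    run;

    proc reg data=study;
       /* best subset selection ranked by Mallows' Cp;
          BEST=5 limits the output to the five best subsets */
       model y = x1-x10 / selection=cp best=5;
    run;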
PROC LOGISTIC: a procedure for nonlinear regression modelling of categorical data utilizing the logit transformation of the response probabilities. Four different techniques for automatic variable selection are implemented in PROC LOGISTIC. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, the selection criterion relates to the p-value of the adjusted Wald chi-square statistic for an explanatory variable which reflects the variable's contribution to the logistic regression model. The user can specify bounds for the p-value (SLENTRY, SLSTAY) to control the variable selection process. In SAS/STAT version 6.07, best subset selection has been implemented based on the likelihood score statistic as a measure to judge the model fit, leading to the selection of the best subset of explanatory variables of a given size. This selection method uses the branch and bound algorithm proposed by Furnival and Wilson (1974) to speed up the computational process.
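A corresponding sketch, again with hypothetical names (binary response OUTCOME, candidates X1-X10) and purely illustrative significance bounds:

    proc logistic data=study;
       /* stepwise selection based on the adjusted Wald chi-square statistic */
       model outcome = x1-x10 / selection=stepwise slentry=0.10 slstay=0.10;
    run;

    proc logistic data=study;
       /* best subset selection via the likelihood score statistic;
          BEST=3 requests the three best models of each subset size */
       model outcome = x1-x10 / selection=score best=3;
    run;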
PROC PHREG: a procedure for analyzing censored failure times according to the semiparametric proportional hazards regression model proposed by Cox (1972). This procedure has been included in the SAS/STAT software in version 6.07. Four different techniques for automatic variable selection are implemented in PROC PHREG. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, the selection criterion relates to the p-value of the adjusted Wald chi-square statistic for an explanatory variable which reflects the variable's contribution to the proportional hazards regression model. The user can specify bounds for the p-value (SLENTRY, SLSTAY) to control the variable selection process. Best subset selection is implemented based on the likelihood score statistic as a measure to judge the model fit, leading to the selection of the best subset of explanatory variables of a given size. This selection method uses the branch and bound algorithm proposed by Furnival and Wilson (1974) to speed up the computational process.
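The syntax mirrors that of PROC LOGISTIC; a sketch with hypothetical variables (survival time TIME, censoring indicator STATUS, where the value 0 marks censored observations):

    proc phreg data=study;
       /* stepwise selection based on the adjusted Wald chi-square statistic */
       model time*status(0) = x1-x10 / selection=stepwise slentry=0.25 slstay=0.15;
    run;

    proc phreg data=study;
       /* best subset selection via the likelihood score statistic */
       model time*status(0) = x1-x10 / selection=score best=3;
    run;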
PROC STEPDISC: a procedure specifically designed to perform stepwise methods in the context of discriminant analysis. The procedure selects variables to produce a discrimination model that can be useful for discriminating between several classes. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, two selection criteria are offered. One relates to the p-value of the F-statistic for an explanatory variable in an ANCOVA model taking all variables already chosen as covariates and the specific variable under consideration as the dependent variable in the model. The other criterion uses the partial correlation coefficient for predicting the variable under study. The user can specify bounds for the p-value (SLENTRY, SLSTAY) and for the partial correlation coefficient (PR2ENTRY, PR2STAY) to control the variable selection process.
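A sketch with hypothetical names (class variable GROUP defining the categories to be discriminated, candidates X1-X10); the bounds are illustrative and could equally be given via PR2ENTRY/PR2STAY:

    proc stepdisc data=study method=stepwise slentry=0.15 slstay=0.15;
       class group;   /* grouping variable defining the classes */
       var x1-x10;    /* candidate discriminating variables */
    run;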
PROC VARCLUS: a procedure performing cluster analysis on sets of variables. The clusters are chosen to maximize the variation accounted for by either the first principal component or the centroid component of each cluster. Thus, the procedure can be used to reduce the number of variables in an analysis employing selection criteria concerning either the first principal components or the centroid components of the clusters. The user can specify the percentage of variation that must be explained by the cluster component or the largest permissible value of the second eigenvalue in the cluster components to control the variable selection process. However, it is evident that variable selection in cluster analysis is based on a completely different statistical background compared to regression-based applications.
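A sketch using the second-eigenvalue criterion (data set and variable names are again hypothetical, and the threshold 0.7 is merely illustrative):

    proc varclus data=study maxeigen=0.7;
       /* MAXEIGEN= sets the largest permissible second eigenvalue per cluster;
          alternatively PERCENT= fixes the proportion of variation to be
          explained, and the CENTROID option switches to centroid components */
       var x1-x10;
    run;

One variable per resulting cluster can then typically be retained as a representative for subsequent modelling.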
4. DISCUSSION
The practice of statistical modelling of observed data depends critically on the availability
of easy-to-comprehend software covering the necessary computational tasks. In particular, the routine application of variable selection methods, which usually involve rather complex computational operations, is only feasible if appropriate software assists the user during this phase of data analysis. This paper has documented the current situation of implementation of variable selection techniques in the SAS software, the major international software package for statistical data analysis. The synopsis of all implemented selection algorithms has revealed that the current focus of the SAS software lies in stepwise methods and best subset selection. Other important techniques like, for example,
the CART method or the TST strategy, have been completely neglected, although these
procedures are popular tools in a variety of applications. Future developments in the SAS
software should address this deficit and close the gap between users' demands for a broader range of variable selection techniques and the current reality of the SAS software.
Constructing an appropriate variable selection algorithm applicable to all situations of variable selection in data analysis constitutes a challenging statistical problem for which an accepted standard solution is not apparent. The different
strategies exhibit positive as well as negative properties depending on the demands raised
under specific circumstances. Consequently, no variable selection technique can be recommended for general use in all applications. The specific requirements of the application
have to be considered and the decision to use a particular procedure has to be based on
these considerations. The multiple comparison problem in variable selection algorithms,
i.e. the proper statistical control of the overall error of the selection procedure, has to be taken into account whenever the final model forms the basis of further statistical inference. The common practice of ignoring the impact of variable selection procedures on the statistical error rates needs correction. Finally, it should be recognized in data analysis that all variable selection techniques must be used extremely carefully, especially in the situation of multicollinearity frequently encountered in practical applications, as none of the methods can be guaranteed to yield satisfactory results.
REFERENCES
Aitkin, M.A. (1971). Statistical theory (behavioural science application). Ann. Rev. Psychol. 22, 225-250.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and regression trees. Wadsworth, Belmont, CA.
Commenges, D., Dartigues, J.F., Peytour, P., Puymirat, E., Henry, P., Gagnon, M. (1989).
A strategy for analysing multiple risk factors with application to cervical pain syndrome.
Meth. Inform. Medicine 28, 14-19.
Cox, D.R. (1972). Regression models and life tables (with discussion). J. Royal Stat. Soc.
B. 34, 187-220.
Draper, N.R., Smith, H. (1981). Applied regression analysis (second edition). John Wiley,
New York.
Fisher, R.A. (1935). The design of experiments. Oliver & Boyd Ltd., Edinburgh.
Furnival, G.M., Wilson, R.W. (1974). Regression by leaps and bounds. Technometrics 16,
499-511.
Gefeller, 0., Kron, M. (1992). Controlling the multiple level of significance in variable
selecting algorithms: an improved version of the TST strategy. In MEDICOMP '92 -
Application of computational and cybernetic methods in medicine and biology, eds. T. Asztalos, J. Eller and I. Gyori, pp. 101-108. SZOTE, Szeged, Hungary.
Greenland, S. (1989). Modelling and variable selection in epidemiologic analysis. Am. J. Public Health 79, 340-349.
Gunst, R.F., Mason, R.L. (1980). Regression analysis and its application. Marcel Dekker,
New York.
Harrell, F.E., Lee, K.L., Califf, R.M., Pryor, D.B., Rosati, R.A. (1984). Regression modelling strategies for improved prognostic prediction. Statistics in Medicine 3, 143-152.
Hocking, R.R. (1976). The analysis and selection of variables in linear regression. Technometrics 18, 425-438.
Mantel, N. (1970). Why stepdown procedures in variable selection. Technometrics 12,
621-625.
McGee, D., Reed, D., Yang, K. (1984). The results of logistic analyses when the variables
are highly correlated: An empirical example using diet and CHD incidence. J. Chron. Dis.
37, 713-719.
SAS Institute Inc. (1990). SAS/STAT User's Guide, 4th Edition. SAS Institute Inc., Cary,
NC.
Address for correspondence:
Dr. Olaf Gefeller
Abteilung Medizinische Statistik
Georg-August-Universität Göttingen
Humboldtallee 32
W-3400 Göttingen
Germany
SAS, SAS/STAT, SAS/ETS, SAS/OR, and SAS/QC are registered trademarks of
SAS Institute Inc., Cary, NC, USA.