chapter 4 variable selectionrt/cs5710/f07/notes/chap4... · 2007-11-06 · 4.1 variable selection...

Chapter 4 Variable Selection

4.1 Variable Selection and Enterprise Miner ......................................................................4-3

4-2 Chapter 4 Variable Selection

4.1 Variable Selection and Enterprise Miner 4-3

4.1 Variable Selection and Enterprise Miner

2

ObjectivesDiscuss the need for variable selection.Explain the methods of variable selection available in Enterprise Miner.Demonstrate the use of different variable selection methods.

3

The Curse of Dimensionality

1–D

2–D

3–D

Recall that as the number of input variables to a model increases there is an exponential increase in the data required to densely populate the model space. If the modeling space becomes too sparsely populated, the ability to fit a model to noisy (real) data is hampered. In many cases although the number of input variables is large, some variables may be redundant and others irrelevant.

The difficulty is in determining which variables can be disregarded in modeling without leaving behind important information.


4

Methods of Variable SelectionStepwise RegressionDecision TreesVariable Selection Node

Three possible methods of variable selection available in Enterprise Miner are • stepwise regression • decision trees • the variable selection node.

Another possible method for dimension reduction is principal components analysis. Principal components are uncorrelated linear combinations of the original input variables; they depend on the covariance matrix or the correlation matrix of the original input variables.

5

Stepwise RegressionUses multiple regression p-values to eliminate variables.May not perform well with many potential input variables.

As you saw earlier, stepwise regression methods can be used for variable selection. However, this method was not designed for use in evaluating data sets with dozens (or hundreds) of potential input variables and may not perform well under such conditions.


6

Decision TreesGrow a large tree.Retain only the variables important in growing the tree for further modeling.

In the tree node, a measure of variable importance is calculated. The measure incorporates primary splits and any saved surrogate splits in the calculation. The importance measures are scaled between 0 and 1, with larger values indicating greater importance. Variables that do not appear in any primary or saved surrogate splits have zero importance.

By default, variables with importance less than 0.05 are given a model role of rejected, and this status is automatically transferred to subsequent nodes in the flow.

No particular tree-growing or pruning options are preferable for variable selection. However, in pruning, it is usually better to err on the side of complexity as opposed to parsimony. Severe pruning often results in too few variables being selected. When presented with a range of trees of similar performance, selecting a bushier tree is often more useful.


7

Variable Selection NodeSelection based on one of two criteria:

R-square chi-square – for binary targets only.

The variable selection node provides selection based on one of two criteria. By default, the node removes variables unrelated to the target.

When you use the R-square variable selection criterion, a three-step process is followed:

1. Enterprise Miner computes the squared correlation for each variable and then assigns the rejected role to those variables that have a value less than the squared correlation criterion (default 0.005).

2. Enterprise Miner evaluates the remaining (not rejected) variables using a forward stepwise R2 regression. Variables that have a stepwise R2 improvement less than the threshold criterion (default 0.0005) are assigned the rejected role.

3. For binary targets, Enterprise Miner performs a final logistic regression using the predicted values that are output from the forward stepwise regression as the only input variable.

If the target is nonbinary, only the first two steps are performed.

When you use the chi-square selection criterion, variable selection is performed using binary splits for maximizing the chi-square value of a 2x2 frequency table. Each level of the ordinal or nominal variables is decomposed into binary dummy variables. The range of each interval variable is divided into a number of categories for splits. These bins are equally sized intervals and, by default, interval inputs are binned into 50 levels.


Variable Selection with Decision Trees and the Variable Selection Node

Return to the nonprofit diagram created in Chapter 3, “Predictive Modeling Using Regression.” Add a Tree node after the Data Partition node, and add a Variable Selection node after the Transform Variables node. Your workspace should appear as shown below:


Variable Selection Using a Decision Tree

1. Open the new Tree node.

2. Select the Basic tab.

3. Select Gini reduction for the splitting criterion.

4. Change the maximum depth of the tree to 8 to allow a larger tree to grow.

5. Select the Advanced tab and change the model assessment measure to Total Leaf Impurity (Gini index).

These selections for splitting criterion and model assessment measure are chosen because they tend to result in larger (bushier) trees, which is desirable for variable selection.


6. Select the Score tab and then the Variables subtab. This tab controls in part how the Tree node will modify the data set when the node is run.

7. Close the tree node, saving changes when prompted. It is unnecessary to specify a different name for the model, but you may do so if desired when prompted. Select OK to exit.

8. Run the flow from the Tree node and select Yes to view the results when prompted.

The 38-leaf tree has the lowest total impurity on the validation data set, so it is selected by default.


9. Select the Score tab and then select the Variable Selection subtab. This subtab shows you which variables have been retained and which have been rejected.

The chosen tree retains 15 of the 18 variables for analysis. You could add a Regression node or Neural Network node to the flow following this Tree node. However, because no data replacement or variable transformations have been performed, you should consider doing these things first (for the input variables identified by the variable selection). In general, a tree with more leaves retains a greater number of variables, whereas a tree with fewer leaves retains a smaller number of variables.

10. Close the tree results.


Variable Selection Using the Variable Selection Node

1. Open the Variable Selection node.

2. Select the Manual Selection tab. This tab enables you to force variables to be included or excluded from future analyses. By default, the role assignment is automatic, which means that the role is set based on the analysis performed in this node.


3. Select the Target Associations tab. This tab enables you to choose one of two selection criteria and specify options for the chosen criterion. By default, the node removes variables unrelated to the target (according to the settings used for the selection criterion) and scores the data sets.

Selection Using the R-square Criterion

1. Consider the settings associated with the default R-square criterion first. Because R-square is already selected as the selection criterion, click on the Settings… button on the Target Association tab.

Recall that a three-step process is performed when you apply the R2 variable selection criterion to a binary target. The Setting window allows you to specify the cutoff squared correlation measure and the necessary R2 improvement for a variable to remain as an input variable.

Additional available options enable you to

• test 2-way interactions. When this option is selected, Enterprise Miner evaluates 2-way interactions for categorical inputs. If this option is not selected, Enterprise Miner still evaluates the 2-way interactions, but it does not allow them to be passed to successor nodes as input variables.

• bin interval variables in up to 16 bins. When selected, this option requests Enterprise Miner to bin interval variables into 16 equally spaced groups (AOV16) . The AOV16 variables are created to help identify nonlinear relationships with the target. Bins with zero observations are eliminated, meaning an AOV16 variable can have fewer than 16 bins


• use only grouped class variables. When this option is selected, Enterprise Miner uses only the grouped class variable to evaluate variable importance. If this option is not selected, Enterprise Miner uses the grouped class variable as well as the original class variable in evaluating variable importance. This may greatly increase processing time.

2. Leave the default settings and close the node.

3. Run the flow from the Variable Selection node and view the results.

4. Click on the Role column heading to sort the variable by their assigned roles. Then click on the Rejection Reason column heading. Inspect the results.

LASTT, CARD8EF, and G_AVGG_NDA are retained. If you scroll to the right to look at the Label column, you see that CARD8EF is the log of CARDGIFT and G_AVGG_NDA is the variable AVGGIFT bucketed. The variable CARD8EF was created in the Transform Variables node, and G_AVGG_NDA was created here in the Variable Selection node.


5. Select the R-square tab.

This tab shows the R-square for each effect with TARGET_B. Note that some of the effects are interaction effects that will not be passed to the successor node.

6. Select the Effects tab.

This tab shows the total R-squared as each selected variable is added into the model.

7. Close the Results window.


Selection Using the Chi-square Criterion

1. Open the Variable Selection node. Select the Target Associations tab and then select the Chi-square criterion.

2. Click on the Settings… button and inspect the options.

As discussed earlier, variable selection is performed using binary splits for maximizing the chi-square values of a 2x2 frequency table. There are three settings that control parts of this process:

Bins This option determines the number of categories in which the range of each interval variable is divided for splits. By default, interval inputs are binned into 50 levels.

Chi-square This option governs the number of splits that are performed. This value is a minimum bound for the chi-square value to decide whether it is eligible for making a variable split. By default, the chi-square value is set to 3.84. As you increase the chi-square value, the procedure performs fewer splits.

Passes By default, the node makes 6 passes through the data to determine the optimum splits.

3. Select Cancel to close the Settings window without making any changes.

4. Close the Variable Selection node. Select Yes to save changes when prompted.


5. Rerun the flow from the Variable Selection node and select Yes to view the results when prompted.

6. Click on the Role column heading to sort the variable by their assigned roles. Then click on the Rejection Reason column heading. Inspect the results.

Fifteen variables have been retained as input variables.

If the resulting number of variables is too high, consider increasing the chi-square cutoff value. Increasing the chi-square cutoff value generally reduces the number of retained variables.

chapter 4 variable selectionrt/cs5710/f07/notes/chap4... · 2007-11-06 · 4.1 variable selection...

Documents