
Two-Stage Modeling Using Enterprise Miner™ Software

Course Notes


Two-Stage Modeling Using Enterprise Miner™ Software Course Notes was developed by Jim Georges. Additional contributions were made by Bob Lucas and Mike Patetta. Editing and production support was provided by the Curriculum Development and Support Department.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Two-Stage Modeling Using Enterprise Miner™ Software Course Notes

Copyright 2002 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code 59185, course code PMMS, prepared date 17Oct02.


Table of Contents

Course Description ...................................................................................................................... iv

Prerequisites ................................................................................................................................. v

General Conventions ................................................................................................................... vi

Chapter 1 Constructing a Two-Stage Predictive Model ...................................... 1-1

1.1 Introduction......................................................................................................................1-3

1.2 Defining Case Conditional Profits .................................................................................1-14

1.3 Constructing Component Models ..................................................................................1-23

1.4 Improving Two-Stage Predictions..................................................................................1-43

1.5 Joint Predictive Models..................................................................................................1-75

Chapter 2 Explaining a Two-Stage Model............................................................. 2-1

2.1 Types of Explanations......................................................................................................2-3

2.2 Explaining with Trees ......................................................................................................2-4

2.3 Explaining by Example....................................................................................................2-9

2.4 Creating a Surrogate Model ...........................................................................................2-12


Course Description

This course continues the development of predictive models that begins in the Predictive Modeling Using Enterprise Miner™ Software course. Students learn to construct and evaluate two-stage and other multi-stage models using Enterprise Miner. Without using multi-stage modeling techniques, businesses may inaccurately estimate customer value, which results in decisions that adversely affect profits. This course teaches the most appropriate analytic techniques for a particular campaign.

To learn more…

A full curriculum of general and statistical instructor-based training is available at any of the Institute’s training facilities. Institute instructors can also provide on-site training.

For information on other courses in the curriculum, contact the SAS Education Division at 1-919-531-7321, or send e-mail to [email protected]. You can also find this information on the Web at www.sas.com/training/ as well as in the SAS Training Course Catalog.

For a list of other SAS books that relate to the topics covered in this Course Notes, USA customers can contact our SAS Publishing Department at 1-800-727-3228 or send e-mail to [email protected]. Customers outside the USA, please contact your local SAS office.

Also, see the Publications Catalog on the Web at www.sas.com/pubs for a complete list of books and a convenient order form.


Prerequisites

Before attending this course, you should

• have completed the Predictive Modeling Using Enterprise Miner™ Software course

• have some experience with creating and managing SAS data sets, which you can gain from the SAS® Programming I: Essentials course.

It is also recommended that you have completed the Neural Network Modeling course.


General Conventions

This section explains the various conventions used in presenting text, SAS language syntax, and examples in this book.

Typographical Conventions

You will see several type styles in this book. This list explains the meaning of each style:

UPPERCASE ROMAN is used for SAS statements, variable names, and other SAS language elements when they appear in the text.

italic identifies terms or concepts that are defined in text. Italic is also used for book titles when they are referenced in text, as well as for various syntax and mathematical elements.

bold is used for emphasis within text.

monospace is used for examples of SAS programming statements and for SAS character strings. Monospace is also used to refer to field names in windows, information in fields, and user-supplied information.

select indicates selectable items in windows and menus. This book also uses icons to represent selectable items.

Syntax Conventions

The general forms of SAS statements and commands shown in this book include only that part of the syntax actually taught in the course. For complete syntax, see the appropriate SAS reference guide.

PROC CHART DATA=SAS-data-set;
   HBAR | VBAR chart-variables </ options>;
RUN;

This is an example of how SAS syntax is shown in text:
• PROC and CHART are in uppercase bold because they are SAS keywords.
• DATA= is in uppercase to indicate that it must be spelled as shown.
• SAS-data-set is in italic because it represents a value that you supply. In this case, the value must be the name of a SAS data set.
• HBAR and VBAR are in uppercase bold because they are SAS keywords. They are separated by a vertical bar to indicate they are mutually exclusive; you can choose one or the other.
• chart-variables is in italic because it represents a value or values that you supply.
• </ options> represents optional syntax specific to the HBAR and VBAR statements. The angle brackets enclose the slash as well as the options because if no options are specified you do not include the slash.
• RUN is in uppercase bold because it is a SAS keyword.


Chapter 1 Constructing a Two-Stage Predictive Model

1.1 Introduction.....................................................................................................................1-3

1.2 Defining Case Conditional Profits...............................................................................1-14

1.3 Constructing Component Models...............................................................................1-23

1.4 Improving Two-Stage Predictions...............................................................................1-43

1.5 Joint Predictive Models ...............................................................................................1-75



1.1 Introduction

Predictive Modeling Example

Business: National veterans' organization
Objective: From a population of lapsing donors, identify individuals worth continued solicitation.
Source: 1998 KDD-Cup Competition via UCI KDD Archive

A national veterans’ organization seeks to better target its solicitations for donation. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns. Solicitations involve sending a small gift to an individual together with a request for donation. Gifts include mailing labels and greeting cards.

The organization has more than 3.5 million individuals in its mailing database. These individuals have been classified by their response behavior to previous solicitation efforts. Of particular interest is the class of individuals identified as lapsing donors. These individuals have made their most recent donation between 12 and 24 months ago. The organization has found that by predicting the response behavior of this group, they can use the model to rank all 3.5 million individuals in their database. With this ranking, a decision can be made to either solicit or ignore an individual in the current solicitation campaign. The current campaign refers to a greeting card mailing sent in June of 1997. It is identified in the raw data as the 97NK campaign.

The source of this data is the Association for Computing Machinery’s (ACM) 1998 KDD-Cup competition. In the competition, two data sets were provided to contestants. First, a training data set was provided with known response information. Second, a scoring data set was provided with response information removed. Contestants were tasked with building a model using the training data and deploying the model on the scoring data to select cases for solicitation. The judges then totaled the response amount for the selected cases minus a solicitation cost of $0.68 per case. The model with the highest net profit was declared the winner.

The data set and other details of the competition are publicly available at the UCI KDD Archive at http://kdd.ics.uci.edu.


1998 KDD-Cup Results

Rank    Total Profit    Overall Avg. Profit        Rank    Total Profit    Overall Avg. Profit
  1.       $14,712            $0.153                11.       $10,720            $0.111
  2.        14,662             0.152                12.        10,706             0.111
  3.        13,954             0.145                13.        10,112             0.105
  4.        13,825             0.143                14.        10,049             0.104
  5.        13,794             0.143                15.         9,741             0.101
  6.        13,598             0.141                16.         9,464             0.098
  7.        13,040             0.135                17.         5,683             0.059
  8.        12,298             0.128                18.         5,484             0.057
  9.        11,423             0.119                19.         1,925             0.020
 10.        11,276             0.117                20.         1,706             0.018

"Solicit everyone" model: total profit $10,560, overall avg. profit $0.110

The results of the 1998 KDD-Cup produced a surprise. Almost half of the entries yielded a total profit on the validation data that was less than that obtained by soliciting everyone. Soliciting everyone is the correct decision based on a model that assigns the prior probability to every case in the scoring population.
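To see why, consider the profit structure and population response rate introduced later in this chapter: a net profit of $14.62 for soliciting a responder, a cost of $0.68 for soliciting a non-responder, and a 5% population response rate. The expected profit of soliciting a randomly chosen case is then

   0.05 × $14.62 − 0.95 × $0.68 = $0.731 − $0.646 ≈ $0.085 > 0,

so a model that assigns every case the prior probability decides to solicit everyone.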

How did such an astonishing result occur? Part of the answer lies in the method used to select cases for solicitation.


Building a Response Model

In this demonstration, a model predicting response to solicitation is built from 1998 KDD-Cup competition data. A detailed preparation for the techniques used is offered in the course Predictive Modeling Using Enterprise Miner™ Software.

1. Open Enterprise Miner and create a project named PVA Analysis.

2. Create a diagram named Two-Stage Model in the PVA Analysis project.

3. Assemble the following diagram:

Open the Input Data Source node and make the following changes:

1. Select the data set CRSSAMP.PVA_RAW_DATA.

2. Change the model role for TARGET_B to target.


3. Change the model role for TARGET_D to rejected.

4. Change the measurement for all ordinal inputs to interval.

5. Add the following profit matrix and set its status to use:

Each row, or LEVEL, of the profit matrix represents a distinct value of TARGET_B: responder (1) and non-responder (0). Each column of the profit matrix defines a distinct course of action: solicit (1) and ignore (0). On the average, soliciting a responder (1,1) results in a net profit of $14.62, whereas soliciting a non-responder (0,1) results in a net profit of -$0.68. Ignoring responders (1,0) and non-responders (0,0) results in a net profit of zero.
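Under this profit matrix, the solicit decision has higher expected profit than the ignore decision whenever p × $14.62 > (1 − p) × $0.68, where p is the predicted response probability; equivalently, whenever p > 0.68 / 15.30 ≈ 0.044 ($15.30 being the fixed donation amount assumed for a responder before subtracting the $0.68 cost). Decisions based on the profit matrix reduce to exactly this comparison for each scored case.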

6. Add the following prior vector and set its status to use:


The column Prior Probability defines the proportion of each target value in the sample population. While PVA_RAW_DATA contains 25% responders and 75% non-responders, the population from which it was drawn contains 5% responders and 95% non-responders. The prior vector instructs Enterprise Miner to adjust the predicted probabilities to reflect the population response rates rather than the sample response rates.
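The adjustment has the same form as the p_adj calculation used in the assessment code later in this chapter. Writing p̂ for the unadjusted predicted probability, π1 = 0.05 for the population response rate, and ρ1 = 0.25 for the sample response rate, the adjusted probability is

   p_adj = (p̂ · π1/ρ1) / ( p̂ · π1/ρ1 + (1 − p̂) · (1 − π1)/(1 − ρ1) ).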

7. Close and save the changes to the Target Profiles and Input Data Source nodes.

Enterprise Miner is now configured to build predictive models using the PVA_RAW_DATA. The profit matrix is used to tune the predictive models to maximize profit and select individual cases worthy of solicitation.

Open the Data Partition node and make the following changes:

1. Set the partition percentages to 50% Train and 50% Validation.

2. Select Stratified for the partition method.

3. Select the Stratification tab and set the status of TARGET_B to use.

4. Close and save changes to the Data Partition node.

Enterprise Miner will partition the data set PVA_RAW_DATA such that half will be used for training (fitting) predictive models and half will be used for validating the predictive models. The stratification option ensures the same proportion of responders and non-responders in each partition.
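The node performs the partitioning internally, but its effect is roughly what the following SAS sketch produces. The use of PROC SURVEYSELECT and the seed value are illustrative assumptions, not what the node actually runs.

   /* Illustrative only: 50/50 split stratified on TARGET_B */
   proc sort data=crssamp.pva_raw_data out=pva_sorted;
      by target_b;
   run;

   proc surveyselect data=pva_sorted out=pva_part
                     samprate=0.5 outall seed=44444;
      strata target_b;           /* same response rate in each half */
   run;

   /* Selected = 1 -> training partition, Selected = 0 -> validation partition */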


Next, open the Replacement node and make the following changes:

1. Select the Create imputed indicator variables option.

2. Select input as the default role for the imputed indicator.

3. Close and save changes to the Replacement node.

The Replacement node defines a method for filling in missing input values. If an interval input contains a missing value, the mean of the non-missing input values for the input replaces the missing value. If a nominal input contains a missing value, the most common non-missing level for the input replaces the missing value. The Create imputed indicator variables option creates a set of indicator inputs (one for each input in the training data) to distinguish the actual input values from the replaced values.
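The node generates its own score code, but for one interval input the effect is roughly the following sketch. The input DONOR_AGE, the indicator name, and the data set names are illustrative assumptions.

   /* Illustrative only: mean imputation plus an imputation indicator */
   proc means data=train noprint;
      var donor_age;
      output out=age_stats mean=age_mean;
   run;

   data train_imputed;
      if _n_ = 1 then set age_stats(keep=age_mean);
      set train;
      m_donor_age = (donor_age = .);            /* 1 if the value was imputed */
      if donor_age = . then donor_age = age_mean;
   run;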

Finally, open the Regression node and make the following change:

1. Select the Selection Method tab and select Stepwise for the method.

2. Close and save changes to Regression node.

3. Name the saved model Stepwise.

The Regression node will construct a sequence of logistic regression models using stepwise selection. Each model in the sequence will be evaluated using the profit matrix defined in the Input Data Source node. The model with the highest profit calculated on the validation data will be selected as the final model by the Regression node.

Run the diagram from the Regression node and view the results.


Select the Output tab and scroll up from the end of the procedure output.

This can be done quickly by typing Ctrl-END and then scrolling up.

The Output window displays the following information:

The DMREG Procedure

Summary of Stepwise Procedure

        Effect                         Number   Score        Wald        Pr >
Step    Entered                   DF   In       Chi-Square   Chi-Square  Chi-Square
  1     FREQUENCY_STATUS_97NK      1    1        201.5           .        <.0001
  2     PEP_STAR                   1    2         45.5465        .        <.0001
  3     INCOME_GROUP               1    3         37.1625        .        <.0001
  4     MONTHS_SINCE_LAST_GIFT     1    4         23.7966        .        <.0001
  5     MEDIAN_HOME_VALUE          1    5         16.2189        .        <.0001
  6     MONTHS_SINCE_FIRST_GIFT    1    6          9.8463        .        0.0017
  7     RECENT_CARD_RESPONSE_PROP  1    7         10.1530        .        0.0014
  8     RECENT_AVG_GIFT_AMT        1    8          7.9078        .        0.0049
  9     M_INCOME                   1    9          6.9136        .        0.0086
 10     DONOR_AGE                  1   10          6.0774        .        0.0137

The selected model, based on the CHOOSE=VDECDATA criterion, is the model trained in Step 8. It consists of the following effects:

Intercept FREQUENCY_STATUS_97NK INCOME_GROUP MEDIAN_HOME_VALUE MONTHS_SINCE_FIRST_GIFT MONTHS_SINCE_LAST_GIFT PEP_STAR RECENT_AVG_GIFT_AMT RECENT_CARD_RESPONSE_PROP

As stated in the output, the best model (based on validation profit) was the model from step 8 of the stepwise procedure. The output listing shows the variables included in the model.

Select the Statistics tab of the Results window and scroll to the end of the table.

The fit statistic _APROF_ shows an overall average profit of about $0.171 per case in the training data and $0.155 per case in the validation data.

It seems this model would have been quite competitive in the 1998 KDD Cup. A more definitive assessment requires scoring the actual competition data. Enterprise Miner can easily accommodate such a calculation.


Add the Input Data Source, Score, and SAS Code nodes to the diagram as shown:

Open the Input Data Source node and make the following changes:

1. Select the data set CRSSAMP.PVA_SCORE_DATA.

2. Change the data set role to SCORE.

3. Close and save changes to the Input Data Source node.

Open the Score node and make the following change:

1. Select Apply training data score code to score data set.

2. Close and save changes to the Score node.

The new Input Data Source node selects the complete competition scoring data set (with 96,367 cases). The Score node applies the logistic regression model, and appends predicted probabilities, solicitation decisions, and other information to the selected scoring data.


The scoring data does not include the actual response amounts. These amounts must be merged with the scoring data and then summed to evaluate model performance. This will be done in the SAS Code node.

1. Open the SAS Code node.

2. Select File Import File. The Open window opens.

3. Select and open the SAS program merge and sum.sas.

   data score_results/view=score_results;
      merge &_SCORE crssamp.pva_results;
   run;

   proc sql;
      select sum(TARGET_D-0.68) as TOTAL_PROFIT
      from score_results
      where D_TARGET_B_='1';
   quit;

The DATA step merges the scored scoring data (identified by the macro variable &_SCORE) with the actual response amounts (CRSSAMP.PVA_RESULTS). The data sets are both sorted by CONTROL_NUMBER.

The SQL statements sum the actual response amounts less the $0.68 solicitation cost for every case with positive expected profit (D_TARGET_B_ = ‘1’).

1. Close the SAS Code node.

2. Run the diagram from the SAS Code node and view the results.

3. Select the Output tab and scroll horizontally to view TOTAL_PROFIT .

The total profit is substantially lower than the profits obtained by most of the entrants in the 1998 KDD-Cup. More unexpectedly, it is substantially lower than the profit obtained by the prior model’s solution of mailing to everyone.

1. Close the Results-Code Node window and the SAS Code window.

2. View Results for the Regression node attached to the Score node.

3. Select the Output tab and scroll to the odds ratio estimates for the final model.


Odds Ratio Estimates

Input                           Odds Ratio
frequency_status_97nk              1.195
income_group                       1.080
median_home_value                  1.000
months_since_first_gift            1.003
months_since_last_gift             0.972
pep_star 0 vs 1                    0.811
recent_avg_gift_amt                0.992
recent_card_response_prop          1.592

Note that the odds ratio for RECENT_AVG_GIFT_AMT equals 0.992. This means that the more money an individual donates, on average, the less likely he or she is to donate. How much less? From the odds ratio, the donation odds for an individual who always donates $200 are (0.992)^195 ≈ 0.209 times the donation odds of one who always donates $5.

This is unfortunate for the veterans’ organization. Because everyone is assumed to give the fixed donation amount of $15.30, the case selection process will favor the $5 donor over the $200 donor.

The clear solution to this problem is to not only model response probability, but also model response amount. In this way individuals likely to make large donations will tend to be selected because their high donation amounts will compensate for their low donation probability. Likewise, individuals likely to make meager donations will tend to be ignored because their low donation amounts will compensate for their high donation probabilities.

These models show up in the literature under the name two-stage models or limited dependence models. Other commonly used names include double hurdle models, zero-inflated models, bivariate Tobit models, dual response models, and component models.


Improving Response Modeling

Before proceeding with a general discussion on two-stage models, consider the dramatic effect combining response probability estimates with response amount estimates can have on model performance.

1. Close the Regression-Results window.

2. Open the Code node.

3. Change the WHERE clause in the SQL code to

   where P_TARGET_B1*RECENT_AVG_GIFT_AMT > 0.68;

This simple modification allows the expected donation amount to vary from case to case. It assumes an individual’s expected donation amount will equal the average of his or her recent donations. The only cases selected are those for which the probability of donation multiplied by the expected donation amount exceeds $0.68.

4. Run the code node and examine the results.

The simple modification has changed the performance of the model from one of the worst to one of the best. Indeed, only $100 separates the total profit obtained by this model from the total profit obtained by the winner of the KDD-Cup competition.

The estimate used for donation amount was rather primitive. By building a model to predict donation amount, you will obtain even better results.


1.2 Defining Case Conditional Profits

Slide: Solicit or Ignore — a balance diagram weighing the solicit profits ($14.62 for a responder, -$0.68 for a non-responder) against the ignore profit of 0, with each outcome weighted by its probability p(x).

The decision to solicit or ignore can be thought of as a balancing act involving the expected donation probability, the expected donation amount, and the cost of solicitation. If the product of the expected donation probability and the expected donation amount exceeds the product of the expected non-donation probability and the solicitation costs, then the correct decision is solicit. Otherwise the correct decision is ignore. In other words, the correct decision is the one that yields the highest expected profit.

Slide: Decisions in the Balance — the same balance in general form: each decision profit P is weighted by its outcome probability p(x), with the zero-profit passive decision at the pivot.

Many modeling scenarios can be expressed as a competition between an active decision, which may result in a positive or negative expected profit, versus a passive one, which will always result in zero profit (for example, by doing nothing). The zero-profit decision serves as a balance point around which the active decision rotates: that is, the more the force on the positive profit side of the decision balance,


the more favorable the active decision. The amount of probability (mass) on each side of the decision balance is leveraged by the relative profit of each outcome. A low probability outcome can be highly favored if the profit for the outcome is large enough in magnitude.

Slide: Expected Profits — the balance diagram with each constant decision profit replaced by its expected value given the outcome, E(Profit | outcome).

In general the profits associated with an outcome are not constants but random variables with some assumed distribution. The constant supplied in a profit matrix is the expected value of the random variable given the outcome. In this way, information about the outcome’s profit is averaged across all cases.

This is the default behavior of Enterprise Miner. As has been seen, ignoring case-to-case variation in an outcome's profit can yield inferior results, especially when an adverse association exists between the expected outcome probability and its corresponding expected profitability.

Slide: Case Conditional Expected Profits — the balance diagram with outcome probabilities supplied by a class model and profits supplied by a value model, giving case conditional expected profits E(Profit | outcome, X=x).


An obvious generalization is to allow an outcome’s profit to vary by case. This greatly enhances a model’s ability to correctly predict the profit associated with a decision.

In fact, the modeling process has been divided into two separate parts. A class model is fit to establish the probability of each outcome. A value model is fit to establish the profit conditioned on the outcome and the input measurements of a given case.

Slide: A Two-Stage Model — the solicitation balance with the responder profit given by E(TARGET_D | TARGET_B=1, X=x), the non-responder profit of -$0.68, the ignore profit of 0, and outcomes weighted by the probabilities p(x).

The decision to solicit or ignore will be based on two separate models: one model predicting the probability of donation, the other predicting the expected donation amount. If donation probability multiplied by donation amount is higher than non-donation probability multiplied by solicitation costs, the balance favors the solicit decision. Otherwise, the balance favors the ignore decision.
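In symbols, the balance favors solicitation when

   p(x) · E(TARGET_D | TARGET_B=1, X=x) > (1 − p(x)) · $0.68,

where p(x) is the predicted donation probability for the case.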

Slide: Defining Two-Stage Model Components — the components E(B|X) and E(D|X) can come from specified values (for example, a fixed donation amount of $15.30), from separate predictive models, or from a joint predictive model E(B,D|X).

There are several ways to estimate the components used in two-stage models. The first is to simply specify values for certain components. For example, you can assume a fixed donation amount for the value model. This is simple to do, but it often


produces poor results. In a somewhat more sophisticated approach, you can use the value in an input or a look-up table as a surrogate for expected donation amount.

The most common approach, however, is to estimate values for components with individual models. This approach is examined in this and the next several sections.

At the extreme end of the sophistication scale, you can use a single model to predict both components simultaneously. For many applications, this may require custom algorithm development. An example of this approach is shown at the end of the chapter.


Using the Two-Stage Model Node

There are several ways to create component models in Enterprise Miner. The first uses the Two Stage Model tool. This tool builds two models, one to estimate donation propensity and one to estimate donation amount. While this tool collapses the two models to a single node, there are some limitations to its use.

The two-stage model approach requires one change to the model metadata. Because you will build a model to predict donation amount, you must specify a donation amount target.

1. Open the Input Data Source node and select the Variables tab.

2. Set the model role for TARGET_D to target. There are now two variables with the model role of target: TARGET_B and TARGET_D.

3. Close and save changes to the Input Data Source node.

The Two Stage Model tool builds two models, one to predict TARGET_B and one to predict TARGET_D. Theoretically, you can use this node to combine predictions for the two target variables and get a prediction of expected donation amount.

Replace the Regression node in the diagram with a Two Stage Model node as shown:

Unfortunately, the node has several major limitations:

• There is very limited control over the models fit in the node. The same inputs are used in both models. There is no way to tune regression or neural network models to eliminate irrelevant or redundant inputs.

• The node does not recognize the prior vector. Thus, because responders are over-represented in the training data, the probabilities in the TARGET_B model are biased.

• The node has no built-in diagnostic to assess overall average profit. Profit information passed to the Assessment node is incorrect.

You can either ignore the first limitation or overcome it by using decision tree models. To overcome the last two limitations (and still use the node), add a SAS Code node that correctly calculates total profit and overall average profit for the training and validation data sets.

1. Open the Two Stage Model node.


2. Select the Output tab.

3. Check the Process or Score: Training, Validation, and Test check box.

4. Select the Settings tool bar button, . The Two Stage Model Settings window opens.

5. Change the class model selection to Regression and select OK.

6. Close and save changes to the Two Stage Model node.

7. Add a SAS Code node to the diagram after the Two Stage Model node.

8. Open the SAS Code node and import the file two stage assess-train and valid.sas.

The program begins with control macro variables. The variables PI1 and RHO1 specify, respectively, the population prior and the sample response probabilities. ADJUST_PROBS and ADJUST_PROFITS determine whether the predicted probabilities and profit calculations, respectively, should be adjusted for separate sampling. The next four macro variables name (in order) the binary target, its predicted probability, the interval target, and its predicted value. Finally, COST defines the fixed cost associated with a solicit decision.

%let pi1= 0.05;
%let rho1= 0.25;
%let adjust_probs= yes;
%let adjust_profits= yes;
%let response= target_b;
%let p_response= p_target_b1;
%let amount= target_d;
%let p_amount= p_target_d;
%let cost= 0.68;


Next is a macro function definition named two_stage_assess. It calculates the total profit and the overall average profit for the specified data set. Appropriate adjustments to the class and value model predictions are made based on the values of ADJUST_PROBS and ADJUST_PROFITS.

%macro two_stage_assess(dataset,out);
   %if (&pi1=) %then %let pi1=&rho1;
   %if (&adjust_probs=yes) %then %let rho1_probs=&rho1;
   %else %let rho1_probs=&pi1;
   %if (&adjust_profits=yes) %then %let rho1_profits=&rho1;
   %else %let rho1_profits=&pi1;

The DATA step code creates a data set to contain the total profit and overall average profit. The program first verifies that both the TARGET_B and TARGET_D models have scored the indicated data set. It then adjusts the predicted donation probabilities and determines the expected profit. If the expected profit is positive, the actual response in AMOUNT is adjusted for separate sampling and added to the running profit total. On the last observation, total profit and overall average profit are calculated and output. The macro ends with a printout of the calculated profits.

data profit(keep=source overall_average_profit total_profit)
   %if &out~= %then %do;
      &out(drop=source overall_average_profit total_profit)
   %end;
   ;
   set &dataset end=last;
   retain source "&dataset" total_profit 0;
   if &p_response=. or &p_amount=. then do;
      if &p_response=. then
         put "ERROR: &dataset not scored by &response model.";
      if &p_amount=. then
         put "ERROR: &dataset not scored by &amount model.";
      stop;
   end;
   p_adj=(&p_response*&pi1/&rho1_probs)/
         ((1-&p_response)*(1-&pi1)/(1-&rho1_probs)+
          &p_response*&pi1/&rho1_probs);
   expected_profit=p_adj*&p_amount - &cost;
   weight=sum(&pi1/&rho1_profits*&response,
              (1-&pi1)/(1-&rho1_profits)*(1-&response));
   decision = (expected_profit > 0);
   if decision = 1 then
      total_profit + sum((&amount-&cost)*&pi1/&rho1_profits*&response,
                         -&cost*(1-&pi1)/(1-&rho1_profits)*(1-&response));
   %if &out~= %then %do;
      output &out;
   %end;
   if last then do;
      overall_average_profit=total_profit/_n_;
      output;
   end;
run;


proc print data=profit;
run;

%mend two_stage_assess;

The code ends with two calls to the two_stage_assess macro, one for the training data and one for the validation data.

%two_stage_assess(&_train);
%two_stage_assess(&_valid);

1. Close the SAS Code node, save the changes, and run the entire diagram.

2. View the SAS Code results, select the Output tab, and scroll to view the complete output.

The reported overall average profit numbers are in the top three of the KDD-Cup entrants. Such a comparison, however, may be misleading: the values are from different data sets. To see how the model would have fared in the competition, you must score the competition test data.

1. Open the SAS Code node (attached to the Score node) and import the file two stage assess-score.sas.

The program is virtually identical to the two-stage assess program for the training and validation data. Here, however, the predicted probabilities and profits are not adjusted (the score data is not separately sampled). Also, the score data set requires the actual target values to be appended to the input values.

2. Close and save changes to the SAS Code node.

3. Run the diagram from this SAS Code node, view the results, and select the Output tab. The total profit of $12,561 places eighth in the KDD-Cup rankings.

This lackluster performance is primarily the result of overfitting. By including redundant and irrelevant inputs, the model "discovers" patterns found only in the training data.


Investigate whether a more selective modeling methodology such as trees improves the predictions.

1. Open the Two Stage Model node and select the Settings tool bar button, .

2. Change both the class model and value model to Tree and select OK.

3. Close and save changes to the Two Stage Model node.

4. Run the diagram from the Code node attached directly to the Two Stage Model node and view the results.

5. Select the Output tab.

The estimated overall average profit is lower than for the regression models on both the training and the validation data. With regression models, the problem was overfitting. With tree models, the problem is underfitting.

In practice, improving your predictive results requires careful fitting of both the class and value models. The best results are obtained when this fitting is done separately for each component model.


1.3 Constructing Component Models

Slide: Two-Stage Modeling Challenges — model assessment, interval model specification E(D) = g(x;w), and model coupling E(D|X,p̂).

Constructing a two-stage model (or, more generally, any multiple-component model) requires attention to several challenges not previously encountered.

Earlier modeling assessment efforts evaluated models based on profitability measures, assuming a fixed profit structure. Because the profit structure itself is being modeled in a two-stage model, you will need a different mechanism to assess model performance.

Optimal predictive performance is obtained when models are correctly specified. Correct specification requires appropriately chosen inputs, link function, and target error distribution.

By incorporating the predictions of the binary model into the interval model, it may be possible to make a more parsimonious specification of the interval model.


Slide: Estimating Mean Squared Error — a plot of D against X for the training data with a fitted prediction D̂. The mean squared error MSE = E[(D − D̂)²] is estimated as

   Estimated MSE = (1/N) Σi (Di − D̂i)²,  summed over i = 1, …, N.

Mean squared error (MSE) is a commonly employed method of establishing and assessing the relationship between inputs and the expected value of the target. MSE is estimated from a sample by differencing model predictions from observed target values, squaring these differences, and averaging across all data points in the sample.
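As a concrete illustration (the data set and variable names are assumptions, not output from any particular node), the estimate can be computed from a scored data set that contains both the actual target and its prediction:

   /* Illustrative only: estimated MSE for an interval target */
   proc sql;
      select mean((target_d - p_target_d)**2) as estimated_mse
      from valid_scored;           /* scored validation data */
   quit;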

Slide: MSE Decomposition: Variance — the MSE decomposes as

   E[(D − D̂)²] = E[(D − E[D])²] + [E(D̂ − E[D])]²,

with the first term, the residual variance, highlighted.

In theory, the MSE can be decomposed into two components, each involving a deviation from the true (but, alas, unknown) expected value of the target variable.

The first of these components is the residual variance of the target variable. This term quantifies the theoretical limit of prediction accuracy and the absolute lower bound for the MSE. The variance component is independent of any fitted model.


Slide: MSE Decomposition: Squared Bias — the same decomposition,

   E[(D − D̂)²] = E[(D − E[D])²] + [E(D̂ − E[D])]²,

with the second term, the squared bias, highlighted.

The second MSE component is the average prediction bias squared. This term quantifies the difference between the predicted and actual expected value of the target.

Slide: Honest MSE Estimation — the same estimate,

   Estimated MSE = (1/N) Σi (Di − D̂i)²,

computed on the validation data rather than the training data.

As always, you must be careful to obtain an unbiased estimate of MSE. MSE estimates obtained from the data used to fit the model will almost certainly be overly optimistic. Estimates of MSE from an independent validation data set allow for an honest assessment of model performance.


Slide: MSE and Binary Target Models — for a binary target B with predicted probability B̂, the validation data estimate is

   Estimated MSE = (1/N) Σi (Bi − B̂i)²,

and the decomposition

   E[(B − B̂)²] = E[(B − E[B])²] + [E(B̂ − E[B])]²

is interpreted as inaccuracy = inseparability + imprecision (the variance and squared-bias components, respectively).

While MSE is an obvious choice for comparing interval target models, it is also useful for assessing binary target models (Hand 1997). The estimated MSE can be thought of as measuring the overall inaccuracy of model prediction. This inaccuracy estimate can be decomposed into a term related to the inseparability of the two target levels (corresponding to the variance component) plus a term related to the imprecision of the model estimate (corresponding to the bias-squared component). In this way, the model with the smallest estimated MSE will also be the least imprecise.

Slide: Two-Stage Modeling Challenges — model assessment (use validation MSE), interval model specification E(D) = g(x;w), and model coupling E(D|X,p̂).

In summary, to assess both the binary and the interval component models, it is reasonable to compare their validation data mean squared error. Models with the smallest MSE will have the smallest bias or imprecision.


Stage One: The Binary Target Model

The Two Stage Model tool’s limited ability to select inputs drastically reduces it potential for building good two-stage models. By building the component models separately, however, you can easily overcome this limitation and construct outstanding two-stage predictive models.

The binary target prediction model comes first. This is fit as before except that the goal has shifted from making a good decision to making an unbiased probability prediction. To achieve this goal, you must change the model assessment criterion from profit to validation error.

You will need some room on the diagram for the analyses that follow.

1. To make room to build the component models, break the connection between the Score node and the Two Stage Model node.

2. Move aside the Score Input Data Set node, the Score node, and the SAS Code node.

3. Connect the Replacement node to a Control Point node, and move the Control Point node under Input Data Source. You will add several nodes to the diagram. This helps to keep the diagram’s appearance clean and understandable.


Previous experience with PVA_RAW_DATA suggests that parametric models such as regressions and neural networks will outperform predictive algorithms such as decision trees. Testing both a tree and a regression model is nevertheless worthwhile, because, at a minimum, it will illustrate the settings required for each model type.

The Tree tool must be configured to use MSE as the fitting criterion. The easiest way to accomplish this is to adjust the model metadata.

This metadata adjustment only applies to Tree models.

1. Connect a Data Set Attributes, Tree, and Regression node to the diagram as shown.

While it is possible to connect the tree model directly to the Data Partition node, for the purposes of this demonstration, it will be connected to the Replacement node.

2. Open the Data Set Attributes node and change the measurement scale of TARGET_B to interval.


3. Close and save changes to the Data Set Attributes node. An information window opens.

Because the measurement scale of TARGET_B is no longer compatible with the target profile, subsequent analysis steps (in this branch of the process flow diagram) will ignore the target profile. If the tree model is ultimately used in the final two-stage model, its predicted probabilities will require adjustment.

4. Select OK to proceed.

Now, configure the Tree node for modeling propensity to donate.

1. Open the Tree node. The Tree Target Selector window opens.

Because the Tree node only accepts a single target variable, you must select which of the two target variables will be used in the tree analysis.

2. Select TARGET_B OK. The Tree settings window opens.

3. Select the Basic tab.

The Splitting criteria are set to be consistent with the target’s interval measurement definition. The F-test measures the worth of a split by dividing the split-induced variance reduction by the residual variance after the split. Obtaining a p-value from an F-distribution with the appropriate degrees of freedom and


taking the negative log results in a logworth measurement. The variance reduction splitting criterion is similarly obtained, except the split-induced variance is not divided by the residual variance.
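That is, logworth = −log(p-value) (conventionally a base-10 logarithm), so smaller p-values correspond to larger, more favorable logworth values.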

4. Close and save changes to the Tree node.

The Tree node is now configured to build a tree that minimizes MSE on the validation data. Such a tree is known as a class probability tree (Breiman et al. 1984).

1. Run the Tree node and view the results.

The fit table and plot illustrate the tradeoff between model complexity and MSE. The 12-leaf tree has the minimum validation MSE, approximately 0.1838.

2. View the Tree diagram.

Each node shows the number of observations and proportion of response in the training data (left column) and validation data (right column).

3. Close the Tree Diagram and Tree: Results windows.

As mentioned earlier, previous modeling experience suggests the regression model will outperform the tree model. To test this hypothesis, configure the Regression node for modeling.

1. Open the Regression node. The Regression: Target Selector window opens.


2. Select TARGET_B OK. The Linear and Logistic Regression settings window opens.

3. Select the Selection Method tab and select Stepwise for the method.

4. Select Validation Error for the criterion. The model selected from the stepwise sequence will have the smallest validation MSE.

5. Select the Score node and select the option Process or Score: Training, Validation, and Test. You will need the predicted probabilities later when assessing the model.

6. Run the Regression node. Name the model TARGET_B when prompted.

7. View the results and select the Statistics tab.


The model has a validation MSE of 0.1815, which is smaller than the Tree model’s value of 0.1838. As expected, the Regression model shows better performance than the Tree model.

8. Select the Output tab and review the model fit results. The odds ratios for the selected model are shown below. The model has eight inputs, similar to those found in the stepwise regression model of Section 1.

Odds Ratio Estimates

Input                           Odds Ratio
frequency_status_97nk              1.195
income_group                       1.080
median_home_value                  1.000
months_since_first_gift            1.003
months_since_last_gift             0.972
pep_star 0 vs 1                    0.811
recent_avg_gift_amt                0.992
recent_card_response_prop          1.592

9. To minimize diagram clutter, delete the Data Set Attributes and Tree nodes.


Stage Two: The Interval Target Model

Attention now shifts from the binary target model to the interval target model. Starting with a simple linear regression model will illustrate some of the modeling challenges that must be overcome.

1. Label the existing Regression node TARGET_B Regression.

2. Connect another Regression node to the TARGET_B Regression node.

3. Label the new Regression node TARGET_D Regression.

4. Open the TARGET_D Regression node and select TARGET_D as the target variable of interest.

Most of the variables have a status of don’t use and a model role of rejected. This is a result of the variable selection feature of the TARGET_B Regression node.

5. Close the Regression node.

6. Insert a Data Set Attributes node between the Regression nodes.


7. Open the Data Set Attributes node.

8. Select the training data set. The training data set has the name EMDATA.STRNxxxx, where xxxx are four random characters.

9. Select the Variables tab.

10. Select all the variables with the exception of TARGET_B, TARGET_D, and CONTROL_NUMBER.

11. Set the model role to input.

12. Close the Data Set Attributes node and save the changes.

All inputs are now available for use in the TARGET_D Regression node.

1. Open the TARGET_D Regression node.

2. Select the Selection Method tab.


3. Select Stepwise as the method and Validation Error as the criterion.

4. Select the Output tab.

5. Select Process or Score: Training, Validation, and Test.

6. Close the TARGET_D Regression node, save changes, and name the model TARGET_D.

7. Run the diagram from the TARGET_D Regression node and view the results.

The Output tab shows the final model to have seven inputs:

Analysis of Parameter Estimates

                                       Standard
Parameter                  DF  Estimate   Error    t Value  Pr > |t|
Intercept                   1   13.9215   1.0223    13.62    <.0001
FREQUENCY_STATUS_97NK       1   -0.9582   0.1409    -6.80    <.0001
MONTHS_SINCE_FIRST_GIFT     1   -0.0393   0.00421   -9.33    <.0001
RECENT_AVG_GIFT_AMT         1    0.5630   0.0259    21.73    <.0001
LIFETIME_AVG_GIFT_AMT       1   -0.1784   0.0510    -3.50    0.0005
LIFETIME_MIN_GIFT_AMT       1   -0.2311   0.0424    -5.45    <.0001
LAST_GIFT_AMT               1    0.4462   0.0216    20.63    <.0001
M_MONTH1                    1   -4.5553   0.9409    -4.84    <.0001

Not surprisingly, most of the inputs relate to previous donation amounts. The signs on the donation amount inputs are surprisingly mixed. The expected donation amount increases with LAST_GIFT_AMT and RECENT_AVG_GIFT_AMT, but decreases with LIFETIME_AVG_GIFT_AMT and LIFETIME_MIN_GIFT_AMT. More donations in the last two years (FREQUENCY_STATUS_97NK) implies a smaller expected donation amount in the 97NK campaign, and the greater the elapsed time since the first donation (MONTHS_SINCE_FIRST_GIFT), the smaller the expected donation amount. The missing indicator for MONTHS_SINCE_LAST_PROM_RESP is a surprise. If MONTHS_SINCE_LAST_PROM_RESP is missing, the expected donation amount decreases by more than $4.50.

The Statistics tab shows a training and validation MSE of 51.95 and 86.10, respectively. This seems to be a rather large discrepancy. Large discrepancies in training and validation performance often signal poor generalization potential. This is discussed in more detail shortly.


Accurate Two-Stage Model Assessment

Now that you have both parts of your two-stage model, it is time to see how well they work together.

To evaluate the combined models' performance, you will need a code node like the one used to evaluate the Two Stage Model node earlier. In fact, it would be extremely convenient if such a node could become part of the Enterprise Miner tool list.

By using the Clone node option, you can make a new tool specifically for evaluating two-stage models of the type studied in this class.

1. Close the Regression-Results window.

2. Right-click the SAS Code node attached to the Two Stage Model node and select Clone… The Clone current node window opens.

3. Type Assess Two-Stage Model for the description.

4. Click the Image selector, . The Select image window opens.


5. Select an appropriate icon from the list and select OK. For example, is found near the middle of the list.

6. Select OK in the Clone current node window.

Select the tools palette in the project navigator area and scroll the tools list to the bottom. A new tool has been added to the palette.

This tool is stored at the project level, which means that any diagram in the current project can utilize the tool.

1. Connect a newly created Assess Two-Stage Model tool to the TARGET_D Regression node.

2. Open the Assess Two-Stage Model node and make the following change:

   %let adjust_probs= no;

Unlike the Two Stage Model node, the Regression TARGET_B node automatically adjusts its predicted probabilities based on the specified prior probabilities.

3. Run the Assess Two Stage Model node and view the results.

4. Select the Output tab.

   Obs    source             total_profit    overall_average_profit
    1     EMDATA.STRNP1HA        1359.23            0.14034

   Obs    source             total_profit    overall_average_profit
    1     EMDATA.SVALOMMC        1705.82            0.17609

The Output tab shows a surprising result. The total and overall average profits are considerably higher for the validation data than for the training data. When this


happens, there is usually an unbalanced allocation of target values between the training and validation data sets.

Close the Assess Two Stage Model node.

1. Connect a SAS Code node to the Data Partition node.

2. Open the SAS Code node and type the following program.

   proc sql;
      select count(*) from &_train where TARGET_D>50;
      select count(*) from &_valid where TARGET_D>50;
   quit;

3. Run the node and view the Output tab of the Results window. The results show that there are nearly twice as many large donors in the validation data as in the training data.

The data, when partitioned, was balanced with respect to TARGET_B. This was adequate for building a donation propensity model, but not for building a donation amount model. You must take care to balance both TARGET_D and TARGET_B values in the training and validation data.

Enterprise Miner makes this relatively simple to do. You create a variable that partitions TARGET_D into five ranges and then stratify within each range.

1. Delete the SAS Code node.

2. Open a space in the diagram between the Input Data Source node and the Data Partition node.


3. Place a Transform Variables node in the space created.

4. Open the Transform Variables node.

5. Right-click TARGET_D and select Transform… Bucket. The Input Number window opens and prompts you for the number of buckets to create.

6. Select 5 for the number of buckets.

7. Select Close. The Select values window opens, which enables you to enter the boundaries for the bucket ranges.

Create five ranges for donation amount: less than 10, 10 to 15, 15 to 20, 20 to 50, and more than 50. The idea is to balance the bin counts (more or less) and have a bin for large donations.

1. Select 1 for the bin and type 10 for the value.

2. Select 2 for the bin and type 15 for the value.

3. Select 3 for the bin and type 20 for the value.

4. Select 4 for the bin and type 50 for the value.


The resulting partition appears as shown.

5. Close the Select Values window and save the changes.

Notice that the Keep status of TARGET_D has changed to No and the Keep status of TARG_xxx is set to Yes.

This causes TARGET_D to be dropped from the training and validation data sets. You need TARGET_D to evaluate the models.

6. Set the Keep status of TARGET_D to Yes.

7. Close the Transform Variables window and save the changes.
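The Transform Variables node generates this grouping for you. As a rough DATA step sketch of the same logic (the bucket variable name and the handling of missing TARGET_D values for non-responders are assumptions, not what the node emits):

   /* Illustrative only: five TARGET_D ranges for stratification */
   data pva_bucketed;
      set crssamp.pva_raw_data;
      if      target_d = .   then targ_d_bkt = 0;   /* non-responders: TARGET_D missing */
      else if target_d <  10 then targ_d_bkt = 1;
      else if target_d <  15 then targ_d_bkt = 2;
      else if target_d <  20 then targ_d_bkt = 3;
      else if target_d <= 50 then targ_d_bkt = 4;
      else                        targ_d_bkt = 5;
   run;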

You now use the TARGET_D bucket variable in combination with the TARGET_B variable to properly balance the training and validation data.

1. Open the Data Partition node.

2. Select the Stratify tab.

3. Set the status of the bucketed TARGET_D variable to use.

4. Close and run the Data Partition node. You need not view the results.

The data is now partitioned along both TARGET_B and TARGET_D. However, this has changed the training data. The resulting models may change as well.


1. Run the Regression TARGET_B node and view the results.

2. Select the Output tab and scroll down to the final model.

Odds Ratio Estimates

Input                           Odds Ratio
FREQUENCY_STATUS_97NK              1.209
INCOME_GROUP                       1.079
MEDIAN_HOME_VALUE                  1.000
MONTHS_SINCE_FIRST_GIFT            1.003
MONTHS_SINCE_LAST_GIFT             0.971
PEP_STAR 0 vs 1                    0.801
RECENT_CARD_RESPONSE_PROP          1.680

The selected inputs are largely the same. The only difference of note is the absence of RECENT_AVG_GIFT_AMT.

1. Open the Data Set Attributes node to verify the model role of all variables.

Because the TARGET_B model did not select RECENT_AVG_GIFT_AMT, its status has been set to rejected. Leaving this important input out of the TARGET_D model could have dire consequences.

2. Change the new model role for RECENT_AVG_GIFT_AMT to input.

3. Close and save the changes to the Data Set Attributes node.

Now investigate the effect of stratifying the training data on the Regression TARGET_D model.

1. Run the Regression TARGET_D model and view the results.

2. Select the Output tab and scroll to the report on the final model.

The model has substantially fewer inputs:

The selected model, based on the CHOOSE=VERROR criterion, is the model trained in Step 2. It consists of the following effects:

Intercept LAST_GIFT_AMT RECENT_AVG_GIFT_AMT


Both inputs relate to the magnitude of the most recent donations and have a positive effect on the expected donation amount:

Analysis of Parameter Estimates

   Parameter             DF   Estimate   Standard Error   t Value   Pr > |t|
   Intercept              1     3.1470           0.2824     11.14     <.0001
   LAST_GIFT_AMT          1     0.2179           0.0146     14.89     <.0001
   RECENT_AVG_GIFT_AMT    1     0.6456           0.0223     28.96     <.0001

Run the Assess Two-Stage Model node, view the results, and select the Output tab.

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.STRNP1HA         1515.27                  0.15644

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.SVALOMMC         1524.18                  0.15736

The training and validation profit values are virtually identical. Both are higher than the overall average profit of the winning KDD-Cup entry.

For the final test, see how the models compare to the KDD-Cup results on the competition data.

1. Connect the set-aside Score node to the Regression TARGET_D model.

2. Open the SAS Code node (attached to the Score node) and make the following change (as before) to the code:

   %let adjust_probs= no;

3. Close the SAS Code node, save the changes, and run the node.

4. View the results and select the Output tab.

   Obs   source          total_profit   overall_average_profit
     1   score_results       14950.65                  0.15514

Your model is more than $200 better than the winning 1998 KDD-Cup model.


1.4 Improving Two-Stage Predictions

Slide 23: Two-Stage Modeling Challenges. Model assessment: use validation MSE. Interval model specification: E(D) = g(x; w). Model coupling: E(D | x, p̂).

Up to this point, you have used a standard linear regression model to predict the expected value of TARGET_D. The model, once trained on properly stratified data, has proven to be sufficient to win the KDD-Cup. As a standard regression model, however, it may be ill-suited to accurately modeling the relationship between the inputs and TARGET_D.

Slide 24: Interval Model Requirements. Good inputs, positive predictions (E(D) > 0), correct error distribution, adequate flexibility.

Matching the structure of the model to the specific modeling requirements is vital to obtaining good predictions.

The interval component of a two-stage model is often used to predict a monetary response. Random variables representing monetary amounts usually assume a skewed distribution with positive range and a variance related to expected value.


When the target variable represents a monetary amount, you must account for this limited range and skewness in the model specification.

In the end, correctly specifying the target range and error distribution increases the chances of selecting good inputs for the interval target model. With good inputs, the correct degree of flexibility can be incorporated into the model, and predictions can be optimized.


Verifying Model Requirements

There are several requirements in need of verification for the interval regression model. One of the easiest ways to verify at least some of these assumptions is with a plot of model residuals versus predicted values. There are several ways to obtain such a plot, but perhaps the easiest is to simply open the TARGET_D Regression results browser.

1. Close the Assess Two-Stage Model node, if necessary.

2. Open the TARGET_D Regression results window and select the Plot tab.

3. Select Residual: TARGET_D for the Y-axis and Predicted: TARGET_D for the X-axis.

4. Resize the window to obtain a better aspect ratio for the plot.

The scatter plot shows all predicted values are positive, but there is nothing in the model structure preventing negative predictions from occurring. More ominously, there appears to be increasing variability in Residual: TARGET_D versus increasing magnitude of Predicted: TARGET_D. This phenomenon (known as heteroscedasticity) is caused by inadequate modeling of the error distribution.

Heteroscedastic residuals lead to biased estimates of parameter variance. These variance estimates are used by model selection procedures to gauge the importance of individual inputs. Incorrect selection of inputs leads to models with poor, or at best suboptimal, generalization characteristics.
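If you prefer to produce the same diagnostic outside the results browser, a residual-versus-predicted plot can be drawn directly from the scored training data. The sketch below assumes SAS/GRAPH is available and uses a placeholder name for the data set exported by the Regression node; the residual and prediction variable names follow the usual R_ and P_ naming convention.

   /* Hedged sketch: look for the fan shape that signals heteroscedasticity. */
   proc gplot data=emdata.scored_train;        /* placeholder data set name  */
      plot r_target_d * p_target_d;
   run;
   quit;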

In general, the interval target model in a two-stage model requires careful construction and attention to assumptions. Failure to do so will diminish predictive performance.


Slide 25: Making Positive Predictions. Either transform the target, modeling E(log(Y) | X), or define an appropriate link, modeling log(E(Y | X)).

In a binary target model, predictions were restricted to fall between zero and one. Often interval target models are also subject to range restrictions (for example, positive numbers). You can induce a limited range for the target by specifying a transformation function or a link function.

The most common approach is to apply the transformation function to the target variable, such as taking the log, before modeling occurs. Modeling then proceeds as usual, but on the transformed target. There are some unexpected complications to this approach. Ultimately predictions will be made using the target variable on its original measurement scale, not on the transformed scale. Obtaining the expected value of the target from the expected value of the transformed target is more complicated than simply applying the inverse transformation to the model predictions.

Another (non-equivalent) approach involves matching the domain of the expected value of the target to that of the predictive model. This is the approach taken in logistic regression. In generalized linear models, this is accomplished by specifying an appropriate link function.

The best approach depends, of course, on the data. Defining an appropriate link will also entail specification of a reasonable error distribution for the target variable.
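To make the distinction concrete, the transformation approach models E(log(Y) | X) with an ordinary regression on the logged target, while the link approach models log(E(Y | X)) through a generalized linear model. The following sketch contrasts the two; the data set names are placeholders, and the two inputs are simply examples from the donor data.

   /* Transformation approach: model the log of the target directly. */
   data work.train_log;
      set work.train;
      log_d = log(target_d);
   run;
   proc reg data=work.train_log;
      model log_d = last_gift_amt recent_avg_gift_amt;
   run;
   quit;

   /* Link approach: keep the target on its original scale and let a log  */
   /* link restrict predictions to positive values (gamma error assumed). */
   proc genmod data=work.train;
      model target_d = last_gift_amt recent_avg_gift_amt / dist=gamma link=log;
   run;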


Slide 26: Error Distribution Requirements. Correct range, correct skewness, correct heteroscedasticity.

Ultimately, every predictive model distills to a formula relating the expected value of the target (possibly transformed) to known values of the model inputs; the choice of error (residual) distribution for the target does not explicitly appear. This choice, however, influences everything from input selection to model estimation. Correctly specifying the error distribution is vital for good prediction.

The error distribution should account for relationships that may exist between the expected value and variance of the target variable. It should allow for skewness in the model residuals. Not all target variables have symmetric residuals. Finally, the error distribution should be consistent with the numeric range of the phenomenon being modeled.

For the case where the target is a monetary amount, there are several common choices for the error distribution.


Slide 27: Specifying the Correct Error Distribution

   Distribution          Variance
   Normal (truncated)    constant*
   Poisson               ∝ E(Y)
   Gamma                 ∝ (E(Y))²
   Lognormal             ∝ (E(Y))²

The most obvious candidate for error distribution is the normal distribution. Unfortunately, the normal distribution has a range from negative to positive infinity whereas the target variable may have a more restricted range. One way to combat this discrepancy is to truncate the values of the normal distribution. This is the approach taken in limited dependence models such as the Tobit model [Amemiya, 1984] or the Heckman model [Heckman, 1978]. The truncation is incorporated into the likelihood function for standard linear regression and model predictions are adjusted by an amount proportional to the mass in the normal distribution below the threshold.

For the donation model considered in this course, and for many monetary-related models, the error variance increases with the target’s expected value. While the variance of the truncated normal distribution initially increases with the target mean (due to truncation effects), it ultimately becomes constant. If residual plots indicate increasing error variability, better modeling results may be obtained with a Poisson, gamma, or lognormal error distribution.

Limited dependence models with truncated normal error distributions are available in SAS/ETS software’s QLIM procedure.
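A minimal sketch of such a Tobit-style fit with the QLIM procedure appears below. The data set name and inputs are placeholders, and the statement options should be checked against the SAS/ETS documentation for your release.

   /* Hedged sketch: left-censored (Tobit) regression, censoring at zero. */
   proc qlim data=work.train;
      model target_d = last_gift_amt recent_avg_gift_amt;
      endogenous target_d ~ censored(lb=0);
   run;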



While traditionally associated with positive integer targets, the Poisson distribution is also used for interval targets, especially when the variance increases in proportion to expected value.

One disadvantage of the Poisson distribution, however, relates to its skewness properties. The distribution is skewed for small expected values, but its coefficient of skewness (1/√λ for a Poisson with mean λ) decreases to 0, that is, toward symmetry, as the expected value grows.

Poisson error distributions are limited to the Neural Network node.


Both the gamma and lognormal distributions are appropriate for interval targets whose variance increases in proportion to the square of expected value. Unlike the Poisson distribution, their skewness is independent of the expected target value.


The gamma distribution is limited to the Neural Network node. The lognormal distribution may be used with any modeling tool by simply taking the log of the target and using a normal error distribution with an identity link.

Slide 30: Specifying the Correct Error Distribution (plot comparing the tail behavior of the four error distributions).

One additional consideration for error distribution is the tail behavior. While all the distributions in the plot have the same expected value and variance, they have increasingly heavy tails. A few extreme outliers may indicate a lognormal distribution, whereas the absence of such may imply a gamma or less extreme distribution.

Slide 31: Two-Stage Modeling Challenges. Model assessment: use validation MSE. Interval model specification: log the target, or specify a link and error distribution. Model coupling: E(D | x, p̂).

In summary, care should be taken in specifying the interval target model. A link should be chosen to match the model output to the range of the target. An error distribution should be selected to match the variance, skewness, and range of the target.


Specifying the Interval Target Model

The plot of residuals versus predicted target suggests an increasing variance with respect to increasing expected target value. As an initial approach to modeling this heteroscedasticity, a log transformation will be applied to TARGET_D.

1. Close the Regression: Results window to return to the process flow diagram.

2. Open a space between the Data Set Attributes and TARGET_D Regression nodes.

3. Insert a Transform Variables node between the Data Set Attributes node and the TARGET_D Regression node.

4. Open the Transform Variables node.


5. Right-click on the TARGET_D row and select View Distribution of TARGET_D.

The distribution for donation amount is highly skewed.

6. Right-click again on the TARGET_D row and select Transform → log.

A new variable called TARG_xxx, where xxx are three random alphanumeric characters, is added to the variables list.

7. View the distribution of the newly created variable. It shows much more symmetry.

It remains to be seen if the variance is stable after modeling.

8. Close the Variable Histogram window.

Once again note that the Keep status of TARGET_D has changed to No and the Keep status of TARG_xxx is set to Yes.

9. Set the Keep status of TARGET_D to Yes.

10. Close the Transform Variables node and save the changes.

Now configure the Regression node to use the log-transformed version of the target.

1. Set the STATUS of TARGET_D to don’t use.

2. Set the STATUS of TARG_xxx to use.

3. Close and save the changes to the TARGET_D Regression node.


4. Run the diagram from the TARGET_D Regression node and view the results.

5. Select the Plot tab and plot Residual: TARG_xxx versus Predicted TARG_xxx.

The plot now shows little evidence of heteroscedasticity. Apparently, the lognormal error distribution is well suited for the interval target model.

6. Select the Statistics tab.

The validation MSE is approximately 0.1975, a few orders of magnitude smaller than the previous value. Of course, the two values cannot be compared directly because the target is now on a different measurement scale.

7. Select the Output tab and scroll to the final (selected) model.

The number of terms in the model has exploded, with 22 variables and 29 degrees of freedom. Interpretation of the fitted model is complicated, and a large difference between the training and validation MSE indicates potential overfitting.

It turns out that the overly complex model is something of a fluke.

1. Close the Regression: Results window.

2. Open the TARGET_D Regression node and select the Selection Method tab.


3. Select the Criteria subtab.

4. Uncheck the Defaults checkbox.

5. Type 0.01 in the Significance Level:Entry field.

6. Run the TARGET_D Regression node and view the results.

7. Select the Statistics tab.

The validation MSE differs from the previous value by only about 10^-7. Moreover, there is a much smaller difference between the training and validation errors, which suggests less overfitting.

8. Select the Output tab and scroll to the end.

   The selected model, based on the CHOOSE=VERROR criterion, is the model
   trained in Step 7. It consists of the following effects:

   Intercept  FREQUENCY_STATUS_97NK  LIFETIME_CARD_PROM  LIFETIME_GIFT_AMOUNT
   LIFETIME_GIFT_COUNT  MONTHS_SINCE_ORIGIN  RECENT_AVG_GIFT_AMT
   RECENT_STAR_STATUS

The model has only seven inputs and seven degrees of freedom. Such a minuscule increase in performance fails to justify the inclusion of 15 additional inputs.

With a smaller number of inputs, it is possible to recognize donation trends. For example, the expected donation amount increases with past donation amounts. Higher donor frequency implies a smaller donation amount (but, recalling the TARGET_B model, it also implies larger donation probability).

Analysis of Parameter Estimates

   Parameter                 DF   Estimate   Standard Error   t Value   Pr > |t|
   Intercept                  1     2.7255           0.0380     71.78     <.0001
   FREQUENCY_STATUS_97NK      1    -0.1766          0.00907    -19.48     <.0001
   LIFETIME_CARD_PROM         1     0.0101          0.00255      3.96     <.0001
   LIFETIME_GIFT_AMOUNT       1    0.00175         0.000144     12.22     <.0001
   LIFETIME_GIFT_COUNT        1    -0.0223          0.00202    -11.04     <.0001
   MONTHS_SINCE_ORIGIN        1   -0.00245         0.000478     -5.12     <.0001
   RECENT_AVG_GIFT_AMT        1     0.0200          0.00122     16.35     <.0001
   RECENT_STAR_STATUS         1    -0.0182          0.00398     -4.58     <.0001

Does a better modeling of the error distribution result in increased predicted revenue?


1. Close the TARGET_D Regression node.

2. Open the Two-Stage Assessment node.

You will need to change the macro variable P_AMOUNT because P_TARG_xxx has replaced P_TARGET_D. However, recall that P_TARG_xxx is the expected value of the log of TARGET_D. To get the expected value of TARGET_D itself, you must undo the effects of the log transformation. Intuitively, it seems sufficient to apply the exponential function to P_TARG_xxx and thus obtain the expected value of TARGET_D. Surprisingly, however, the expected value of TARGET_D is not simply the exponential of P_TARG_xxx.

By fitting a standard regression model to the log of an interval target, you are assuming a lognormal error distribution for the target. The expected value of a lognormal random variable Y equals exp(µ + σ²/2), where µ and σ² are the mean and variance of log(Y). Simply exponentiating P_TARG_xxx therefore underestimates the true expected value of TARGET_D by a factor of exp(σ²/2).

To correctly estimate the expected value of TARGET_D you will need to change the definition of P_AMOUNT to exp(P_TARG_xxx + MSE/2) where MSE is the training MSE obtained from the model fit statistics.
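If the adjustment seems counterintuitive, a small simulation makes the point. The seed, sample size, and parameter values below are arbitrary choices for illustration only.

   /* Hedged sketch: for lognormal Y, exp(mean of log Y) understates E(Y),  */
   /* while exp(mu + sigma-squared/2) recovers it.                          */
   data work.lognorm_check;
      mu = 2; sigma2 = 0.19;
      do i = 1 to 100000;
         logy = mu + sqrt(sigma2)*rannor(27513);
         y = exp(logy);
         output;
      end;
   run;

   proc means data=work.lognorm_check noprint;
      var y logy;
      output out=work.check_stats mean(y)=mean_y mean(logy)=mean_logy;
   run;

   data _null_;
      set work.check_stats;
      naive    = exp(mean_logy);           /* systematically too small */
      adjusted = exp(mean_logy + 0.19/2);  /* close to mean_y          */
      put mean_y= naive= adjusted=;
   run;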

3. Type the following change in the Program tab of the Assess Two-Stage Model node:

   %let p_amount= exp(P_TARG_xxx+0.1866/2);

The value 0.1866 is the MSE obtained from the TARGET_D Regression node’s Statistics tab.

4. Run the Assess Two-Stage Model node, view the results, and select the Output tab.

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.STRNP1HA         1627.56                  0.16803

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.SVALOMMC         1514.64                  0.15637


The validation total profit is, disappointingly, about the same as for the untransformed model. The training profit remains considerably higher.

Despite the similar validation profit, a correctly specified model will, on the average, perform better than an improperly specified one. As a validation, score the KDD-Cup data once more.

5. Open the SAS Code node attached to the Score node.

6. Once more, type the following change in the Program tab:

   %let p_amount= exp(P_TARG_xxx+0.1866/2);

7. Close the SAS Code node.

8. Run the diagram from the same SAS Code node, view the results, and select the Output tab.

   Obs   source          total_profit   overall_average_profit
     1   score_results       15261.25                  0.15837

The results indicate a moderate improvement over the standard regression model. Further improvement may be realized by incorporating more flexibility into the regression model.


Slide 32: Two-Stage Modeling Challenges. Model assessment: use validation MSE. Interval model specification: log the target, or specify a link and error distribution. Model coupling: use the output of the binary model as an input to the interval model.

A common practice in two-stage modeling is to use the output of the first stage as input for the second stage. This is thought to reduce bias and (possibly) reduce model complexity.

The first claim, reducing model bias, relates to left censoring that occurs in limited dependence models, such as the Tobit model. As discussed earlier, the Tobit model assumes a truncated normal distribution. Inclusion of a term proportional to the response probability corrects for biases induced by the truncation. When using an error distribution with strictly positive range, such as the Poisson, gamma, or lognormal, no such bias occurs and no correction is required.

However, it still may be useful to include output of the first-stage model in the second-stage model to reduce model complexity. This is accomplished by taking advantage of correlations (positive or negative) between response propensity and response magnitude. The output of the first-stage model may act as a surrogate for many inputs and thus reduce overall degrees of freedom in the second-stage model.


Coupling Models

You will now see whether there is a benefit to coupling the first- and second-stage models (in this particular case).

1. Open the Data Set Attributes node between the TARGET_B Regression node and the Transform Variables node.

2. Select the training data EMDATA.STRNxxxx.

3. Select the Variables tab.

4. Set the new model role of P_TARGET_B1 to input.

5. Close the Data Set Attributes node.

The output of the TARGET_B model is now available as an input in the TARGET_D model.

1. Run the diagram from the TARGET_D Regression node and view the results.


2. Select the Output tab and scroll to the Summary of Stepwise Procedure near the bottom of the output listing.

Summary of Stepwise Procedure

   Step  Effect Entered               DF  Number In         F   Prob>F
      1  RECENT_AVG_GIFT_AMT           1          1    1579.8   <.0001
      2  FREQUENCY_STATUS_97NK         1          2     409.5   <.0001
      3  MONTHS_SINCE_ORIGIN           1          3   74.0351   <.0001
      4  LIFETIME_GIFT_AMOUNT          1          4   53.3051   <.0001
      5  LIFETIME_GIFT_COUNT           1          5     113.8   <.0001
      6  RECENT_STAR_STATUS            1          6   18.4633   <.0001
      7  LIFETIME_CARD_PROM            1          7   15.6846   <.0001
      8  SES                           4          8    5.5112   0.0002
      9  LAST_GIFT_AMT                 1          9   10.8375   0.0010
     10  FILE_AVG_GIFT                 1         10   25.5115   <.0001
     11  LIFETIME_GIFT_RANGE           1         11   20.6364   <.0001
     12  RECENT_RESPONSE_COUNT         1         12   14.8391   0.0001
     13  LIFETIME_PROM                 1         13   14.2014   0.0002
     14  RECENT_AVG_CARD_GIFT_AMT      1         14   11.1868   0.0008
     15  M_WEALTH                      1         15   10.0134   0.0016

The summary shows that P_TARGET_B1 was never considered for inclusion in the model.

It is possible to force the Regression node to consider an input in modeling.

1. Close the Regression: Results window.

2. Open the TARGET_D Regression node.

3. Select Tools → Model Ordering…. The Model Ordering window opens.

4. Scroll to the bottom of the variable list and select P_TARGET_B1.

5. Select the button. This moves P_TARGET_B1 to the top of the input list.

6. Select OK. The Model Ordering window closes.

7. Select the Selection Method tab and the Criteria subtab.


8. Set Number of Variables: Start to 1.

9. Close and save changes to the TARGET_D Regression node.

The Regression node will start with a model that includes P_TARGET_B1 and add additional inputs as usual.

1. Run the diagram from the TARGET_D Regression node and view the results.

2. Select the Output tab and scroll to the Summary of Stepwise Procedure listing.

   Summary of Stepwise Procedure

   Step  Effect Entered             Effect Removed   DF  Number In         F   Prob>F
      1  RECENT_AVG_GIFT_AMT                           1          2    1333.1   <.0001
      2  FREQUENCY_STATUS_97NK                         1          3     160.4   <.0001
      3  MONTHS_SINCE_ORIGIN                           1          4   69.9055   <.0001
      4                             P_TARGET_B1        1          3    1.7407   0.1872
      5  LIFETIME_GIFT_AMOUNT                          1          4   53.3051   <.0001
      6  LIFETIME_GIFT_COUNT                           1          5     113.8   <.0001
      7  RECENT_STAR_STATUS                            1          6   18.4633   <.0001
      8  LIFETIME_CARD_PROM                            1          7   15.6846   <.0001
      9  SES                                           4          8    5.5112   0.0002
     10  LAST_GIFT_AMT                                 1          9   10.8375   0.0010
     11  FILE_AVG_GIFT                                 1         10   25.5115   <.0001
     12  LIFETIME_GIFT_RANGE                           1         11   20.6364   <.0001
     13  RECENT_RESPONSE_COUNT                         1         12   14.8391   0.0001
     14  LIFETIME_PROM                                 1         13   14.2014   0.0002
     15  RECENT_AVG_CARD_GIFT_AMT                      1         14   11.1868   0.0008
     16  M_WEALTH                                      1         15   10.0134   0.0016

The input P_TARGET_B1 is forced into the model before input selection commences. By the fourth step, it is removed due to lack of predictive power.

Any linear association between P_TARGET_B1 and the log of TARGET_D is adequately modeled by the inputs RECENT_AVG_GIFT_AMT, FREQUENCY_STATUS_97NK, and MONTHS_SINCE_ORIGIN.

By using the FORCE option rather than the START option in the Criteria subtab of the Selection Method tab, you can actually force P_TARGET_B1 into the model. The result, in this case, is a more complex model with a lower validation overall average profit.

In general, you should try including the predicted binary target as an input for the interval target model. If the input selection process fails to include the input, it is not necessary to force the input into the model to get valid results.


Using Regression Trees

Thus far most attention has focused on standard linear regression models. These models make strong assumptions about the relationship between the inputs and the target(s).

You will now explore some nonlinear and nonadditive approaches to prediction. Attention focuses on the interval target model because previous experience suggests there will be little improvement with a non-regression approach for the binary target model.

1. Disconnect and move aside the Input Data Source, Score, and Code nodes.

2. Connect a Tree node to the Transform Variables node.

3. Open the Tree node and select TARGET_D as the target variable.

4. Select the Score tab.

5. Select Process or Score: Training, Validation, and Test.

6. Run the Tree node and view the results.


The tree model has a slightly smaller validation MSE than the initial regression model.

7. View the Tree diagram.

Splits occur on the suspected variables relating to the amount received in recent donations. It should also be noted that the right leaves, corresponding to the larger split values, almost always contain fewer observations than the left leaves. This is not a coincidence.

1. Close the Results: Tree window.

2. Connect an Insight node to the Tree node.


3. Run the Insight node and view the results. An Insight table with a 2000-case sample of the training data opens.

4. Select (in order) the R_TARGET_D column and the P_TARGET_D column.

5. Select Analyze → Scatter Plot (Y X). A scatter plot of the TARGET_D residuals versus the predicted TARGET_D values opens.

The plot shows the same flaring pattern seen initially in the TARGET_D Regression node.

Heteroscedasticity is a problem for trees as well as regressions. With heteroscedastic residuals, the split worth values become distorted. This distortion results in suboptimal splits, which usually isolate a handful of high target value cases, if they find any significant splits at all.

To overcome this problem, you should build the tree using the log of TARGET_D.

1. Close the Insight window.

2. Delete the Insight node.

3. Open the Tree node.

4. Set the status of TARGET_D to don’t use.

5. Set the status of TARG_xxx to use.

6. Run the Tree node and view the results.


The validation MSE, once more for the log of the target, is much smaller than the values observed on the TARGET_D Regression model. Inspection of the Tree diagram reveals better-balanced splits with good discrimination in the expected target values. The splits are restricted to inputs related to prior gift amounts: RECENT_AVG_GIFT_AMT, LAST_GIFT_AMT, and LIFETIME_AVG_GIFT_AMT.

Assess the effect on overall average profit realized by using the Tree model.

1. Connect an Assess Two-Stage Model node to the Tree node.

2. Open the Assess Two-Stage Model node.

3. Make the following changes to the macro variables defined in the Program tab:

   %let adjust_probs= no;
   %let p_amount= exp(P_TARG_xxx+0.1359/2);


4. Run the Assess Two-Stage Model node and view the results.

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.STRNP1HA         1594.99                  0.16467

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.SVALOMMC         1531.33                  0.15810

The validation overall average profit is slightly higher for this model than for the previous models. This is true despite the fact that trees are notoriously bad performers when most of the inputs are interval variables. Using the inputs selected by the tree in a nonlinear parametric model (for example, a neural network) should yield even better results.


Using Basic Neural Networks

The tree suggests significant nonlinear and nonadditive associations between the inputs and the interval target. Because the inputs selected by the tree have an interval measurement scale, it should be possible to obtain marginally better predictions using a parametric modeling tool like a neural network.

1. Move the Assess Two-Stage Model node to make room for an additional node.

2. Connect a Neural Network node between the Tree and Assess Two-Stage Model node.

3. Open the Neural Network node.


Most of the inputs are rejected and set to don’t use. Surprisingly, all of the targets are set to use. Unlike the other modeling tools in Enterprise Miner, the neural network node allows multiple targets.

You will utilize this fact later. For now, turn off all the targets except the log of the TARGET_D.

4. Set the status to don’t use for all targets except for the target corresponding to the log of TARGET_D.

5. Scroll to the bottom of the variable list.

6. Set the status to don’t use for the variable V_TARG_xxx. This variable, the tree predicted target value (using the validation data), is mistakenly listed as an input to the model.

7. Select the General tab.

8. Select Average Error for the model selection criterion.

9. Select the Basic tab.

10. Select the button next to Multilayer Perceptron.

The Set Network Architecture window opens. This window enables you to specify the number of hidden units in a neural network.


You should always use this window to explicitly specify the number of hidden units in a neural network model. Relying on the automatic architecture specifications may result in a neural network with an unexpected number of hidden units.

11. Select Set number… in the Hidden neurons pop-up field.

12. Type 5 in the Set number… field.

13. Select OK. The Set Network Architecture window closes.

14. Select the Output tab.

15. Select Process or Score: Training, Validation, and Test.

16. Run the Neural Network node and view the results.

As expected, the neural network model’s validation MSE is slightly smaller than the tree model’s validation MSE.

Now check the combined model profitability.

1. Open the Assess Two-Stage Model node and make the following changes to the Program tab:

   %let adjust_probs= no;
   %let p_amount= exp(P_TARG_xxx+0.1358/2);


2. Run the Assess Two-Stage Model node and view the results.

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.STRNP1HA         1585.73                  0.16371

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.SVALOMMC         1572.99                  0.16240

The validation overall average profit once more shows an increase, this time versus the tree model.


Using Advanced Neural Networks

The Advanced tab of the Neural Network node allows you to use error distributions other than the normal or lognormal. This enables you to fine-tune the error distribution to exactly match the modeling scenario.

1. Open the Advanced Neural Network node.

2. Set the status of TARGET_D to use.

3. Set the status of TARG_xxx (the log of TARGET_D) to don’t use.

By using TARGET_D as the target (instead of its log transformation), issues of heteroscedasticity re-emerge. These issues can be literally defined away by changing the error distribution.

1. Select the General tab.

2. Select Advanced user interface. The Advanced tab replaces the Basic tab for configuring the network.

3. Select the Advanced tab. A schematic diagram of the network appears.


The cyan node labeled interval represents all the interval inputs. Similar nodes would appear for any nominal and ordinal inputs with Status use. The blue node labeled 5 represents five hidden units. The yellow node labeled T_INTRVL represents the interval target, TARGET_D. The arrows represent (full) connection between each of the network layers.

Good predictive results were obtained using the lognormal error distribution. Similar results may be achieved with a gamma error distribution. Like the lognormal, the gamma distribution's variance increases as the square of the expected value. But it differs from the lognormal distribution in two ways:
• Given identical expected values and variances, the lognormal distribution has heavier tails.
• The gamma distribution will weight training data cases differently than the lognormal distribution.

The first of these differences was illustrated earlier. The second relates to equivalence between maximum likelihood estimation and iteratively reweighted least squares estimation.

In this equivalence, maximum likelihood parameter estimates can be calculated by weighted least squares, where each case's weight is inversely proportional to the distribution variance. For the gamma distribution, the weight is therefore inversely proportional to the square of the expected target value. For the normal distribution (applied to the log of the target variable), the weight is a constant. So, roughly speaking, the gamma error distribution concentrates more model degrees of freedom on cases with a small expected target value, whereas the lognormal error distribution (obtained by taking the log of the target) concentrates relatively more model degrees of freedom on cases with a large expected target value.

For additional neural network theory, see the Neural Network Modeling course.

To specify a gamma error distribution, you must
• specify the error distribution and appropriate link function
• change the objective function to maximum likelihood
• change the default optimization algorithm.


First, specify the new error distribution.

1. Double-click the target, T_INTRVL. The Node Properties window opens.

2. Select the Target tab.

The target tab controls a variety of factors related to the target variables, including error distribution, activation (link) function, and node input combination function.

3. Select Exponential for the activation function.

4. Select Gamma for the error function.


5. Select OK. The Node Properties window closes.

When the error is changed to gamma, the computer beeps and a message appears in the lower-left corner of the SAS window.

The warning suggests that the modeling objective function will be changed to either deviance or maximum likelihood. Using a deviance objective function sets the shape parameter of the gamma error distribution equal to one. Using a maximum likelihood objective function allows the shape parameter to be estimated from the training data.

To give more flexibility to the error distribution, specify a maximum likelihood objective function.

1. Select the Optimization subtab.

2. Select Maximum Likelihood for the objective function.

Finally, change the training algorithm from the default (Levenberg-Marquardt) to a less greedy first order method.

1. Select the Train subtab.

2. Uncheck Default settings.

3. Select Double Dogleg for the training technique.


You are ready to fit the model.

1. Close and save the changes to the Neural Network node.

2. Run the Neural Network node and view the results.

The Tables tab reports the MSE once more on the original scale of TARGET_D. The values are smaller than what was observed for both the original regression and the tree built with TARGET_D as the target.

1. Close the Results-Neural Network window.

2. Open the Assess Two-Stage Model node and make the following changes to the Program tab:

   %let p_amount= P_TARGET_D;

3. Run the Assess Two-Stage Model node and view the results.

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.STRNP1HA         1566.91                  0.16177

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.SVALOMMC         1565.95                  0.16167

The validation overall average profit remains virtually unchanged from the log TARGET_D model.


1.5 Joint Predictive Models

When you defined the interval target neural networks, it was noted that the Neural Network node allows for multiple targets. In this section, you exploit this capability to define a joint predictive model, which simultaneously predicts the expected values of the binary and interval targets.

Unfortunately, defining a joint predictive model using the Neural Network node is somewhat cumbersome, especially if the model contains more than a handful of inputs. An optional demonstration simplifies the task by running the NEURAL procedure directly.


Building a Joint Predictive Model

Unlike other modeling nodes in Enterprise Miner, the Neural Network node does not require a single target. You will now use this capability to simultaneously predict values for TARGET_B and TARGET_D.

1. Connect a Neural Network node to the tree. Label the node Joint Neural Network.

2. Open the Joint Neural Network node.

3. Set the status of all targets other than TARGET_B and TARG_xxx, the log of TARGET_D, to don’t use.

4. Set the status of V_TARG_xxx to don’t use.

You must reactivate the inputs associated with prediction of the binary target, TARGET_B, and access the advanced user interface.

1. Set the status of the following inputs to use:
   • FREQUENCY_STATUS_97NK
   • INCOME_GROUP
   • MEDIAN_HOME_VALUE
   • MONTHS_SINCE_FIRST_GIFT
   • MONTHS_SINCE_LAST_GIFT
   • PEP_STAR
   • RECENT_CARD_RESPONSE_PROP

2. Select the General tab.

3. Select Average Error for the model selection criterion.


4. Select Advanced user interface. The Advanced tab replaces the Basic tab for configuring the network.

5. Select the Advanced tab. A schematic diagram of the network appears.

The diagram shows interval and nominal input nodes, as well as interval and binary target nodes. To set up the joint prediction model, you need to
• separate the variables used to predict TARGET_B from the variables used to predict TARGET_D
• define separate hidden layers for TARGET_B and TARGET_D
• connect the network elements.

First, separate the input variables for the two targets.

1. Open the INTERVAL input node. The Node properties window opens.


2. Select the Variables tab.

The inputs LAST_GIFT_AMT, LIFETIME_AVG_GIFT_AMT, and RECENT_AVG_GIFT_AMT are used exclusively for predicting TARGET_D. You should transfer these inputs into a separate node.

3. Select LAST_GIFT_AMT, LIFETIME_AVG_GIFT_AMT, and RECENT_AVG_GIFT_AMT.

4. Select Transfer → Single new node. A window opens to confirm the transfer.

5. Select Yes. A window opens to acknowledge completion of the transfer.

6. Select OK to close the transfer acknowledgement window.


7. Select OK to close the Node Properties window. The network diagram is updated to show the new input node.

Although the new node is labeled LAST_GIFT_AMT, it really contains all three transferred inputs.

Now create a new hidden layer and connect the network elements.

1. Right-click in the network diagram workspace and select Add hidden layer from the pop-up menu. A hidden node with three hidden units is added to the diagram.


2. Delete the connection between the node labeled LAST_GIFT_AMT and the hidden layer node.

3. Delete the connection between the hidden layer node and the target node T_INTRVL.

4. Connect the LAST_GIFT_AMT node to the new hidden layer node.


5. Connect the new hidden layer node to the interval target node, T_INTRVL.

As presently defined, the interval and binary target models are uncoupled. Predictions for TARGET_B and the log of TARGET_D will be made simultaneously, but independently. It is possible to couple the models by connecting the binary target’s hidden layer node to either the interval target’s hidden layer node or the interval target node itself.

6. Connect the binary target hidden layer node to the interval target hidden layer node.

The network architecture definition is now complete.

As in the previously defined neural network, the default training technique should be changed to a less greedy algorithm.

1. Select the Train subtab.

2. Uncheck Default settings.


3. Select Double Dogleg for the training technique.

Finally, as always, add the predicted values to the training and validation data.

1. Select the Output tab.

2. Select Process or Score: Training, Validation, and Test.

You are ready to run the Joint Neural Network node.

1. Close the Joint Neural Network node and save the changes.

2. Run the Joint Neural Network node and view the results.

3. Summary statistics are presented for both targets. First for TARGET_B…

…and then for log of TARGET_D.


The validation MSE for TARGET_B is lower than the observed value for the TARGET_B regression model, whereas the MSE for TARG_xxx is approximately the same as it was for the TARG_xxx Neural Network model.

1. Close the Results-Neural Network window.

2. Connect an Assess Two-Stage Model node to the Joint Neural Network node.

3. Open the Assess Two-Stage Model node and make the following changes to the Program tab:

   %let p_amount= exp(P_TARG_xxx+0.1380/2);

Note that the TARGET_B probability estimates from the Joint Neural Network model are not adjusted for priors. Thus, it is unnecessary to change the default value (yes) for the macro variable ADJUST_PROBS.

4. Run the Assess Two-Stage Model node and view the results.

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.STRNP1HA         1555.90                  0.16063

   Obs   source             total_profit   overall_average_profit
     1   EMDATA.SVALOMMC         1678.45                  0.17329


The validation overall average profit is far larger than any other model thus far. Strangely, it is much larger than the training overall average profit. This is true despite efforts to balance the training and validation data on both TARGET_B and TARGET_D.

One might expect such a disparity would result in poor generalization. Investigate this by scoring the KDD-Cup competition data.

1. Connect the previously set-aside Score node to the Joint Neural Network model.

2. Open the SAS Code node (connected to the Score node) and make the following change:

   %let p_amount= exp(P_TARG_xxx+0.1380/2);

3. Close the SAS Code node.

4. Run the diagram from the same SAS Code node, view the results, and select the Output tab.

   Obs   source          total_profit   overall_average_profit
     1   score_results       16066.63                  0.16672

The results are spectacular, with nearly a 10% improvement over the winning KDD-Cup model.

In truth, the results are somewhat misleading. Neural network models that are fit using stopped training methods are unstable in the results they produce: a small change in the initialization of the model fitting would very likely produce a smaller total profit.

To study the variability of the Joint Neural Network predictions, 100 models were fit using the technique of this section. The mean total profit on the KDD-Cup data over the 100 runs was $15,620. However, restricting attention to the top five models (ranked by validation overall average profit) increased the mean to $15,825, which is still more than $1,000 larger than the KDD-Cup best.


This suggests an algorithm for fitting JNN models: try multiple initializations and take the model with the best validation error. Unfortunately, manually configuring multiple initializations would be extremely laborious. If it could be done in code, however, such a plan becomes feasible. The next section illustrates how to configure a joint neural network using SAS code.


Coding a Joint Predictive Model (Optional)

Using the graphical user interface to define a joint neural network becomes extremely cumbersome when there are more than a handful of inputs or some of the inputs are shared between binary and interval target models. First, you must manually activate the appropriate inputs for each of the two component models. Then, for each measurement scale (interval, ordinal, and nominal), you must extract inputs shared by both targets, followed by inputs unique to one target. After being extracted, the input nodes must be correctly connected to the hidden and target nodes. Any change in the order of these operations affects the model parameter initialization and leads to a different model.

Sometimes it is easier to simply code the network definition using the NEURAL procedure directly. This demonstration illustrates how to accomplish this coding.

The joint neural network modeling code can be placed in a macro program and run with a large number of different initializations. Fitted models can be ranked by validation profit and the best model/initialization combination selected for deployment.
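A skeleton of such a search might look like the following. The %FIT_JNN macro it calls is a hypothetical wrapper around the PROC NEURAL code developed below and is assumed to leave the macro variable VALID_PROFIT set on exit; the seed list is arbitrary.

   /* Hedged sketch: refit the joint network over several seeds and keep */
   /* the seed with the best validation overall average profit.          */
   %macro jnn_seed_search(seeds=11 273 5150 90567 314159);
      %local i seed best_seed best_profit;
      %let best_profit = -1e15;
      %let i = 1;
      %let seed = %scan(&seeds, &i);
      %do %while(%length(&seed) > 0);
         %let random_seed = &seed;
         %fit_jnn;   /* hypothetical: fits the model, sets &valid_profit */
         %if %sysevalf(&valid_profit > &best_profit) %then %do;
            %let best_profit = &valid_profit;
            %let best_seed   = &seed;
         %end;
         %let i = %eval(&i + 1);
         %let seed = %scan(&seeds, &i);
      %end;
      %put NOTE: Best seed &best_seed (validation profit &best_profit);
   %mend jnn_seed_search;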

To take advantage of existing metadata definitions and input preparations (such as missing value replacement), the joint neural code will be developed in an Enterprise Miner SAS Code node.

1. Connect a SAS Code node to the Transform Variables node. Label the node JNN Code.

2. Open the JNN Code node.

3. Select the Exports tab.

4. Uncheck Pass imported data sets to successors.

5. Add training and validation export data sets.


6. Select File → Import File → Joint Neural Network Procedure.sas.

The imported program takes the place of several Enterprise Miner nodes. First, it fits a joint neural network model similar to the Joint Neural Network node. Second, it calculates assessment information similar to the Assess Two-Stage Model node. Finally, it scores and assesses the competition data, similar to the Score nodes.

The last function is added for the convenience of this demonstration and requires a pre-imputed version of the competition data. In practice, the scoring of the final data would be handled outside the JNN Code node. Passing the model scoring code to subsequent nodes requires additional changes to the SAS Code node. These changes are shown at the end of the demonstration.

To begin, the imported program loads a SAS source code entry containing the macro used throughout the course for assessing two-stage models (see the Assess Two-Stage Model node for the actual code listing).

   filename assess catalog "crssamp.assess";
   %include assess("assess2stage.source");

Then there is a variety of macro variables that define model roles:
• p_amount defines the variable (or function) for calculating the expected value of TARGET_D.
• b_target and d_target respectively define the binary and interval targets.
• b_intrvl and d_intrvl respectively define the interval inputs for the binary and interval targets.
• b_class and d_class respectively define the class (binary, ordinal, and nominal) inputs for the binary and interval targets.

   *** PREDICTED TARGET *********************;
   %let p_amount= P_TARGET_D;

   *** TARGETS ******************************;
   %let b_target= TARGET_B;
   %let d_target= TARGET_D;

   *** INPUTS *******************************;
   %let b_intrvl= FREQUENCY_STATUS_97NK
                  INCOME_GROUP
                  MEDIAN_HOME_VALUE
                  MONTHS_SINCE_FIRST_GIFT
                  MONTHS_SINCE_LAST_GIFT
                  RECENT_CARD_RESPONSE_PROP;
   %let d_intrvl= RECENT_AVG_GIFT_AMT
                  LAST_GIFT_AMT
                  LIFETIME_AVG_GIFT_AMT;
   %let b_class=  PEP_STAR;
   %let d_class=  ;

The next macro variables define network settings:
• hidden_b and hidden_d respectively define the number of hidden units in the binary and interval target components.
• d_error defines the error distribution for the interval target component.
• d_act defines the activation function for the interval target component.
• objective defines the objective function.
• random_seed defines the random number seed used for parameter initialization.

   *** NETWORK *****************************;
   %let hidden_d = 3;
   %let hidden_b = 3;
   %let d_error = gamma;
   %let d_act = exponential;
   %let objective = likelihood;
   %let random_seed = 90567;

The network is configured to use an exponential activation function with a gamma distribution for the target error distribution. This is similar to the network used in the earlier demonstration “Using Advanced Neural Networks.”

The DMDB procedure creates a modeling-ready data set for the NEURAL procedure. DMDBCAT= names a catalog containing metadata information about the variables in the DMDB. The NEURAL procedure requires both the DMDB data set and the catalog for modeling.

   proc dmdb data=&_train out=dmdbtrain dmdbcat=dmdbtraincat;
      var &b_intrvl &d_intrvl &d_target;
      class &b_class &d_class &b_target;
   run;

The next statement suppresses the output from the NEURAL procedure. The output, network settings and parameter estimates, is of limited interest in this demonstration.

   ods listing close;

The NEURAL procedure is called. The options define the DMDB data set and catalog, the validation data (used for early stopping), and the random seed (used for initialization).

   proc neural data=dmdbtrain dmdbcat=dmdbtraincat
               validata=&_valid random=&random_seed;


The INPUT, HIDDEN, and TARGET statements define groups of variables within the neural network. Each group can then be referred to later by the name supplied in the ID argument.

For TARGET nodes, the error distribution (ERROR=) and activation function (ACT=) can be supplied.

      *** input nodes;
      input &d_intrvl / level=interval id=d_intrvl_inputs;
      * input &d_class / level=nominal id=d_class_inputs;
      input &b_intrvl / level=interval id=b_intrvl_inputs;
      input &b_class / level=nominal id=b_class_inputs;

      *** hidden nodes;
      hidden &hidden_d / id=hidden_d;
      hidden &hidden_b / id=hidden_b;

      *** target nodes;
      target &d_target / level=interval error=&d_error act=&d_act id=d_target;
      target &b_target / level=nominal id=b_target;

Because there are no class inputs for the interval target component, the definition for input node d_class_inputs is commented out.

The next NEURAL procedure statements define how the previously defined nodes are connected.

      *** connections;
      connect b_intrvl_inputs hidden_b;
      connect b_class_inputs hidden_b;
      connect d_intrvl_inputs hidden_d;
      * connect d_class_inputs hidden_d;
      connect hidden_b hidden_d;
      connect hidden_b b_target;
      connect hidden_d d_target;

The network is saved to a network definition file. This file will be used later for final parameter estimates.

      save network=work.jnn.network;

The NETOPTIONS statement defines the objective function.

      netoptions object=&objective;


With the network and training parameters defined, the neural network is trained. The OUTFIT option combined with the ESTITER option defines a data set with iteration-by-iteration estimates of model fit. This data set is used later to find the iteration with minimum validation error. The OUTEST option defines a data set with iteration-by-iteration values for the parameter estimates. The TECHNIQUE option defines the training technique.

      train outfit=outfit_neural(where=(_name_="OVERALL")) estiter=1
            outest=nnoutest technique=dbldog;
   run;

The short SQL step creates a macro variable called ITER_TUNE that contains the iteration with the smallest validation MSE.

   proc sql noprint;
      select _iter_ into :iter_tune
         from outfit_neural
         having _vase_=min(_vase_);
   quit;

A second invocation of the NEURAL procedure is used to score the training, validation, and test data using the neural network model with smallest validation error. The NETWORK option reads the network definition file saved in the first invocation of the NEURAL procedure.

   proc neural data=dmdbtrain dmdbcat=dmdbtraincat
               validata=&_valid network=work.jnn.network;

Parameter estimates are read from the OUTEST file corresponding to the iteration with minimum validation MSE.

      initial inest=nnoutest(where=(_iter_=%scan(&iter_tune,1)));

The network is "trained" for one iteration (that is, the network takes the values of the parameters read in by the INITIAL statement above).

      train maxiter=1;

The fitted neural network model is used to score the training, validation, and competition (score) data sets. Fit statistics are saved for the training and validation data.

      score data=&_train out=&_tra outfit=trainfit role=TRAIN;
      score data=&_valid out=&_val outfit=validfit role=VALID;
      score data=crssamp.imputed_score out=score;

The CODE statement saves the network scoring code to a SAS catalog source entry. With additional configuration to the SAS Code node, this source entry can be incorporated into the scoring code assembled by the Score node.

      code metabase=emdata.jnn.network.source;
   run;

   ods listing;

The remaining code is devoted to displaying the modeling results in an easy-to-read format.


First, the training and validation fit statistic data sets are combined. Four variables are created containing the name of the fit statistic and the values for the training, validation, and score data.

   data result;
      length STATISTIC $ 36;
      format TRAIN 12.4 VALID 12.4 SCORE 12.3;
      merge trainfit validfit;
      STATISTIC = "MSE " || compress(_name_);
      TRAIN = _MSE_;
      VALID = _VMSE_;
      SCORE = .;
      if _name_~='OVERALL';
      keep STATISTIC TRAIN VALID SCORE;
   run;

Profit scores for the training and validation data are calculated using the TWO_STAGE_ASSESS macro. The resulting profit data is reformatted to conform to the standard established above.

   %let adjust_profits= yes;

   %two_stage_assess(&_tra);
   data train_profit;
      length STATISTIC $ 36;
      set profit;
      STATISTIC = "TOTAL PROFIT";
      TRAIN = total_profit;
      output;
      STATISTIC = "OVERALL AVERAGE PROFIT";
      TRAIN = overall_average_profit;
      output;
      keep STATISTIC TRAIN;
   run;

   %two_stage_assess(&_val);
   data valid_profit;
      length STATISTIC $ 36;
      set profit;
      STATISTIC = "TOTAL PROFIT";
      VALID = total_profit;
      output;
      STATISTIC = "OVERALL AVERAGE PROFIT";
      VALID = overall_average_profit;
      output;
      keep STATISTIC VALID;
   run;


The profit calculation is repeated for the competition score data. Unlike the training and validation data, the competition data is not oversampled, so the profit calculations should not be adjusted.

   %let adjust_profits= no;

   %two_stage_assess(score);
   data score_profit;
      length STATISTIC $ 36;
      format score 12.4;
      set profit;
      STATISTIC = "TOTAL PROFIT";
      SCORE = total_profit;
      output;
      STATISTIC = "OVERALL AVERAGE PROFIT";
      SCORE = overall_average_profit;
      output;
      keep STATISTIC SCORE;
   run;

Finally, all the assessment data is brought together into a single data set and printed to the Output window.

   data profit_merge;
      merge train_profit valid_profit score_profit;
   run;

   proc append base=result data=profit_merge force;
   run;

   proc print data=result;
   run;

Run the JNN Code node, view the results, and select the Output tab. A table summarizing the performance of the model is presented.

   Obs  STATISTIC                   TRAIN       VALID      SCORE
     1  MSE TARGET_D              72.1772     75.2095          .
     2  MSE TARGET_B               0.1825      0.1820          .
     3  TOTAL PROFIT            1561.3913   1523.4093  15004.190
     4  OVERALL AVERAGE PROFIT     0.1612      0.1573      0.156

The performance of this model is disappointing compared to the previous joint neural network. Much of the poor performance can be traced to the disparate scales of the interval and binary targets.

To fit the model, the NEURAL procedure must optimize the joint likelihood of the two target variables. Roughly speaking, the target variable with the larger MSE dominates this optimization. Part of the success of the earlier joint neural network for the log of TARGET_D can be attributed to the fact that the MSEs of both targets were comparable; that happy coincidence yielded an excellent model. Here, the MSEs differ by almost two orders of magnitude. If you were to rescale the interval target, dividing it by a factor between 10 and 100, you would, on average, obtain better results.
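As a sketch of that idea (the data set names below are assumptions, not part of the course code), you could divide the interval target by 10 before building the DMDB and retraining, and then multiply the resulting prediction back to the dollar scale after scoring:

   /* Hedged sketch: dividing TARGET_D by 10 shrinks its MSE by a factor of
      about 100, putting it on a scale comparable to the binary target. */
   data work.rescaled_train;
      set crssamp.imputed_train;     /* assumed name of the imputed training data */
      target_d = target_d / 10;
   run;

   /* ...rebuild the DMDB and rerun PROC NEURAL against work.rescaled_train... */

   /* After scoring, restore the original dollar scale of the prediction. */
   data work.rescored;
      set work.scored;               /* assumed name of the scored data set */
      p_target_d = 10 * p_target_d;
   run;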


Creating a Joint Neural Network Modeling Tool

Presently, the JNN Code node lacks one feature required to make it a functioning Enterprise Miner modeling tool: the ability to export score code to the Score node. The following example incorporates this functionality into the JNN Code node.

1. Open the JNN Code node.

2. Select Score – Score the new data (client or server) from the pop-up menu.

3. Select File Import File Joint Neural Network Scoring Code.sas. A two-line program is read into the Scoring code window.

   filename jnnscore catalog "emdata.jnn";
   %include jnnscore("network.source");

The program adds the source file created by the NEURAL procedure to the scoring code created by the Score node.


4. Check the Enabled checkbox.

5. Close the JNN Code node.

You can verify the functionality of the JNN Code node by connecting the Score node to the JNN Code node and running the diagram from the Score node.


Chapter 2 Explaining a Two-Stage Model

2.1 Types of Explanations....................................................................................................2-3

2.2 Explaining with Trees.....................................................................................................2-4

2.3 Explaining by Example ..................................................................................................2-9

2.4 Creating a Surrogate Model.........................................................................................2-12


2.1 Types of Explanations


Decision Explanations
   • Tree summaries
   • Decision examples
   • Surrogate models

Consumers of modeling results usually do not accept models without some understanding of their predictions. This is the advantage of simple models such as trees and standard regressions: you obtain not only predictions but also explanations of them. With a decision tree, you can explain the reason for a decision simply by following the path down the tree and reading the rules that define a leaf. With a regression, you explain the reason for a decision by examining the model parameters and, in the case of logistic regression, the associated odds ratios.

With two-stage models, even if the individual component models are explainable, the ultimate decision that comes from combining them may not be, because the decision rests on a non-linear combination of the two component predictions.
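As a rough illustration of that combination (the variable and data set names below are placeholders, not the course's code; the actual calculation appears in the surrogate-model demonstration at the end of this chapter), the expected profit multiplies the response probability by the predicted donation amount and subtracts the mailing cost, and the decision follows its sign:

   /* Hedged sketch of a two-stage decision.  P_RESPONSE and P_AMOUNT are
      placeholder names for the two component model predictions. */
   data decisions;
      set scored;                             /* placeholder input data set  */
      expected_profit = p_response*p_amount   /* P(donate) x predicted gift  */
                        - 0.68;               /* less the cost of a mailing  */
      solicit = (expected_profit > 0);        /* solicit when profit is positive */
   run;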

This chapter introduces three techniques for explaining modeling results. All three exploit the fact that two-stage models are ultimately used to make decisions: by explaining the decisions, you explain the model.

The first technique, demonstrated in the course Predictive Modeling Using Enterprise Miner™, employs a decision tree to predict the decision of a two-stage model. This yields a set of simple rules that define decision segments. To explain the model, you state the rules that lead to a particular segment. The segments can then be ordered by relevant criteria such as overall value.

The second technique builds on the first. The cases within a segment are examined, and examples from each segment are extracted. Presenting the decision consequences for these examples leads to a general understanding of how the model works.

In the third technique, a simple regression model is used to model the decision of a complex model. This derivative regression model can then be used to describe the broad trends of the original model. In some situations the model may act as a surrogate for the original model.


2.2 Explaining with Trees

In this section, a Decision Tree model is used to model the decisions made by the Joint Neural Network model fit in Section 1.5. By understanding the decisions of the model, you can come to understand the model itself.

The decision boundary formed by the Joint Neural Network model can be highly non-linear, and it may even comprise several disjoint regions. This motivates the use of decision trees: they can approximate (hyper-)surfaces like these.


Creating Decision Segments

This demonstration illustrates constructing tree explanations of two-stage models. You will modify the Assess Two-Stage Model node to attach expected profit and decision information to the training and validation data. You will then construct a tree to explain the decisions made by the Joint Neural Network node.

1. Open the Assess Two-Stage Model node attached to the Joint Neural Network node.

2. Modify the last two lines of the program in the Program tab as follows:

   %two_stage_assess(&_train,&_tra);
   %two_stage_assess(&_valid,&_val);

The second macro parameter identifies a scored version of the assessed data. Not only will the macro create fit statistics for &_TRAIN, it also creates a new data set named &_TRA with the same columns as &_TRAIN, plus columns containing expected profit estimates, adjusted probabilities, weights, and decisions. (A quick check of these new columns is sketched after the export steps below.)

Define &_TRA and &_VAL:

1. Select the Exports tab.

2. Uncheck Pass imported data sets to successors.

3. Add training and validation export data sets.


4. Close the SAS Code node, save changes, and run the modified Assess Two-Stage Model node. There is no need to view the results.
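If you want to confirm what the macro appended, a short check can be added to the end of the node's program. This is only a sketch: the &_TRA reference follows the usage above, and the column names (EXPECTED_PROFIT, DECISION, WEIGHT) anticipate the variables used in the demonstrations that follow.

   /* Optional sketch: inspect the columns added to the exported training data. */
   proc contents data=&_tra varnum;
   run;

   proc print data=&_tra(obs=5);
      var expected_profit decision weight;
   run;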

Next, change the Model Roles of several training data variables.

1. Connect a Data Set Attributes node to the modified Assess Two-Stage Model node.

2. Open the Data Set Attributes node and select the Variables tab.

3. Set the new model role for all variables to rejected.

4. Set the new model role for the following variables to input:
   • FREQUENCY_STATUS_97NK
   • INCOME_GROUP
   • LAST_GIFT_AMT
   • LIFETIME_AVG_GIFT_AMT
   • MEDIAN_HOME_VALUE
   • MONTHS_SINCE_FIRST_GIFT
   • MONTHS_SINCE_LAST_GIFT
   • PEP_STAR
   • RECENT_AVG_GIFT_AMT
   • RECENT_CARD_RESPONSE_PROP

5. Set the new model role for DECISION to target.

6. Set the new model role for WEIGHT to freq. In Enterprise Miner, the terms weight and freq are used interchangeably.


7. Close and save changes to the Data Set Attributes node.

Now, build the explanation tree.

1. Attach a Tree node to the Data Set Attributes node.

2. Run the Tree node and view the results.

Of the 9,686 decisions made on the training data, the Tree model correctly reproduces 8,279, an agreement rate of about 85%. (A code sketch for verifying this agreement appears at the end of this section.)

3. View the tree diagram.

The initial split occurs on RECENT_AVG_GIFT_AMT. About 75% of individuals who have recently given more than $15 on average will receive a solicitation under the Joint Neural Network model. Additional splits refine these rules.

You can use the Define Colors option to emphasize the most frequent decision in a node.


4. Select Tools Define colors from the menu bar.

5. Select the Proportion of a target value button.

6. Select 1 from the Select a target value list.

7. Select OK.

Nodes that have a majority of solicit decisions are colored green. Nodes that have a majority of ignore decisions are colored red. Nodes with a mixture of decisions are colored yellow.

The selected Tree model has 27 leaves, which is somewhat large for interpretation purposes.

Change the number of leaves to 19. The accuracy is reduced by 1%, but the tree is now a little simpler. You can experiment with accuracy versus tree-size trade-offs (and other tree options) to achieve a description of the Joint Neural Network model that is both understandable and accurate.
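If you want to verify the roughly 85% agreement in code rather than in the Tree results window, a minimal sketch along the following lines works. It assumes the Tree-scored training data is available to a SAS Code node attached to the Tree node (referenced here as &_TRAIN) and that the tree's classification is written to I_DECISION, the usual Enterprise Miner naming; both references are assumptions, not code from the course.

   /* Hedged sketch: cross-tabulate the neural network decisions (DECISION)
      against the decisions reproduced by the tree (I_DECISION).  The data
      set reference and variable name are assumptions. */
   proc freq data=&_train;
      tables decision*i_decision / nopercent norow nocol;
      weight weight;   /* honor the oversampling weights */
   run;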


2.3 Explaining by Example

While a decision tree provides a general description of a predictive model, consumers of model information (for example, domain experts) may come to better appreciate the model results by understanding the decision consequences for a select set of cases. One way to illustrate this is to draw a set of cases from the various tree nodes and display a variety of assessment information for each.


Selecting Example Cases

This demonstration focuses the tree explanation on specific cases drawn from the training data.

1. Open the Tree node settings window and select Process or Score: Training, Validation, and Test.

2. Close the Tree settings window. Node identification information will be added to the training data set when the diagram is run.

3. Connect an Insight node to the Tree node.

4. Run the Insight node and view the results.

Create a box plot of expected profit versus tree node ID.

1. Select Analyze Box Plot / Mosaic Plot (Y).

2. Select _NODE_ as the X variable.

3. Select EXPECTED_PROFIT as the Y variable.

4. Select WEIGHT as the FREQ variable.

5. Select OK. A Box Plot window opens.


The horizontal line in each box represents the median profit within the node. If the horizontal line is above zero, the node has a majority of solicit decisions.
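If you prefer a tabular companion to the box plot, a weighted summary of expected profit by node identifies the same high-profit nodes. The sketch below is an assumption about data access (it supposes the Tree-scored training data is available to a SAS Code node as &_TRAIN) and reports weighted averages rather than medians.

   /* Hedged sketch: weighted average expected profit by tree node. */
   proc means data=&_train mean sum maxdec=3;
      class _node_;
      var expected_profit;
      weight weight;
   run;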

From inspection of the plot, it is apparent that node 15 has the highest expected profits. But what defines node 15?

Unlike other nodes in Enterprise Miner, the Insight node results can be left open while other nodes are run.

1. Select the Enterprise Miner diagram window.

2. View the Tree results.

3. View the Tree diagram.

4. Select View Statistics.

5. Set the view node ID to Yes. Node IDs are added to the tree diagram.

The Node IDs are assigned left to right, top to bottom.

Node 15 is characterized by above average income and high recent average gift amount.

6. Return to the box plot and double-click the observation with highest expected profit in node 15. The values of all variables associated with the selected case are shown.

Note the extremely high recent average gift amount. This is certainly somebody worth targeting.

A similar analysis can be conducted for other cases in other nodes. You can use the data-marking capabilities of Insight to mark examples for later review.


2.4 Creating a Surrogate Model

In Section 2.2, you created a Tree model to describe the decisions produced by a more complicated model. The decisions made with the Tree model agreed with the decisions made by the more complicated model more than 85% of the time.

The Tree model provides insight into the reasons for the decisions produced by the complex model, but it is not a substitute for the original model. While Tree models have the theoretical property of universal approximation, there are practical limits to the classification accuracies achievable with a limited number of splits based on finite data. Much of this is due to their limited ability to produce smooth predictions.

Is it possible to obtain an even more precise description of complex models like the Joint Neural Network model and still have insight into model workings? Just as changing from a Tree model to a Regression improved MSE for the binary target model in Chapter 1, it may also improve descriptive accuracy here.


Creating a Descriptive Regression Model

Everything needed to create the Regression Model is already in place. You simply need to add the node to the diagram.

1. Connect a Regression node to the Data Set Attributes node.

2. Open and configure the Regression node for stepwise selection. The default selection criterion, Profit/Loss, is ideally suited for this zero-noise modeling problem.

3. Run the Regression node and view the Statistics tab. The misclassification rate of 0.14 puts the accuracy of the model at 86%, not much of an improvement over the tree. This raises the question of how the two models disagree; that question is answered shortly.

4. Select the Output tab and scroll to the bottom of the report.

Most of the available inputs are used in the model. The odds ratios describe how the various factors influence the solicit decision.

   Odds Ratio Estimates

   Input                              Odds Ratio
   FREQUENCY_STATUS_97NK                   4.428
   INCOME_GROUP                            1.906
   LAST_GIFT_AMT                           1.254
   LIFETIME_AVG_GIFT_AMT                   1.141
   MEDIAN_HOME_VALUE                       1.001
   MONTHS_SINCE_FIRST_GIFT                 1.026
   MONTHS_SINCE_LAST_GIFT                  0.840
   PEP_STAR 0 vs 1                         0.125
   RECENT_AVG_GIFT_AMT                     1.312
   RECENT_CARD_RESPONSE_PROP               8.668
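For example, holding the other inputs fixed, each additional unit of FREQUENCY_STATUS_97NK multiplies the odds of a solicit decision by roughly 4.4, while each additional month since the last gift multiplies the odds by about 0.84, a drop of roughly 16%.

For reference, what the Regression node fits here is essentially a stepwise logistic regression of the neural network's DECISION on the selected inputs. The following is only a sketch of that model, not the node's generated code: the &_TRA reference, the use of a WEIGHT statement (the node itself treats WEIGHT as a frequency), and the CLASS treatment of PEP_STAR are all assumptions.

   /* Hedged sketch of the surrogate logistic regression. */
   proc logistic data=&_tra descending;
      class pep_star;
      model decision = frequency_status_97nk income_group last_gift_amt
                       lifetime_avg_gift_amt median_home_value
                       months_since_first_gift months_since_last_gift
                       pep_star recent_avg_gift_amt recent_card_response_prop
            / selection=stepwise;
      weight weight;
   run;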


Deploying a Surrogate Model

While building a derivative model certainly aids in understanding the original, how well would it work as a substitute or surrogate for the original?

Perhaps you should first ask why you would want to do this. There are several answers. First, deploying a neural network model requires functions, such as the hyperbolic tangent, that may not be available in every deployment environment. Second, sometimes speed is just as important as accuracy; simpler code with similar accuracy can be very desirable. Finally, some applications, such as credit risk scorecards, require additive results. An additive model derived from a non-additive one may provide a good balance between performance and interpretability.

This final demonstration uses the Derivative Regression model in place of the Joint Neural Network model on the 1998 KDD-Cup competition data. Can an extremely simple surrogate model perform as well as the extremely complex models entered into the KDD-Cup competition?

1. Connect the competition Score node to the most recently added Regression node.

2. Run the Score node. Do not view the results.

3. Connect a SAS Code node to the Score node.


4. Open the SAS Code node and import the program surrogate assess-score.sas.

5. Close the SAS Code node and save the changes.

Run the SAS Code node, view the results, and select the Output tab. The total profit of $15,328.51 beats the best 1998 KDD-Cup model by about $600, and it does so as an additive model.

To understand, at least in part, why the regression model works so well, you must study the conditions under which the original and surrogate models disagree.

1. Import the file surrogate assess-agreement.sas into the Program tab of the SAS Code node.

The program creates a view with two variables: EXPECTED_PROFIT (calculated from the original model) and AGREE. The variable AGREE indicates whether the original and surrogate models agree or disagree. The data is truncated to include only EXPECTED_PROFIT values less than 3.

   data score_results/view=score_results;
      merge &_score crssamp.pva_results;
   run;

   data agree/view=agree;
      set score_results;
      keep expected_profit agree;
      p_adj=p_target_b1*(0.05/0.25)/
            ((1-p_target_b1)*(0.95/0.75)+p_target_b1*(0.05/0.25));
      expected_profit= p_adj*exp(p_targ_mto+0.1380/2)-0.68;
      if expected_profit < 3;
      if ((p_decision1>0.5) and (expected_profit>0)) or
         ((p_decision1<0.5) and (expected_profit<0)) then agree='yes';
      else agree='no ';
   run;

The view is opened in SAS/INSIGHT, and two plots are generated: a box chart showing the frequency of AGREE=YES and AGREE=NO, and a distribution plot of EXPECTED_PROFIT.

   proc insight;
      open agree/nodisplay;
      box agree;
      dist expected_profit;
   run;
   quit;

2. Run the SAS Code node and view the results.

3. Click on the box indicating AGREE=NO and examine the distribution plot for EXPECTED_PROFIT.


The distribution plot shows the distribution of EXPECTED_PROFIT for AGREE=NO within the overall distribution of EXPECTED_PROFIT.

Most of the disagreement occurs for expected profits near zero, in other words, near the decision boundary for the original model. This explains why the surrogate model is so competitive with the original model. Cases near the decision boundary contribute little to the overall average profit of a predictive model.

Therefore, in answer to the original question: in this case, a surrogate model can indeed perform as well as a complex predictive model.