predictive modeling with enterprise miner

71
Copyright © 2011 Clarity Solution Group Predictive Modeling with Enterprise Miner Jeffrey Strickland, Ph.D. Senior Consultant

Upload: jeffrey-strickland-phd-cmsp-asep

Post on 29-Jul-2015

285 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Predictive Modeling with Enterprise Miner

Copyright © 2011 Clarity Solution Group

Predictive Modeling with

Enterprise Miner

Jeffrey Strickland, Ph.D.

Senior Consultant

Page 2: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Learning Objectives

• To understand the application of regression analysis in data mining

• Linear/nonlinear

• Logistic (Logit)

• To understand the key statistical measures of fit

• To learn how to run and interpret regression analyses using SAS Enterprise Miner software

Page 3: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

SAS Enterprise Miner

• These results can be obtained using Excel or using a data mining package such as SAS Enterprise Miner 5.3

• Using SAS Enterprise Miner requires the following steps:• Convert your data (usually in an Excel file) into a SAS data

file Using SAS 9.1

• Create a project in Enterprise Miner

• Within the project:

• Create a data source using your SAS data file

• Create a diagram that includes a data node and a regression node and a multiplot node for graphs

• Run the model in the diagram and review the results

Page 4: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Creating a SAS data file from an Excel file: open SAS 9.1. Select File

then Import Data

Page 5: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

This opens the import wizard. Since the source file is from Excel,

click Next. Then click Browse to find the TempKWatts.xls file

Page 6: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Since the data are on sheet1$, click Next. Then enter SASUSER

as the Library and TEMPKILOWATTL as the Member. Then click

Next

Page 7: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Now click Finish to create your file

Page 8: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Open SAS Enterprise Miner 5.3. Enter the user name and

password provided

Page 9: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The Enterprise Window below opens. Select New Project

Page 10: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The Create New Project dialog box appears. Select the General

tab, then type the short name of the project, e.g., KWattTemp0.

Keep the default path.

Page 11: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

In the Startup code tab, enter:

libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS

Files\9.1\EM_Projects";

This code will be run each time you open the project

Page 12: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The Enterprise Miner application window opens

Page 13: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Right-click on Data Source, opening the wizard. Source is SAS

table, so click Next

Page 14: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Browse the SAS libraries to find the SAS table Tempkilowattl

found in the SASuser Library (previously created)

Page 15: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Click Next twice. Note that the Table properties shows that we

have two variables with 12 observations

Page 16: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The next step controls how Enterprise Miner organizes

metadata for the variables in your data. Select advanced,

then click next

(you can view/change the settings if you click Customize

before clicking Next)

Page 17: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Change Role of KWatts to target (outcome variable); change

Level of both KWatts and Temp to interval (continuous

values); then click Next (Other levels are possible, such as

binary). You can click on Explore if you wish to look at some

basic stats – we will do this later

Page 18: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Here Role relates to the role of the data set (raw, train, validate,

score); raw is fine for our analysis of data, so click Finish

Page 19: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Tempkilowattl now appears under Data Sources in the top left panel

called the Project Panel

Page 20: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

We need to create a Diagram for our model. Right-click on

Diagrams, then enter TempKwatts0 in the dialog box. Now the

left panel shows TempKwatts0 as a Diagram, and the right-

hand panel is called the Diagram Workspace. Icons can be

dragged and dropped onto the Diagram Workspace.

Page 21: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Now add an Input Data Node to the Diagram. From the Data

Sources list in the Project Panel drag and drop the Data

Source TempKwatts0 onto the Diagram Workspace. Note that

when input data node is highlighted, various properties are

displayed on the left-hand panel.

Page 22: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

If you wish to see the properties of any or all of the variables,

highlight the input data node; then on the left hand Properties

Panel under Train, click on the box to the right of Variables; in

the screen that opens control-click on KWatts and Temp; then

click on Explore in the lower right

Page 23: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Frequency distributions for the variables and the raw data are

provided. Right-clicking on observations in the lower-left panel will

show where they appear in the bar charts. Cancel when finished.

Page 24: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Click on the Explore tab found over the Diagram Workspace,

and then drag and drop the Multiplot icon onto the field. Using

your cursor, draw a directed arrow from the TempKwattsl icon

to the Multiplot icon. With the Multiplot icon highlighted, its

properties are found in the left-hand Properties Panel.

Page 25: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Right-click on the Multiplot icon and select Run. After the run is

completed select Results from the Run Status window.

Page 26: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Various charts are available as shown below. Descriptive statistics

for each variable are given in the lower pane.

Page 27: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Click on the Model tab and drag the Regression icon onto the

Model field. Connect the Tempkwattsl icon to the Regression

icon. Highlight the Regression icon and on the Property Panel

change Regression Type to linear regression.

Page 28: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Run the Regression and select Results. Starting from the

upper left and going clockwise, these windows show the fit

between target and predicted in percentile terms, the various

fit statistics, model output (estimates, F and t stats, R-square),

and the two effects (intercept and slope – bars represent size

and color represents direction)

Page 29: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

For a given percentile, the Target Mean is the actual (or

estimated value based on actuals), or what you are trying to

predict; the Mean for Predicted is the forecasted values, or the

predictions (or estimated values based on forecasts). The

results are shown from highest to lowest forecasted values.

The distances between the curves shows how well the model

predicts the actual data.

Page 30: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

A variety of fit statistics are provided. These include SSE,

MSE=SSE/(n-2), ASE=SSE/n, RMSE=SQRT(MSE),

RASE=SQRT(SSE), FPE = MSE (n+p+1)/n, MAX = largest error in

terms of absolute value, where n = no. of observations, p=no. of

variables in model (one in our case).Schwartz’s Bayesian Criterion and Akaike’s Information Criterion

are used for model selection (comparing one model to another).

Schwartz’s adjusts the residual squared error for the number of

parameters estimated, while Akaike’s is a relative measure of

information lost from fitting the model.

Page 31: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Kwatts vs. Temp Example 2

• Another approach to modeling the relationship between Kwatts and Temp is to use a nonlinear regression

• This is easily accomplished in Enterprise Miner – highlight the regression node, then in the left hand panel select yes for polynomial terms

• We use the default of two terms

• Is the fit any better???

Page 32: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Page 33: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Multiple Regression

Consider the following data relating family size and income to food expenditures:

family food $ income $ family size1 5.2 28 32 5.1 26 33 5.6 32 24 4.6 24 15 11.3 54 46 8.1 59 27 7.8 44 38 5.8 30 29 5.1 40 110 18 82 611 4.9 42 312 11.8 58 413 5.2 28 114 4.8 20 515 7.9 42 316 6.4 47 117 20 112 618 13.7 85 519 5.1 31 220 2.9 26 2

Page 34: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Multiple Regression

• We can run this problem in Enterprise Miner using the same approach followed with the previous example

• On our model field we have placed the data source called foodexpenditures, and also both Multiplot and StatExplore found under the Explore tab above the model field

• Highlight foodexpenditures, then in the left-hand panel under Training, find variables and click on the box to the right to open up the variables

• Change the role of family to rejected (it is just the number of the observation) and change the level of food_ to target, and income_,food_, and fam_size to interval, then click OK

Page 35: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Foodexpenditures Model

Page 36: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Highlight the StatExplore node, right-click to Run, then

select Results. Correlations between the input variables and

the target are provided, along with basic statistics. The input

variables are ordered by the size of the correlations. Now

close out the results window and run the regression node

and obtain results

Page 37: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Starting from the upper left and going clockwise, these

windows show the fit between target and predicted in

percentile terms, the various fit statistics, model output

(estimates, F and t stats, R-square), and the three effects

(intercept and slopes for the two input variables with bars

represent size and color represents direction). The model is

significant and is a good fit with the data.

Page 38: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

What happens in regression analysis when the target variable is binary?

• There are many situations when the target variable is binary – some examples:

• whether a customer will or will not receive credit

• whether a customer will or will not response to a promotion

• Whether a firm will go bankrupt in a year

• Whether a student will pass an exam!!!

Page 39: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Passing an Exam Data

Student id Outcome Study Hours

1 0 3

2 1 34

3 0 17

4 0 6

5 0 12

6 1 15

7 1 26

8 1 29

9 0 14

10 1 58

11 0 2

12 1 31

13 1 26

14 0 11

Page 40: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Running a linear regression to predict pass/don’t pass as a function

of hours of study provides a model that doesn’t correctly model the

data. The data are given in exampassing.xls

Passing an Exam

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 10 20 30 40 50 60 70

hours of study

pa

ss o

r d

on

't p

ass

Actual

Predicted

Page 41: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The Enterprise Miner results show a poor fit on a percentile basis

between predicted and target – another modeling approach is

needed.

Page 42: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Logistic Regression

• Similar to linear regression, two main differences

• Y (outcome or response) is categorical

• Yes/No

• Approve/Reject

• Responded/Did not respond

• Result is expressed as a probability of being in either group.

Page 43: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Comparing the Logistic & Linear Regression Models

Page 44: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Logisitic regression

p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)]

1-p =1/[1+exp(a+bx)]

ln [p/(1-p)] = a + bx

where:

exp or e is the exponential function (e=2.71828…)

ln is the natural logarithm (ln(e) = 1)

p is probability that the event y occurs given x, and can range

between 0 and 1

p/(1-p) is the "odds ratio"

ln[p/(1-p)] is the log odds ratio, or "logit"

all other components of the regression model are the same

Page 45: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Odds Ratio

• Frequently used

• Related to probability of an event as follows:

Odds Ratio = p/(1-p)

• Example:

• Probability of firm going bankrupt = .25

• Odds firm will go bankrupt = .25/(1-.25) = 1/3 or 3 to 1

• This is how sports books calculate odds

• (e.g., if odds of VU winning a championship are 2:1, probability is 1/3

• ln [p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds ratio increases by b, or the odds ratio increase by a factor of exp(b)

Page 46: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Probability, Odds Ratio, LN of Odds Ratio

-5

0

5

10

15

20

25

0.05 0.

10.15 0.

20.25 0.

30.35 0.

40.45 0.

50.55 0.

60.65 0.

70.75 0.

80.85 0.

90.95

probability

odds

nl(odds)

Page 47: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Running the exam data: Change regression type from linear

regression to logistic regression

Highlight the data node; on left-hand panel under Train open

variables and change the level of outcome to binary

Page 48: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Results show a much better fit (upper left) and only one

misclassification (lower right – a false negative).

Page 49: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The results show that the odds ratio = p(1-p) = exp(-

8.4962+0.4949x). For every additional hour of study the

odds ratio increases by a factor of exp(0.4949)= 1.640

Page 50: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Understanding Response Rate and Lift

To better understand the top left chart, change cumulative lift to

cumulative % response. The observations are ranked by the

predicted probability of response (highest to lowest) for each

observation (from the fitted model).

Page 51: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Understanding Response Rate and Lift

• Since the first 6 passes were correctly classified, the cumulative %

response is 100% through the 40th percentile.

• At the 50th percentile the next observation with the highest predicted

probability is a non-response, so the cumulative response drops to

6/7 or 85.7%.

• The 8th ranked observation, between the 55th and 60th percentile, is

a positive response, so the cumulative % response is about 7/8 or

87%.

• Since there are no more positive responses after the 60th

percentile, the cumulative response rate will drop to 50%.

• The chart compares how well the cumulative ranked predictions

lead to a match between actual and predicted responses

Page 52: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Understanding Response Rate and Lift

• Lift calculates the ratio of the actual response rate (passing) of the

top n% of the ranked observations to the overall response rate.

Cumulative lift is likewise defined.

• At the 50th percentile, the cumulative % response is 88.7%, the

cumulative base response is 50%, for a lift of 1.7142.

Page 53: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

On the Properties Panel, click on Exported Data to see the predicted

probabilities and response for each observation and compare to the

actual response.

Page 54: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Logistic regression uses maximum likelihood (and not sum

of squared errors) to estimate the model parameters. The

results below show that the model is highly significant based

on a chi-square test. The Wald chi-square statistic tests

whether an effect is significant or not.

Page 55: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Bankruptcy Prediction

• To predict bankruptcy a year in advance, you might collect:

• working capital/total assets (WC/TA)

• retained earnings/total assets (RE/TA)

• earnings before interest and taxes/total assets (EBIT/TA)

• market value of equity/total debt (MVE/TD)

• sales/total assets (S/TA)

Page 56: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Bankruptcy Training Data

Firm WC/TA RE/TA EBIT/TA MVE/TD S/TA BR/NB1 0.0165 0.1192 0.2035 0.813 1.6702 12 0.1415 0.3868 0.0681 0.5755 1.0579 13 0.5804 0.3331 0.081 0.5755 1.0579 14 0.2304 0.296 0.1225 0.4102 3.0809 15 0.3684 0.3913 0.0524 0.1658 1.1533 16 0.1527 0.3344 0.0783 0.7736 1.5046 17 0.1126 0.3071 0.0839 1.3429 1.5736 18 0.0141 0.2366 0.0905 0.5863 1.4651 19 0.222 0.1797 0.1526 0.3459 1.7237 110 0.2776 0.2567 0.1642 0.2968 1.8904 111 0.2689 0.1729 0.0287 0.1224 0.9277 012 0.2039 -0.0476 0.1263 0.8965 1.0457 013 0.5056 -0.1951 0.2026 0.538 1.9514 014 0.1759 0.1343 0.0946 0.1955 1.9218 015 0.3579 0.1515 0.0812 0.1991 1.4582 016 0.2845 0.2038 0.0171 0.3357 1.3258 017 0.1209 0.2823 -0.0113 0.3157 2.3219 018 0.1254 0.1956 0.0079 0.2073 1.489 019 0.1777 0.0891 0.0695 0.1924 1.6871 020 0.2409 0.166 0.0746 0.2516 1.8524 0

Page 57: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Bankruptcy Example

• Using the BankruptTrain.xls data create a SAS data file called bankrupt• BR_NB: role is target and level is binary

• Firm: role is rejected and level is nominal (it is simply the firm number)

• Remaining five financial ratio variables: role is input and level is interval

Page 58: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Create a diagram named bankrupt1. Drag and drop the data node

onto the model. Highlight the data node and on the left hand panel

under variables click on the box to its right to see the variables data

Page 59: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

From the Explore tab drag and drop the StatExplore node onto

the diagram and link it to the bankrupt node. Highlight the

StatExplore node, right-click and run it, and obtain results. On

top, correlations between the five input variables and the target

are shown via bars ordered from largest to smallest. Below the

mean variable score for bankrupt vs. non-bankrupt

observations is shown.

Page 60: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

From the Model tab drag and drop the regression node onto the

diagram and connect it to the bankrupt node. Highlight the

regression node and run, and obtain the results

Page 61: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The results show that the model fits the data very well with

highly significant overall chi square statistic, low error values,

and 0 misclassifications. Cumulative lift shows that for the

top 50% of observations that are bankrupt, they are twice as

likely to be classified as bankrupt.

Page 62: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Scoring

• Once you have specified a model you might wish to apply it to new data whose

outcome is unknown -- make predictions

• This can be easily accomplished in Enterprise Miner using scoring

• Convert the data set BankruptScore.xls to a SAS file called bankruptscore. The

role of this data is score.

Page 63: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Bankruptcy Scoring Data

Firm WC/TA RE/TA EBIT/TA MVE/TD S/TA

A 0.1759 0.1343 0.0956 0.1955 1.9218

B 0.3732 0.3483 -0.0013 0.3483 1.8223

C 0.1725 0.3238 0.104 0.8847 0.5576

D 0.163 0.3555 0.011 0.373 2.8307

E 0.1904 0.2011 0.1329 0.558 1.6623

F 0.1123 0.2288 0.01 0.1884 2.7186

G 0.0732 0.3526 0.0587 0.2349 1.7432

H 0.2653 0.2683 0.0235 0.5118 1.835

I 0.107 0.0787 0.0433 0.1083 1.2051

J 0.2921 0.239 0.9673 0.3402 0.9277

Page 64: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Drag and drop the bankruptscore data node to the bankrupt1

diagram. From the Assess tab, drag and drop the Score node

into the diagram. Link the regression and bankruptscore

nodes together and connect them to the Score node.

Page 65: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Run the Score node and obtain the Results. Of the 10 firms, 6

are predicted to become bankrupt.

Page 66: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

For details about the individual predictions, highlight the Score

node and on the left-hand panel click on the square to the right of

Exported Data. Then in the box that appears click on the row whose

Port entry is Score. Then click on Explore.

Page 67: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

The lower portion of the output is shown below. The predictions

are given, along with the probabilities of the firm becoming

bankrupt or not.

Page 68: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Regression Using Selection Models

• When there are a number of possible input variables, procedures are available

to sort through them and include those that have a certain level of statistical

significance

• SAS Enterprise Miner 5.3 offers three selection methods:

• Backward

• Forward

• Stepwise

Page 69: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Regression Using Selection Models

• Backward: training begins with all candidate effects in the model and removes effects until the stay significance level or the stop criterion is met

• Forward: training begins with no candidate effects in the model and adds effects until the entry significance level or the stop criterion is met.

• Stepwise: training begins as in the forward model but may remove effects already in the model. This continues until the stay significance level or the stop criterion is met

Note that the default significance levels (p values) values are 0.05 and no stop criteria (such as maximum number of steps in the regression) are set

Page 70: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Regression Using Selection Models – Bankruptcy Model

To select stepwise regression

for the bankruptcy model, highlight

the regression node and in the

properties panel under

Selection Model choose

Stepwise. The default significance

level of 0.05 is used

Page 71: Predictive Modeling with Enterprise Miner

Copyright © 2012 Clarity Solution Group

Regression Using Selection Models – Bankruptcy Model

• Interestingly, the Training Model only uses RE/TA as a predictor

• There are 3 misclassifications (.15 rate) in this set vs. 0 in the original model

• The results are very different: the original model with all 5 input variables predicted bankruptcy for G, E, C, and J, while the stepwise model predicted B, C, D, F, G, H, and J would become bankrupt.

• Changing the significance levels to 0.1 (to make it easier for input variables to enter/leave the stepwise model) produces the same results