bidm assignment no1

Upload: ee052022

Post on 20-Jul-2016

BIDM Assignment No. 1

Predictive Modelling Using Decision Trees

A supermarket offers a new line of organic products. The supermarket’s management wants to determine which customers are likely to purchase these products.

The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of the loyalty program participants and collected data that includes whether these customers purchased any of the organic products.

The ORGANICS data set contains over 22,000 observations. The data mining objective is to determine whether a customer would purchase organic products or not. The target variable (ORGYN) is a binary variable that indicates whether an individual purchased organic products or not. Dataset: ORGANICS (uploaded on Claroline). You need to build a Decision Tree model using SAS Enterprise Miner.

Steps to be followed:

1) Create a new folder and upload all the SAS datasets, especially the ORGANICS dataset, to the folder. Create a new library and link it to the folder. The steps to be followed are listed below.

When you open SAS 9.2, several libraries are automatically assigned and can be seen in the Explorer window.

1. Double-click on the Libraries icon in the Explorer window.

To define a new library:

2. Right-click in the Explorer window and select New.

3. In the New Library window, type a name for the new library. For example, type CRSSAMP.

4. Type in the path name or select Browse to choose the folder to be connected with the new library name. For example, the chosen folder might be located at C:\workshop\sas\dmem.

5. If you want this library name to be connected with this folder every time you open SAS, select Enable at startup.

6. Select OK. The new library is now assigned and can be seen in the Explorer window.

7. To view the data sets that are included in the new library, double-click on the icon for Crssamp.

2) Open SAS Enterprise Miner. To start Enterprise Miner, type miner in the command box or select Solutions – Analysis – Enterprise Miner.

3) Create a new project (File – New – Project) and a diagram.

4) Drag the Input Data Source to the Diagram Workspace. Open the Input Data Source Node and select ORGANICS as the source data. Select Change in Metadata sample and select use complete data as sample. Change the role of variable ORGYN to target.

5) Connect Multiplot and Insight nodes to the Input Data Source node. Run the Multiplot Node and explore the results.

6) Set the roles for the analysis variables. Check that the model role for custid is set to id, while the model role for DOB, EDATE and LCDATE is set to rejected. Also, set the model role for variables AGEGRP1, AGEGRP2 and NEIGHBOURHOOD to rejected.

Set the model role for ORGYN to target (check that the measurement level is binary), while the model role of ORGANICS should be set to rejected.

As noted above, only ORGYN will be used as the target for this analysis and should have a role of Target. Try using ORGANICS as an input variable, report the results of the decision tree and answer the following question: Can ORGANICS be used as an input for a model that is used to predict ORGYN? Why or why not?

Attach a screenshot (shot_1) of the input data source showing the model role of all the variables.

7) Connect a Data Partition node and partition the dataset into training – 60% and validation – 40%.

8) Connect a Decision Tree Node. Open the Node and in the basic settings select gini reduction as splitting criterion. Keep the default stopping rules (no. of observations in leaf node, observations required for split search, max branches from node, max no. of levels). You may try changing the splitting criterion and stopping rules and see the impact on the results.
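To make the gini reduction criterion in step 8 concrete, here is a minimal Python sketch, written outside SAS Enterprise Miner, of how the worth of one candidate split can be computed. The function names and the ten-customer node are hypothetical and only illustrate the arithmetic; this is not Enterprise Miner's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

# Hypothetical node: 10 customers, 4 buyers (1) and 6 non-buyers (0).
parent = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical binary split of that node, e.g. on an age threshold.
left = [1, 1, 1, 0]
right = [1, 0, 0, 0, 0, 0]

reduction = gini_reduction(parent, [left, right])
print(round(reduction, 4))
```

A split search would evaluate this quantity for every candidate split and keep the one with the largest reduction. Passing three child lists instead of two gives the same calculation for a 3-way split.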

9) In the advanced settings, select proportion misclassified as the assessment criterion. Click on score and select process or score training, validation and test, and click on show details of validation.

10) Run the decision tree node and explore the results. Go to View – Tree to view the tree. Go to View – Tree Options to change the no. of levels of the tree that you want to view. Go to Plot and Table to explore the plot of misclassification error vs. the no. of leaves. What is the no. of leaves and the corresponding misclassification error in the final selected/pruned subtree? Go to Score and variable selection to see the variables ranked in order of importance.
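The idea behind the misclassification error vs. no. of leaves plot can be sketched in Python as follows. This is only an illustration under stated assumptions: the error values are hypothetical, the function names are made up for this sketch, and breaking ties in favour of the smaller tree is an assumption of the sketch rather than a documented rule of the Tree node.

```python
def misclassification_rate(actual, predicted):
    """Proportion of observations whose predicted class differs from the actual class."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def select_subtree(validation_error_by_leaves):
    """Return (leaves, error) with the lowest validation error; ties go to fewer leaves."""
    return min(validation_error_by_leaves.items(),
               key=lambda item: (item[1], item[0]))

# Hypothetical validation misclassification rates for subtrees of increasing size.
validation_error = {1: 0.248, 2: 0.215, 4: 0.197, 7: 0.188, 12: 0.188, 20: 0.193}

leaves, error = select_subtree(validation_error)
print(leaves, error)
```

The error typically falls as leaves are added and then flattens or rises on the validation data, which is why the selected subtree is usually smaller than the largest tree grown.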

Attach screenshots of the tree results (tree, plot and final selected variables): shot_2, shot_3, shot_4.

11) Connect Insight Node to the Decision Tree Nodes. Open the Insight node, select entire dataset and validation dataset. Run the Insight node and explore the results. Attach a screenshot of the Insight node results: shot_5.

12) Connect Assessment Node to both the Decision Tree Nodes. Run and explore the results (lift charts – cumulative and non-cumulative % response chart, % captured response and lift value). What is the cumulative % response, % captured response and lift value for the decision tree at 10 percentile and 20 percentile? Also, what is the non-cumulative % response, % captured response and lift value at 20 percentile?

Percentile                              10     20

Cumulative % response                   79     65

Cumulative % captured response          32     52

Cumulative lift value                   3.2    2.6

Non-cumulative % response                      51

Non-cumulative % captured response             20

Non-cumulative lift value                      2
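The non-cumulative figures at the 20th percentile can be cross-checked against the cumulative ones. The sketch below assumes the two percentile bins are of equal size and that the reported values are rounded; the variable and function names are chosen for this illustration only.

```python
# Cumulative figures reported in the table for the 10th and 20th percentiles.
cum_response = {10: 79.0, 20: 65.0}   # cumulative % response
cum_captured = {10: 32.0, 20: 52.0}   # cumulative % captured response
cum_lift = {10: 3.2, 20: 2.6}         # cumulative lift value

def second_bin_from_cumulative(cum_10, cum_20):
    """The cumulative value at 20% is the mean of two equal-sized bins,
    so the second bin alone equals 2 * cum_20 - cum_10."""
    return 2 * cum_20 - cum_10

# % response and lift are averages over bins, so the averaging identity applies.
response_2 = second_bin_from_cumulative(cum_response[10], cum_response[20])
lift_2 = second_bin_from_cumulative(cum_lift[10], cum_lift[20])
# % captured response is additive across bins, so a simple difference applies.
captured_2 = cum_captured[20] - cum_captured[10]
# Lift is % response divided by the overall response rate, which recovers the baseline.
baseline = cum_response[10] / cum_lift[10]

print(response_2, captured_2, round(lift_2, 2), round(baseline, 1))
```

Under these assumptions the derived values agree with the non-cumulative row of the table (51, 20 and 2), and the implied overall response rate is roughly 25%.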

Attach screenshots of cumulative and non-cumulative % response and % captured response charts: shot_6, shot_7, shot_8, shot_9.

13) Connect a 3-way decision tree (a decision tree with max no. of branches = 3), run and view the results. Also, connect the assessment node to the 3-way tree. Based on misclassification error and the lift charts, which model would you select (and why?) if

a. you have to target the top 50% responders

b. you have to target the top 20% responders
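One way to frame the comparison in step 13 is to read the cumulative % captured response of each model at the targeting depth of interest. The Python sketch below shows only the mechanics: apart from the value 52 taken from the table above, every number is hypothetical, and the model names are made up for the illustration. Your own Assessment node results must be used to answer the question.

```python
def best_model_at_depth(captured_by_model, depth):
    """Return the model with the highest cumulative % captured response at the given depth."""
    return max(captured_by_model, key=lambda name: captured_by_model[name][depth])

# Cumulative % captured response by targeting depth (percent of customers contacted).
captured = {
    "binary_tree": {20: 52.0, 50: 80.0},     # 52 is from the table; 80 is hypothetical
    "three_way_tree": {20: 50.0, 50: 83.0},  # both values are hypothetical
}

for depth in (20, 50):
    print(depth, best_model_at_depth(captured, depth))
```

With these illustrative numbers the preferred model differs between the two depths, which is why the question asks about the top 20% and the top 50% separately. Misclassification error summarises the whole validation set, whereas the lift charts describe performance at a specific depth.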