

Page 1: Issues in Data Mining Applications -Tutorial- Nemanja Jovanovic, nemko@sezampro.yu Valentina Milenkovic, tina@eunet.yu Prof. Dr. Veljko Milutinovic, vm@etf.bg.ac.yu

Issues in Data Mining Applications
-Tutorial-

Authors:
Nemanja Jovanovic, nemko@sezampro.yu
Valentina Milenkovic, tina@eunet.yu
Prof. Dr. Veljko Milutinovic, vm@etf.bg.ac.yu

How to Make a Decision About Your Own Data Mining Tool?

Data Mining vs. Knowledge Mining = ?

Evolution Of Data Mining

Data Collection (1960s)
- Business question: "What was my average total revenue over the last 5 years?"
- Enabling technologies: computers, tapes, disks
- Product providers: IBM, CDC
- Characteristics: retrospective, static data delivery

Data Access (1980s)
- Business question: "What were unit sales in New England last March?"
- Enabling technologies: RDBMS, SQL, ODBC
- Product providers: Oracle, Sybase, Informix, IBM, Microsoft
- Characteristics: retrospective, dynamic data delivery at record level

Data Navigation (1990s)
- Business question: "What were unit sales in New England last March? Drill down to Boston."
- Enabling technologies: OLAP, multidimensional databases, data warehouses
- Product providers: Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies
- Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (2000)
- Business question: "What's likely to happen to Boston unit sales next month? Why?"
- Enabling technologies: advanced algorithms, multiprocessors, massive databases
- Product providers: Lockheed, IBM, SGI, numerous startups
- Characteristics: prospective, proactive information delivery

Examples of DM projects to stimulate your imagination

Here are six examples of how data mining is helping corporations to operate more efficiently and profitably in today's business environment.

- Targeting a set of consumers who are most likely to respond to a direct mail campaign
- Predicting the probability of default for consumer loan applications
- Reducing fabrication flaws in VLSI chips
- Predicting audience share for television programs
- Predicting the probability that a cancer patient will respond to radiation therapy
- Predicting the probability that an offshore oil well is actually going to produce oil

Comparison of fourteen DM tools

- Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant
- Run under MS Windows 95, MS Windows NT, or Macintosh System 7.5
- Use one of four technologies: Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
- Solve four problems: two binary classification problems, a multi-class classification problem, and a noiseless estimation problem
- Price from $75 to $25,000

Comparison of fourteen DM tools

The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus

The Rule Induction tools were
- WizWhy
- DataMind
- DMSK

Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW

The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell2
- KnowledgeMiner

Criteria for evaluating DM tools

A list of 20 criteria for evaluating DM tools, put into 4 categories:

Capability measures what a desktop tool can do, and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Quality of testing options
- Has programming language
- Provides useful output reports
- Visualisation

Visualisation

[Table: visualisation capability of each tool]
Legend: + excellent capability; [symbol missing] good capability; - some capability; "blank" no capability

Criteria for evaluating DM tools

Learnability/Usability shows how easy a tool is to learn and use
- Tutorials
- Wizards
- Easy to learn
- User’s manual
- Online help
- Interface

Criteria for evaluating DM tools

Interoperability shows a tool’s ability to interface with other computer applications
- Importing data
- Exporting data
- Links to other applications

Flexibility
- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code

Data Input & Output Model

[Table: data input, data output, and model capabilities of each tool]
Legend: + excellent capability; [symbol missing] good capability; - some capability; "blank" no capability

A classification of data sets

Pima Indians Diabetes data set
- 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
- 8 attributes plus the binary class variable for diabetes per instance

Wisconsin Breast Cancer data set
- 699 instances of breast tumors, some of which are malignant, most of which are benign
- 10 attributes plus the binary malignancy variable per case

The Forensic Glass Identification data set
- 214 instances of glass collected during crime investigations
- 10 attributes plus the multi-class output variable per instance

Moon Cannon data set
- 300 solutions to the equation: x = 2v^2 sin(θ)cos(θ)/g
- the data were generated without adding noise
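A noiseless data set of this kind can be regenerated in a few lines. The sketch below is only an illustration: it reads the equation as the standard projectile-range formula (launch angle in the sine and cosine, gravitational acceleration in the denominator), and the sampling ranges, the lunar value g = 1.62 m/s^2, and the function names are all assumptions, not details taken from the original study.

```python
import math
import random

def cannon_range(v, theta, g):
    """Projectile range: x = 2 v^2 sin(theta) cos(theta) / g."""
    return 2.0 * v ** 2 * math.sin(theta) * math.cos(theta) / g

def make_moon_cannon(n=300, g=1.62, seed=0):
    """Generate n noiseless (v, theta, x) cases.

    g = 1.62 m/s^2 (lunar gravity) and both sampling ranges are assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # muzzle velocity in m/s (assumed range)
        theta = rng.uniform(0.0, math.pi / 2)  # launch angle in radians (assumed range)
        rows.append((v, theta, cannon_range(v, theta, g)))
    return rows

data = make_moon_cannon()
```

Because no noise is added, a tool that fits this data perfectly has recovered the formula itself, which is what makes it a useful estimation benchmark.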

Evaluation of fourteen DM tools

[Figure: evaluation results for the fourteen DM tools]

Strengths and Weaknesses

Strengths
- Ease of use (Scenario, WizWhy, ...)
- Data visualisation (S-Plus, MineSet, ...)
- Depth of algorithms (tree options) (CART, See5, S-Plus, ...)
- Multiple neural network architectures (NeuroShell)

Weaknesses
- Difficult file I/O (OLPARS, CART)
- Limited visualisation (PRW, See5, WizWhy)
- Narrow analysis path (Scenario)

How to improve existing DM applications

The top ten points:

Database integration
- no more flat files
- use the millions of dollars spent on data warehousing

Automated model scoring
- without scoring, DM is pretty useless
- scoring should be integrated with the driving applications

Exporting models to other applications
- close the loop between DM and the applications that need to use the results (scores)
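One way to picture exporting a model and scoring with it elsewhere is to serialise the model's parameters in a neutral format and let the consuming application apply them. The sketch below is a minimal illustration under stated assumptions: the model is a hypothetical logistic-regression response model, and the field names, coefficients, and customer records are invented for the example.

```python
import json
import math

# Hypothetical model: coefficients of a logistic-regression response model.
model = {"intercept": -2.0, "weights": {"age": 0.03, "balance": 0.0001}}

# The exported form that the DM tool would hand to another application.
exported = json.dumps(model)

def score(record, model_json):
    """Return the predicted response probability for one record (a dict)."""
    m = json.loads(model_json)
    z = m["intercept"] + sum(w * record[name] for name, w in m["weights"].items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical customer records held by the driving application.
customers = [
    {"id": 1, "age": 40, "balance": 5000.0},
    {"id": 2, "age": 25, "balance": 200.0},
]
scores = {c["id"]: score(c, exported) for c in customers}
```

The point of the design is that the scoring side needs only the exported parameters, not the mining tool itself.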

How to improve existing DM applications

Business templates
- a specific cross-selling application is more valuable than a general modeling tool

Effort knob
- it is relevant in a way that tuning parameters are not

Incorporate financial information
- financial information is very important and often available, and should be provided as input to the DM application

How to improve existing DM applications

Computed target columns
- allow the user to interactively create a new target variable

Time-series data
- a year’s worth of monthly balance information is qualitatively different from twelve distinct non-time-series variables

Use versus View
- do not present the full model visually to the user, only the most important levels

Wizards
- not necessary, but desirable
- prevent human error by keeping the user on track
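The time-series point can be made concrete with a small sketch: two customers with exactly the same twelve monthly balances, in opposite order, look identical if the months are treated as unrelated variables, yet their trends differ completely. The feature names and the balance figures below are hypothetical.

```python
def time_series_features(balances):
    """Summarise an ordered list of monthly balances.

    Returns the mean level, the least-squares slope per month, and the
    largest month-to-month drop (0.0 if the balance never falls).
    """
    n = len(balances)
    t_mean = (n - 1) / 2.0
    y_mean = sum(balances) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(balances))
    largest_drop = max(balances[i] - balances[i + 1] for i in range(n - 1))
    return {
        "mean": y_mean,
        "slope": sxy / sxx,
        "largest_drop": max(largest_drop, 0.0),
    }

# Hypothetical customers: same twelve values, opposite order.
rising = [100.0 + 10.0 * month for month in range(12)]
falling = list(reversed(rising))

rising_features = time_series_features(rising)
falling_features = time_series_features(falling)
```

Both customers have the same mean balance, but the slopes have opposite signs, which is information only an order-aware treatment can see.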

Potential Applications

Data mining has many varied fields of application, some of which are listed below.

Retail/Marketing
- Identify buying patterns from customers
- Find associations among customer demographic characteristics
- Predict response to mailing campaigns
- Market basket analysis

Potential Applications

Banking
- Detect patterns of fraudulent credit card use
- Identify "loyal" customers
- Determine credit card spending by customer groups
- Find hidden correlations between different financial indicators
- Identify stock trading rules from historical market data

Potential Applications

Insurance and Health Care
- Claims analysis, i.e., which medical procedures are claimed together
- Predict which customers will buy new policies
- Identify behaviour patterns of risky customers
- Identify fraudulent behaviour

Potential Applications

Transportation
- Determine the distribution schedules among outlets
- Analyse loading patterns

Medicine
- Characterise patient behaviour to predict office visits
- Identify successful medical therapies for different illnesses
- Predict the effectiveness of surgical procedures or medical tests

Potential Applications

Sport
- Make the best choice of players in different circumstances
- Predict the results of relevant matches
- Produce a better seeding of players in groups or tournaments

DM report from an NBA game

When Price was Point-Guard, J. Williams missed 0% (0) of his jump field-goal attempts and made 100% (4) of his jump field-goal attempts.

The total number of such field-goal attempts was 4.

DM and Customer Relationship Management

- CRM is a process that manages the interactions between a company and its customers
- Users of CRM software applications are database marketers
- Goals of database marketers are:
  - identifying market segments, which requires significant data about prospective customers and their buying behaviors
  - building and executing campaigns
- Tightly integrating the two disciplines presents an opportunity for companies to gain competitive advantage

DM and Customer Relationship Management

- How Data Mining helps Database Marketing
- Scoring
- The role of Campaign Management Software
- Increasing the customer lifetime value
- Combining Data Mining and Campaign Management

DM and Customer Relationship Management

Evaluating the benefits of a Data Mining model

[Figures: Gains chart; Profitability chart]
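The numbers behind a gains chart can be computed directly from model scores and observed responses. The sketch below is a minimal version, assuming customers are ranked by descending score and cut into equal-sized slices; the function name and the ten example customers are hypothetical, and the code assumes at least one responder.

```python
def cumulative_gains(scores, responded, n_bins=10):
    """Cumulative fraction of all responders captured after each of n_bins
    equal-sized slices of customers, ranked by descending model score."""
    ranked = [r for _, r in sorted(zip(scores, responded),
                                   key=lambda pair: pair[0], reverse=True)]
    total = sum(ranked)  # assumed to be at least 1
    n = len(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * n / n_bins)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains

# Hypothetical scored customers: 1 = responded to the campaign, 0 = did not.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
gains = cumulative_gains(scores, responded, n_bins=5)
```

Plotting these values against the fraction of customers contacted, next to the diagonal that a random selection would give, yields the gains chart.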

Data Mining Examples

Bass Brewers
“We’ve been brewing beer since 1777. With increased competition comes a demand to make faster, better-informed decisions.”

Northern Bank
“The information is now more accessible, paperless and timely.”

TSB Group Plc
“We are using Holos because of its flexibility and its excellent multidimensional database.”

Data Mining Examples

Delphic Universities
“Real value is added to data by multidimensional manipulation (being able to easily compare many different views of the available information in one report) and by modeling.”

Harvard - Holden
“Sybase technology has allowed us to develop an information system that will preserve this legacy into the twenty-first century.”

J.P. Morgan
“The promise of data mining tools like Information Harvester is that they are able to quickly wade through massive amounts of data to identify relationships or trending information that would not have been available without the tool.”

Case study of Breast Cancer Survival Analysis

- Case study of the influence of various patient characteristics on survival rates for breast cancer
- The survival analysis technique employed is Cox Regression (this technique is useful in situations where some of the patients do not die during the observation period)
- A linear regression technique could be used instead if all patients had died during the observation period

Case study of Breast Cancer Survival Analysis

- The observation period runs for 133.8 months
- The modeling sample contains 746 patients (50 patients who died during the observation period and 696 who survived beyond the end of the observation period)
- In this example, we are testing only four predictors:
  - Age, in years, at the start of the observation period (22 to 88)
  - Pathological tumor size, in centimeters (0.10 to 7.00)
  - Number of positive axillary lymph nodes (0 to 35)
  - Estrogen receptor status (positive vs. negative)

Case study of Breast Cancer Survival Analysis

- The Cox Regression used a backward stepwise likelihood-ratio variable selection method
- Significance criteria were set at 0.05 for inclusion in the model, and 0.10 for removal from the model

Printout from the final step of the stepwise regression analysis:

________________ Variables in the Equation ______________

Variable       B       S.E.     Wald      df   Sig      R        Exp(B)
AGE          -.0314    .0121    6.7486    1    .0094   -.0893     .9691
PATHSIZE      .3975    .1175   11.4476    1    .0007    .1259    1.4881
LNPOS         .1372    .0361   14.4100    1    .0001    .1443    1.1471
_______________________________________________________

The column labeled "Sig" shows the statistical significance of included variables

The column labeled "R" shows the degree of unique correlation with the dependent variable
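Two of the printed columns can be checked from the others, which also shows how they are related: Exp(B) is the exponential of the coefficient B, and for a single coefficient the Wald statistic is (B / S.E.)^2. The sketch below only re-derives the values from the printout above; small differences in the Wald column are expected because B and S.E. are printed to four decimals.

```python
import math

# (B, S.E., printed Wald, printed Exp(B)) copied from the printout above.
printout = {
    "AGE":      (-0.0314, 0.0121,  6.7486, 0.9691),
    "PATHSIZE": ( 0.3975, 0.1175, 11.4476, 1.4881),
    "LNPOS":    ( 0.1372, 0.0361, 14.4100, 1.1471),
}

recomputed = {}
for name, (b, se, wald_printed, exp_b_printed) in printout.items():
    hazard_ratio = math.exp(b)  # Exp(B): hazard multiplier per unit increase
    wald = (b / se) ** 2        # Wald chi-square statistic with 1 df
    recomputed[name] = (hazard_ratio, wald)
    print(f"{name:9s} exp(B)={hazard_ratio:.4f} (printed {exp_b_printed})  "
          f"Wald={wald:.2f} (printed {wald_printed})")
```

Read this way, Exp(B) for LNPOS says that each additional positive lymph node multiplies the estimated hazard by about 1.15.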

Case study of Breast Cancer Survival Analysis

Some key things to note are:

- Estrogen status was removed as a predictor because it did not reach the 0.05 significance criterion for inclusion
- Number of positive axillary lymph nodes was the strongest predictor of survival rates over the course of the observation period (R=.1443 / Sig.=.0001), followed by pathological tumor size (R=.1259 / Sig.=.0007)
- Age, although significant, is somewhat less influential than the other two predictors (R=-.0893 / Sig.=.0094)

Note that both the number of positive axillary lymph nodes and the pathological tumor size are positively correlated with the dependent variable, which means that they are directly associated with more rapid mortality.

Age is negatively correlated with the dependent variable, which means that older age is predictive of somewhat longer survival.
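The signs of the coefficients can be read off by comparing two patients under the fitted proportional-hazards model, where the ratio of their hazards is exp(sum of B_i times the difference in x_i). The sketch below uses the B values from the printout; the three patients are hypothetical and exist only to show the direction of each effect.

```python
import math

# Coefficients B from the final step of the stepwise Cox regression.
B = {"AGE": -0.0314, "PATHSIZE": 0.3975, "LNPOS": 0.1372}

def relative_hazard(patient, reference):
    """Hazard of `patient` divided by hazard of `reference`:
    exp(sum_i B_i * (x_i - x_ref_i))."""
    return math.exp(sum(B[k] * (patient[k] - reference[k]) for k in B))

# Hypothetical patients, differing from the reference in one variable each.
reference  = {"AGE": 50, "PATHSIZE": 2.0, "LNPOS": 0}
more_nodes = {"AGE": 50, "PATHSIZE": 2.0, "LNPOS": 5}
older      = {"AGE": 60, "PATHSIZE": 2.0, "LNPOS": 0}

hr_nodes = relative_hazard(more_nodes, reference)
hr_older = relative_hazard(older, reference)
```

A ratio above 1 (five extra positive nodes) means a higher estimated hazard than the reference patient, and a ratio below 1 (ten years older) means a lower one, matching the positive and negative signs of B.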

Case study of Breast Cancer Survival Analysis

The following chart shows the cumulative survival function during the observation period:

[Chart: cumulative survival function over the observation period]

- All patients survive through the tenth month of the observation period
- At the fortieth month, the mortality rate increases and continues at this fairly constant increased rate through the forty-fifth month
- At the forty-fifth month, there is a five-month period without additional mortality
- 11% of the original sample has died
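A cumulative survival curve that respects censoring (patients still alive at their last follow-up) can be sketched with the Kaplan-Meier estimator. This is a simpler, related technique named here for illustration, not the covariate-adjusted survival function that a Cox regression printout would plot, and the follow-up data below are hypothetical.

```python
def kaplan_meier(times, died):
    """Kaplan-Meier estimate of the survival function.

    times: follow-up time per patient (months).
    died:  True if death was observed, False if the patient was censored.
    Returns a list of (time, survival_probability), one entry per death time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, d in zip(times, died) if ti == t and d)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve

# Hypothetical follow-up data in months.
times = [12, 40, 41, 45, 60, 133.8, 133.8, 133.8]
died = [False, True, True, True, False, False, False, False]
curve = kaplan_meier(times, died)
```

Censored patients never lower the curve themselves; they only shrink the number at risk, which is why the technique remains usable when most patients outlive the observation period.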

Case study of Breast Cancer Survival Analysis

Conclusions and Implications

- The case study presented here is relatively simple, and is for illustrative purposes only.
- With the addition of more candidate predictors (progesterone receptor status, histologic grade, blood type etc.), an even more powerful model could emerge.
- By understanding the influence of patient characteristics on mortality rates over time, we are in a better position to estimate survival times for individual patients, and to defend using different or more aggressive therapeutic approaches for some patients.

Securities Brokerage Case Study

The following four pages are derived from a copyrighted case study originally created by SmartDrill Data Mining (Marlborough, MA, U.S.A.).

Their website is: http://smartdrill.com

And the original case study appears in its entirety here: http://smartdrill.com/CHAID.html


Securities Brokerage Case Study

Predictive market segmentation model designed to identify and profile high-value brokerage customer segments as targets for special marketing communications efforts.

The dependent variable for this ordinal CHAID model is brokerage account commission dollars during the past 12 months.

We begin by splitting the client's entire customer file into a modeling sample and a validation sample. (Once the model is built using the modeling sample, we apply it to the validation sample to see how well it works on a sample other than the one on which it was built.)
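The split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_customer_file`, the 50/50 proportion, the fixed seed, and the integer customer IDs are all assumptions, since the case study does not state how its samples were drawn.

```python
import random


def split_customer_file(records, validation_fraction=0.5, seed=42):
    """Randomly split records into (modeling, validation) samples.

    The fraction and seed are illustrative assumptions, not values
    taken from the case study.
    """
    shuffled = list(records)               # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_validation = int(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]
    modeling = shuffled[n_validation:]
    return modeling, validation


# Hypothetical customer file: 1000 made-up customer IDs.
customer_ids = list(range(1000))
modeling_sample, validation_sample = split_customer_file(customer_ids)
print(len(modeling_sample), len(validation_sample))
```

Fixing the seed keeps the two samples reproducible, so the model can later be re-checked against exactly the same validation records.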


Securities Brokerage Case Study

The resulting CHAID model has 55 segments. The results are summarized in the following comb chart, showing the segment indexes (indexes of average dollar value):


Securities Brokerage Case Study

Part of the gains chart: Average Annual Brokerage Commission Dollars

The gains chart provides quantitative detail useful for financial and marketing planning.

We have highlighted the top 20% of the file in blue.

The top 20% of the file is worth an average of about $334 per account, which is nearly three times the average account value for the entire sample.
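To make the arithmetic behind a gains chart concrete, here is a minimal Python sketch. It assumes that segments are ranked by average dollar value and that an index is 100 times the cumulative average divided by the overall average; the original chart is not reproduced in this transcript, so that definition, the function name `gains_chart`, and the four segments below are made up for illustration and are not the case study's figures.

```python
def gains_chart(segments):
    """Build cumulative gains rows from (name, n_accounts, total_dollars).

    Segments are ranked by average dollars per account, best first.
    The index definition (100 * cumulative avg / overall avg) is an
    assumption made for this sketch.
    """
    total_accounts = sum(n for _, n, _ in segments)
    total_dollars = sum(d for _, _, d in segments)
    overall_avg = total_dollars / total_accounts

    ranked = sorted(segments, key=lambda s: s[2] / s[1], reverse=True)
    rows = []
    cum_accounts = 0
    cum_dollars = 0.0
    for name, n, dollars in ranked:
        cum_accounts += n
        cum_dollars += dollars
        cum_avg = cum_dollars / cum_accounts
        rows.append({
            "segment": name,
            "avg": dollars / n,
            "cum_pct_of_file": 100.0 * cum_accounts / total_accounts,
            "cum_avg": cum_avg,
            "cum_index": 100.0 * cum_avg / overall_avg,
        })
    return rows


# Made-up segments: (name, number of accounts, total commission dollars).
example_segments = [
    ("S1", 100, 40000),
    ("S2", 100, 30000),
    ("S3", 300, 30000),
    ("S4", 500, 20000),
]
rows = gains_chart(example_segments)
for row in rows:
    print(row)
```

Reading down the `cum_pct_of_file` column to 20% and then across to `cum_avg` gives the kind of statement made above about the top 20% of the file.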


Securities Brokerage Case Study

Using the information in the gains chart, we can better plan our communications/promotion budget.

In general, the best segments represent customers who are experienced, aggressive, self-directed traders.

Other decisions that the gains chart and the segmentation rules can help us make:

We might wish to conduct some market research among customers in under-performing segments, or among under-performing customers in the better segments.

We can use the segment definitions to help us identify possible issues and question areas to include in the survey.

Before we try to apply such a model, we perform a validation against a holdout sample, to confirm that it is a good model.
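One simple way to picture a holdout check is to assign the holdout customers to segments using the same rules and compare segment averages between the two samples. The Python sketch below does that under stated assumptions: the two criteria used (same segment ranking, small relative gap in averages) are illustrative choices, not the validation procedure of the original case study, and all names and records are hypothetical.

```python
def segment_averages(records):
    """Average commission dollars per segment.

    records: iterable of (segment_name, commission_dollars).
    """
    totals, counts = {}, {}
    for segment, dollars in records:
        totals[segment] = totals.get(segment, 0.0) + dollars
        counts[segment] = counts.get(segment, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}


def compare_to_holdout(modeling_records, holdout_records):
    """Return (same_ranking, relative_gap) for the two samples.

    same_ranking: True if segments sort in the same order by average.
    relative_gap: per-segment |holdout avg - modeling avg| / modeling avg.
    Assumes every modeling-sample segment average is non-zero.
    """
    model_avg = segment_averages(modeling_records)
    holdout_avg = segment_averages(holdout_records)
    model_rank = sorted(model_avg, key=model_avg.get, reverse=True)
    holdout_rank = sorted(holdout_avg, key=holdout_avg.get, reverse=True)
    relative_gap = {
        s: abs(holdout_avg[s] - model_avg[s]) / model_avg[s]
        for s in model_avg
        if s in holdout_avg
    }
    return model_rank == holdout_rank, relative_gap


# Made-up records: (segment assigned by the model's rules, dollars).
modeling_records = [("A", 400), ("A", 300), ("B", 100), ("B", 120)]
holdout_records = [("A", 380), ("A", 320), ("B", 90), ("B", 110)]
same_ranking, relative_gap = compare_to_holdout(modeling_records, holdout_records)
print(same_ranking, relative_gap)
```

If the ranking holds and the gaps are small, the segments behave on unseen customers much as they did on the modeling sample; how small is small enough is a judgment the analyst has to make.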


The End