crisp-dm methodology for computational modelling projects
DESCRIPTION
TRANSCRIPT
CRISP-DM: a methodology for applied computational modelling
(required for CMD cw2)
Based on Intro to Data Mining:
CRISP-DM
Prof Chris Clifton, Purdue UnivThanks to Laura Squier, SPSS for some of the material used
CS490D 2
A methodology for projects
• Cross-Industry Standard Process for Data Mining (CRISP-DM): computational modelling of datasets to derive “knowledge”
• European Community funded effort to develop framework for data mining and computational modeling projects
• Goals:– Encourage interoperable tools across entire data mining
process
– Take the mystery/high-priced expertise out of simple data mining and computational modelling tasks
CS490D 3
Why Should There be a Standard Process?
• Framework for recording experience– Allows projects to be
replicated
• Aid to project planning and management
• “Comfort factor” for new adopters– Demonstrates maturity of
computational modelling and data mining
– Reduces dependency on “stars”
The process must be reliable The process must be reliable and repeatable by people with and repeatable by people with little data mining background.little data mining background.
CS490D 4
Process Standardization
• CRoss Industry Standard Process for Data Mining• Initiative launched Sept.1996• http://www.crisp-dm.org/ • SPSS/ISL, NCR, Daimler-Benz, OHRA• Funding from European commission• Over 200 members of the CRISP-DM SIG worldwide
– DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ..
– System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, …
– End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
CS490D 5
CRISP-DM
• Non-proprietary• Application/Industry
neutral• Tool neutral• Focus on business issues
– modelling is core, but the methodology includes steps before and after involving client business
• Framework for guidance• Experience base
– Templates for new applications
CS490D 6
CRISP-DM: Overview
CS490D 7
CRISP-DM: Phases• Business Understanding
– Understanding application domain, project objectives and requirements– Data mining problem definition
• Data Understanding– Initial data collection and familiarization– Identify data quality issues– Initial, obvious results
• Data Preparation– Record and attribute selection– Data cleansing
• Modeling– Run the computational modelling tools, to derive results
• Evaluation– Determine if results meet business objectives– Identify business issues that should have been addressed earlier
• Deployment– Put the resulting models into practice– Set up for repeated/continuous mining of the data
CS490D 8
BusinessUnderstanding
DataUnderstanding
EvaluationDataPreparation
Modeling
Determine Business ObjectivesBackgroundBusiness ObjectivesBusiness Success Criteria
Situation AssessmentInventory of ResourcesRequirements, Assumptions, and ConstraintsRisks and ContingenciesTerminologyCosts and Benefits
Determine Data Mining GoalData Mining GoalsData Mining Success Criteria
Produce Project PlanProject PlanInitial Asessment of Tools and Techniques
Collect Initial DataInitial Data Collection Report
Describe DataData Description Report
Explore DataData Exploration Report
Verify Data Quality Data Quality Report
Data SetData Set Description
Select Data Rationale for Inclusion / Exclusion
Clean Data Data Cleaning Report
Construct DataDerived AttributesGenerated Records
Integrate DataMerged Data
Format DataReformatted Data
Select Modeling TechniqueModeling TechniqueModeling Assumptions
Generate Test DesignTest Design
Build ModelParameter SettingsModelsModel Description
Assess ModelModel AssessmentRevised Parameter Settings
Evaluate ResultsAssessment of Data Mining Results w.r.t. Business Success CriteriaApproved Models
Review ProcessReview of Process
Determine Next StepsList of Possible ActionsDecision
Plan DeploymentDeployment Plan
Plan Monitoring and MaintenanceMonitoring and Maintenance Plan
Produce Final ReportFinal ReportFinal Presentation
Review ProjectExperience Documentation
Deployment
Phases and Tasks
CS490D 9
Example applications
• PAST cw applying CRISP-DM:– Technologies for Knowledge Discovery (MSc
Module, discontinued): analysis of patient data from Duke Hospital referrals
– CMD last year: analysis of Schools League Tables to choose a “good” name for a school
• YOU will also need CRISP-DM for YOUR coursework (not the same)
CS490D 10
Phases in the DM Process (1)
• Business Understanding:– Statement of Business
Objective– Statement of Data
Mining objective– Statement of Success
Criteria
CS490D 11
Phases in TKD’04 cw (1)
• Business Understanding:– Explore patient data,
what can be predicted?
– Explore evidence for/against hypotheses
– Statement of Success Criteria: evidence found?
CS490D 12
Phases in CMD’05 cw (1)
• Business Understanding:– Business Objective:
school name reflecting good performance
– Data Mining objective: find name attributes which predict performance
– Success Criteria: evidence for new school name
CS490D 13
Phases in the DM Process (2)
• Data Understanding– Explore the data and
verify the quality– Find outliers
CS490D 14
Phases in TKD’04 cw (2)
• Data Understanding– Explore the data and
verify the quality: small dataset can be explored with visualization tools
– Find outliers: sparse dataset, many outliers!
CS490D 15
Phases in CMD’05 cw (2)
• Data Understanding– Explore data, decide
metric for good/bad “performance”
– Select and download data for LEAs
– Find outliners, e.g. Special Schools
CS490D 16
Phases in the DM Process (3)
Data preparation:• Takes usually over 90% of the
time– Collection– Assessment– Consolidation and Cleaning
• table links, aggregation level, missing values, etc
– Data selection• active role in ignoring
non-contributory data?• outliers?• Use of samples• visualization tools
– Transformations - create new variables
CS490D 17
Phases in the TKD’04 cw (3)
Data preparation:• Takes usually over 90% of the
time– Not a lot todo: data
supplied in ARFF format!– Transformations - create
new variables – but not in this case.
CS490D 18
Phases in CMD’05 cw (3)Data preparation:• Takes usually over 90% of the
time– Download, save as text-file– Data selection
• ignore non-contributory data e.g. SEN
• Outliers eg Special Schools
• Use of samples?– Transformations - create
new variables: features of name e.g. Grammar, High, town-name, syllables; mapping from existing data to “useful” variables may be non-trivial!
CS490D 19
Phases in the DM Process(4)
• Model(l)ing• Selection of the
modeling techniques is based upon the data mining objective– Modeling can be an
iterative process - different for supervised and unsupervised learning; may model for either description or prediction
CS490D 20
Phases in the TKD’04 cw (4)
• Model building• Try different WEKA
models and options• Easy to “play and see
what happens”• Keep output or
screenshots for models which seem to show evidence
CS490D 21
Phases in CMD’05 cw (4)
• Model building– data mining objective:
find name attributes which predict good performance
– Try various types of model, note any which provide evidence linking an attribute to performance
CS490D 23
Phases in the DM Process(5)
• Model Evaluation– Evaluation of model:
how well it performed on training/test data
– Methods and criteria depend on model type:
• e.g., confusion matrix, mean error rate
– More importantly: are results “useful” for business objective?
CS490D 24
Phases in TKD’04 cw (5)
• Model Evaluation– Evaluation of model:
how well it performed on training/test data
– Methods and criteria depend on model type:
• e.g., confusion matrix, mean error rate
– More importantly: evidence for/against hypotheses about medical predictions
CS490D 25
Phases in CMD’05 cw (5)
• Model Evaluation– Error rates, confusion
matrix with names predicting “good” schools
– Evaluation wrt business objectives: have you found any indicators of a “good” school name?
– Don’t just present the model, explain how it is evidence
CS490D 26
Phases in the DM Process (6)
• Deployment– Report to client– Determine how the
results need to be utilized, who needs to use them?
– How often do they need to be used – should the exercise be repeated?
CS490D 27
Phases in TKD’04 cw (6)
• Deployment– Report to client (me!)– Submit cw, but no
need to think of “future plans”
CS490D 28
Phases in CMD’05 cw (6)
• Deployment– Write a Report, section
for each CRISP-DM Phase, plus Appendices
– Deployment section of report: review of the exercise
• Client should deploy results by:– Utilizing results as
business rules: ?LGS/LGHS merger?
CS490D 29
Why CRISP-DM?
• The data mining process must be reliable and repeatable by people with little data mining skills (e.g. IT consultants, students?...)
• CRISP-DM provides a uniform framework for – guidelines – experience documentation
• CRISP-DM is flexible to account for differences – Different business/agency problems– Different data types (numeric, database, text, ...)
CS490D 32
CRISP-DM: Summary• Business Understanding
– Understanding project objectives and requirements– Data mining problem definition
• Data Understanding– Initial data collection and familiarization– Identify data quality issues– Initial, obvious results
• Data Preparation– Record and attribute selection– Data cleansing
• Modeling– Run the computational modelling tools
• Evaluation– Determine if results meet business objectives– Identify business issues that should have been addressed earlier
• Deployment– Put the resulting models into practice– Set up for repeated/continuous mining of the data