Data Mining Principles (required for cw,
useful for any project…)- a reminder (?)
Based on Intro to Data Mining:
CRISP-DM
Prof Chris Clifton, Purdue UnivThanks also to Laura Squier, SPSS for some of the material
CS490D 2
Data Mining Process
• Cross-Industry Standard Process for Data Mining (CRISP-DM) – a Methodology, not for Software Engineering, but data-analysis work
• European Community funded effort to develop framework for data mining and text mining tasks
• Goals:– Encourage interoperable tools across entire data
mining process, by defining subtasks– Take the mystery/high-priced expertise out of simple
data mining tasks – anyone can do it! (even students)
CS490D 3
Why Should There be a Standard Process?
• Framework for recording experience– Allows projects to be
replicated, “real science”
• Aid to project planning and management
• “Comfort factor” for new adopters– Demonstrates maturity of
Data Mining
– Reduces dependency on “stars”
The data mining process must The data mining process must be reliable and repeatable by be reliable and repeatable by people with little data mining people with little data mining background.background.
CS490D 4
Why standardize the process?
• CRoss Industry Standard Process for Data Mining• Initiative launched Sept.1996• http://www.crisp-dm.org/ • SPSS/ISL, NCR, Daimler-Benz, OHRA• Funding from European commission• Over 200 members of the CRISP-DM SIG worldwide
– DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ..
– System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, …
– End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...– Linkedin.com groups: discussion, job adverts, …
CS490D 5
CRISP-DM
• Non-proprietary• Application/Industry
neutral• Tool neutral• Focus on business issues
and practical problems– As well as technical
analysis• Framework for guidance• Experience base
– Templates and case studies for guidance and analysis
CS490D 6
CRISP-DM: Overview
CS490D 7
CRISP-DM: Phases• Business Understanding
– Understanding project objectives and requirements– Data mining problem definition
• Data Understanding– Initial data collection and familiarization– Identify data quality issues– Initial, obvious results
• Data Preparation– Record and attribute selection– Data cleansing
• Modeling– Run the data analysis and data mining tools
• Evaluation– Determine if results meet business objectives– Identify business issues that should have been addressed earlier
• Deployment– Put the resulting models into practice– Set up for repeated/continuous mining of the data
CS490D 8
BusinessUnderstanding
DataUnderstanding
EvaluationDataPreparation
Modeling
Determine Business ObjectivesBackgroundBusiness ObjectivesBusiness Success Criteria
Situation AssessmentInventory of ResourcesRequirements, Assumptions, and ConstraintsRisks and ContingenciesTerminologyCosts and Benefits
Determine Data Mining GoalData Mining GoalsData Mining Success Criteria
Produce Project PlanProject PlanInitial Asessment of Tools and Techniques
Collect Initial DataInitial Data Collection Report
Describe DataData Description Report
Explore DataData Exploration Report
Verify Data Quality Data Quality Report
Data SetData Set Description
Select Data Rationale for Inclusion / Exclusion
Clean Data Data Cleaning Report
Construct DataDerived AttributesGenerated Records
Integrate DataMerged Data
Format DataReformatted Data
Select Modeling TechniqueModeling TechniqueModeling Assumptions
Generate Test DesignTest Design
Build ModelParameter SettingsModelsModel Description
Assess ModelModel AssessmentRevised Parameter Settings
Evaluate ResultsAssessment of Data Mining Results w.r.t. Business Success CriteriaApproved Models
Review ProcessReview of Process
Determine Next StepsList of Possible ActionsDecision
Plan DeploymentDeployment Plan
Plan Monitoring and MaintenanceMonitoring and Maintenance Plan
Produce Final ReportFinal ReportFinal Presentation
Review ProjectExperience Documentation
Deployment
Phases and Tasks/Reports
CS490D 9
Phases in the DM Process(1)
• Business Understanding:– Statement of Business
Objective– Statement of Data
Mining objective– Statement of Success
Criteria
CS490D 10
Phases in cw DM Process(1)
• Business Understanding:– Business Objective: attract
Language academics to DM (to be our “customers”?)
– Data Mining objective: is domain English classed as UK or US English? (classify by salient features)
– Success Criteria: specific evidence: set of features which classify UK and US training data correctly, used to classify domain data-sets
CS490D 11
Phases in the DM Process(2)
• Data Understanding– Collect data– Describe data– Explore the data – Verify the quality and
identify outliers
CS490D 12
Phases in cw DM Process(2)
• Data Understanding– Select domain corpora to fit
region covered by journal– Describe texts: size,
sources, markup, …– Explore the texts – can you
see any obvious indications they are UK/US?
– Verify the quality (are texts really from your domain? Errors? Repetitions?) and identify outliers (texts which don’t “belong”)
CS490D 13
Phases in the DM Process (3)
Data preparation:
• Can take over 90% of the time
– Consolidation and Cleaning
• table links, aggregation level, missing values, etc
– Data selection
• Remove “noisy” data, repetitions, etc
• Remove outliers?
• Select samples
• visualization tools
– Transformations - create new variables, formats
CS490D 14
Phases in cw DM Process (3)
Data preparation:• May take up to 90% of the time• Select Data • Rationale for Inclusion /
Exclusion: if it isn‘t really from your domain – remove
• Clean Data • Remove repetitions• Remove headers, footers,
tables, pictures etc (BootCat does this automatically)
• Transform Data• Convert to plain text (ditto)• Reduce to word-frequency list,
keyword-freqs can be features in machine-learning
CS490D 15
Phases in the DM Process(4)
• Model building– Selection of the
modeling techniques is based upon the data mining objective
– Modeling can be an iterative process; may model for either description or prediction
CS490D 16
Phases in cw DM Process(4)
• Model building– Data Mining objective: is
domain English classed as UK or US English? (classify by salient features)
– “model” can be Decision Tree (or NN, or other classifier) based on freqs of UK-only terms and US-only terms (and sources used to derive these)
– Data Visualization or On-Line Analytical Processing (OLAP) as well as Data Mining
CS490D 17
Phases in the DM Process(5)
• Model Evaluation– Evaluation of model: how
well it performed, how well it met business needs
– Methods and criteria depend on model type:
• e.g., confusion matrix with classification models, mean error rate with regression models
– Interpretation of model: important or not, easy or hard depends on algorithm
CS490D 18
Phases in cw DM Process(5)
• Model Evaluation– Evaluation of model:
have you found and quantified key differences between UK, US English, to classify domain data?
– Interpretation: don’t just present the results, try to explain possible reasons
CS490D 19
Phases in the DM Process (6)
• Deployment– Determine how the results
need to be utilized– Who needs to use them?– How often do they need to
be used
• Deploy Data Mining results by:– Utilizing results as
business rules– Publishing report for users,
with recommendations to improve their business
CS490D 20
Phases in cw DM Process (6)
• Deployment– Produce a scientific
report: Intro, Methods, Results, Conclusion; PowerPoint Movie Maker YouTube
– Utilizing results as business rules: attract Language researchers to use text mining (as “customers” or collaborators for SoC researchers)
CS490D 21
Why CRISP-DM?
• The data mining process must be reliable and repeatable by people with little data mining skills (e.g. IT Consultants, students?...)
• CRISP-DM provides a uniform framework for – guidelines – experience documentation
• CRISP-DM is flexible to account for differences – Different business/agency problems– Different data
Why DM?: Concept Description
• Descriptive vs. predictive data mining– Descriptive mining: describes concepts or task-
relevant data sets in concise, summarative, informative, discriminative forms
– Predictive mining: Based on data and analysis, constructs models from the data-set, and predicts the trend and properties of unknown data
• Concept description: – Characterization: provides a concise and succinct
summarization of the given collection of data– Comparison: provides descriptions comparing two or
more collections of data
CS490D 23
DM vs. OLAP
• Data Mining: – can handle complex data types of the
attributes and their aggregations– a more automated process
• Online Analytic Processing (visualization): – restricted to a small number of dimension and
measure types– user-controlled process
CS490D 24
CRISP-DM: Summary• Business Understanding
– Understanding project objectives and requirements– Data mining problem definition
• Data Understanding– Initial data collection and familiarization– Identify data quality issues– Initial, obvious results
• Data Preparation– Record and attribute selection– Data cleansing
• Modeling– Run the data mining tools
• Evaluation– Determine if results meet business objectives– Identify business issues that should have been addressed earlier
• Deployment– Put the resulting models into practice– Set up for repeated/continuous mining of the data