a practical look at data preparation jason brown cognicase inc. data mining irmac: data warehouse...
TRANSCRIPT
![Page 1: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/1.jpg)
A Practical Look at Data Preparation
Jason BrownJason BrownCognicase Inc.Cognicase Inc.
Data MiningData Mining
IRMAC: Data Warehouse SIGNovember 5, 2002
![Page 2: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/2.jpg)
2
Agenda
• Crash Course in Data Mining– What– Why– How
• The virtuous cycle– Data Preparation
• Case Study– Background– Going through the cycle
• Data Preparation
• Q&A
![Page 3: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/3.jpg)
3
The Crash Course
•What
•Why
•How
![Page 4: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/4.jpg)
4
Definitions
Data Mining:The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
Knowledge Discovery
Data Mining is not Data Warehousing, OLAP etc.
![Page 5: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/5.jpg)
5
Definitions
Modeling:• Not an ER type Data Model• A data mining model is computational, full of
algorithms• A model can be descriptive or predictive.
– A descriptive model helps in understanding underlying processes or behavior.
– A predictive model uses known values (input) to predict an unknown value (output)
![Page 6: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/6.jpg)
6
Two Types of Data Mining
• Directed– Know specifically what we are looking for
• Who is likely to respond to our offer?• What our customers going to be worth to us over their lifetime?
– Model is a Black Box
Input Output
![Page 7: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/7.jpg)
7
Two Types of Data Mining
• Undirected– Not exactly sure what we are looking for
• How should we define our Customer Segments?• What is interesting about all of our point of sale data?
– Model is a Transparent Box
Input Output
![Page 8: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/8.jpg)
8
Modeling Techniques
Decision Trees
If …….. Then ……..
Rule Induction
Neural Networks
Nearest Neighbour
Clustering
![Page 9: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/9.jpg)
9
Modeling TechniquesDecision Trees
• The tree is built based on the input of a training data set– Training Data Set is based on historical data– Over sample the data that reflects your question
• Each record of the Model Set is run through the branches of the tree until the record reaches a leaf
![Page 10: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/10.jpg)
10
Modeling TechniquesDecision Trees
Age < 23
YNIncome < 12 000
N
N
Y
Y
Male ?
28% 37%
![Page 11: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/11.jpg)
11
Modeling TechniquesNeural Networks
• Neural networks are a nonlinear model -similar to a ‘brain’.
• The network is built based on the input of a training set.
• Model sets run through this network will return accurate results based on the patterns identified in the training set.
• Very Complex
![Page 12: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/12.jpg)
12
Modeling TechniquesClustering
• Clustering finds groups of records that are similar.
• For example, customers can be clustered by:– income
– age
– ytd revenue
![Page 13: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/13.jpg)
13
Modeling TechniquesClustering
MaleIncome < 12 000Age < 23Coke Buyers
MaleIncome < 12 000Age < 23Non Coke Buyers
![Page 14: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/14.jpg)
14
Modeling TechniquesNearest Neighbour
• Model is built based on the input of a training set.
• Classifies a record by calculating the distances between the record criteria and the training data set
• Then it assigns the record to the class that is most common among its nearest neighbours
![Page 15: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/15.jpg)
15
Modeling TechniquesNearest Neighbour
Records plotted based on:
IncomeGenderAge
Bought Coke
Bought Coke
Bought Coke
Did Not Did Not
![Page 16: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/16.jpg)
16
Modeling TechniquesRule Induction
• A technique that infers generalizations from the information in the data
IF age < 19 AND purchase is coke THEN 40% purchase chips
• Describes the data, allows us to visualize what is going on
![Page 17: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/17.jpg)
17
The Crash Course
•What
•Why
•How
![Page 18: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/18.jpg)
18
The Reasons to Mine Data
Expenses
IncreaseProfit
Revenues
![Page 19: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/19.jpg)
19
The Reasons to Mine Data
• For Marketing/CRM– Targeting prospects – Predicting future customer behaviour– Costs Revenues
• For Research– Identify drugs likely to be successful– Costs
• For Process Improvement– Identify causes of production failures– Costs
![Page 20: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/20.jpg)
20
The Crash Course
•What
•Why
•How
![Page 21: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/21.jpg)
21
Process
• Many different processes for Data Mining– Vendor Driven
• SAS - SEMMA– Sample, Explore, Modify, Model, Assess
• SPSS - 5 A’s– Assess, Access, Analyze, Act, Automate
– Consulting Companies– The Virtuous Cycle
• Michael Berry and Gordon Linoff
![Page 22: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/22.jpg)
22
Business Problem
Transform Data
Act
Measure
ProcessThe Virtuous Cycle
![Page 23: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/23.jpg)
23
• Define the business problem• Understand the business and the rules• Determine if Data Mining fits the need• Understand the value to the business of
solving the problem
Business Problem
![Page 24: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/24.jpg)
24
Data for Data Mining
• Type of Data Values– Categorical
• Defined set of values• Ontario, Quebec, PEI …
– Ranks• High, Medium, Low• 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000
– Intervals• Date• Time • Temperature
– True Numeric• Values that support numeric operations
![Page 25: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/25.jpg)
25
Transform DataSteps
Identify Data1
Prepare Model Set6
Add Derived Variables5
Transpose to Right Granularity4
Validate & Clean3
Obtain Data2
Conduct Modeling7
![Page 26: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/26.jpg)
26
Transform Data
Identify Data1
Step 1 - Identify Data
• What data is required to meet the modeling need?
• What data is available?
![Page 27: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/27.jpg)
27
Transform DataStep2 - Obtain Data
Identify Data
Obtain Data
1
2
• OLTP• Data Warehouse• Data Marts and OLAP• Self Reported• External
![Page 28: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/28.jpg)
28
Transform DataStep 3: Validate & Clean
Identify Data
Obtain Data
Validate & Clean
1
3
2
• Data Issues:– Missing– Fuzzy– Incorrect– Outliers
• Solutions:– Change Source– Filter Out– Ignore– Integrate– Predict– Derive a New Variable
![Page 29: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/29.jpg)
29
Transform DataStep 4: Transpose to right granularity
Identify Data
Obtain Data
Transpose to Right Granularity
Validate & Clean
1
4
3
2
• Data sets for Data Mining need one view, one record
• Grain must be consistent throughout– Aggregates can be problematic– Atomic data is often required to build data set
• Training data sets cast from point in time of event looking back
![Page 30: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/30.jpg)
30
Transform DataStep 5: Add Derived Variables
Identify Data
Obtain Data
Transpose to Right Granularity
Validate & Clean
Add Derived Variables
1
5
4
3
2
• Combined Columns• Summarizations• Features from Columns• Time Series
![Page 31: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/31.jpg)
31
Transform DataStep 6: Prepare Model Set
Identify Data
Obtain Data
Transpose to Right Granularity
Validate & Clean
Add Derived Variables
Prepare Model Set
1
6
5
4
3
2
• The Actual Input to the modeling
![Page 32: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/32.jpg)
32
Transform DataStep 7: Conduct Modeling
Identify Data
Obtain Data
Transpose to Right Granularity
Validate & Clean
Add Derived Variables
Prepare Model Set
Conduct Modeling
1
6
5
4
3
2
7
• Get our result– Decision Trees– Neural Networks– Clustering – Nearest Neighbour– Rule Induction
![Page 33: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/33.jpg)
33
Act
The Business has to actually do something with the results or what was the point?
Marketing or Retention Campaigns
Business Changes
![Page 34: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/34.jpg)
34
Measure
• Answer 2 Questions– Was the Data Mining effort accurate? – Were the Business Actions successful?
• Use different sets of data to compare real results– Actioned Customers vs. Non Actioned
• Accuracy Types– Absolute
• Our prediction was 80% of Group D would buy Coke and 78% really did
– Relative• Our prediction was 80% of Group D would buy Coke but
57 % really did, however Group C which we predicted had a 60% propensity to buy Coke actually bought Coke 42% of the time
![Page 35: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/35.jpg)
35
And back around
Business Problem
Transform Data
Act
Measure
![Page 36: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/36.jpg)
36
The Case Study
![Page 37: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/37.jpg)
37
The Case Study
• Background– The Business– Data Warehouse Overview– Strengths and Challenges
• The Project– Business Problem– Transform Data– Act– Measure
![Page 38: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/38.jpg)
38
The Case Study
• Background– The Business– Data Warehouse Overview– Strengths and Challenges
• The Project– Business Problem– Transform Data– Act– Measure
![Page 39: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/39.jpg)
39
The Business
• One of the top 3 (4?) cellular phone providers in Canada
• Recent Acquisitions– Clearnet– Quebectel
• Important Business Concepts– Handset– Subscriber– Client– Activity - Activations, Deactivations– Churn– Usage
![Page 40: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/40.jpg)
40
Cubes
Catalogues
Sources
Staging
Integration (3N F)
Reports
Data Marts(Dimensional)
DW Environment
![Page 41: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/41.jpg)
41
• Commitment to Data Warehousing• Prior Experience in Data Mining• Tools already Established• Strong business support for the outcomes Data
Mining would provide
Strengths at
![Page 42: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/42.jpg)
42
• Data Warehouse still in midst of major re-architecture effort
• Ongoing billing system integration projects• A data mart for data mining had existed
(Clearnet) but it was a victim of both of the above– Successful at Churn Prediction
Challenges at
![Page 43: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/43.jpg)
43
The Case Study
• Background– The Business– Data Warehouse Overview– Strengths and Challenges
• The Project– Business Problem– Transform Data– Act– Measure
![Page 44: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/44.jpg)
44
Using the Virtuous Cycle
Business Problem
Transform Data
Act
Measure
![Page 45: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/45.jpg)
45
Business Problems
• Churn Modeling– predict which subscriber is likely to leave
• Behavioural Segmentation– clustering subscribers into subgroups based on some
commonality– revenue, usage, demographic
• Client Value Estimation– the present value of all future profits generated throughout
the lifetime of that client
![Page 46: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/46.jpg)
46
Transform Data Steps
Identify Data
Obtain Data
Transpose to Right Granularity
Validate & Clean
Add Derived Variables
Prepare Model Set
Conduct Modeling
1
6
5
4
3
2
7
![Page 47: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/47.jpg)
47
Identify Data
EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER!
• Business Wanted:
1
• They provided a detailed list, by subject area, of the variables that they believed were required to conduct the kind of Data Mining Activities desired.
![Page 48: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/48.jpg)
48
• 19 Subject Areas identified with up to 75 variables each - What is the Priority?
• Avoid the big bang - How much can we actually do?
• Where to source the data from?• Resources - Who is going to do it?
Identify DataIT Challenges
1
![Page 49: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/49.jpg)
49
• First we asked the business to rate each variable as H, M or L priority– Almost everything was given an H
• Then we asked the business to rank the subject areas in order of importance– Hard to convince them of the value– Hard to find consensus– Necessary for determining a release strategy
Identify DataPrioritizing
1
![Page 50: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/50.jpg)
50
Obtain Data
• For each Subject Area and each variable we assessed and documented the following:– Where can it be sourced from (and when)?– What are the known issues?– Q&A back and forth on the variables with business– Identified possible additional variables
2
![Page 51: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/51.jpg)
51
+ =Release Strategy:•1 Release a Quarter
•3 planned releases
•15 of 19 Subject Areas
Obtain Data
2
![Page 52: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/52.jpg)
52
Obtain Data
• Data Mining Access tool against sources– Data Warehouse– OLTP– External Data
• Creation of flat files to feed to Data Mining team– Multi source– Validate and Clean
• Build a Data Mart for Data Mining– Part of the Data Warehouse Architecture
2
![Page 53: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/53.jpg)
53
Obtain Data
Sources
Staging
Integration (3N F)
Data Marts(Dimensional)
Data Mart forData Mining
![Page 54: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/54.jpg)
54
Validate & Clean Data
Validate & Clean3
Staging
Integration (3N F)
Data Marts
Data Mart forData Mining
• Clean?– Not Fuzzy– Correct - mostly– Missing and Outliers
![Page 55: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/55.jpg)
55
Transpose to Right Granularity
• The Grain is Subscriber by Bill Cycle• Monthly Snapshots of subscriber Data• Fact Data Cast to Bill Cycle
Transpose to Right Granularity4
BILLED_AMOUNTS
SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_BILLED_REVENUE_FACTSUPDATE_DTLOAD_DT
PRODUCT_SKU
PRODUCT_SKU_KEY
PRODUCT_INFOUPDATE_DTLOAD_DT
SUBSCRIBER_BILL_CYCLE_SNAPSHOT
SUBSCRIBER_IDMONTH
CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT
VALUE_ADDED_SERVICES
VALUE_ADDED_SERVICES_KEY
VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT
VALUE_ADDED_SERVICES_BILLED
MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)
VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT
FACTS_5SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
FACTS_4SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
FACTS_3SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
FACTS_2SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DTSUBSCRIBER_BILL_CYCLE_SNAPSHOT
SUBSCRIBER_IDMONTH
CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT
CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOTCLIENT_ACCOUNT_IDCLIENT_SNAPSHOT_MONTH
INFO_ON_THE_ACCOUNTACCOUNT_NO_OF_CANCELLED_SUBSACCOUNT_NO_OF_SUSPENDED_SUBSACCOUNT_NO_OF_ACTIVE_SUBSLOAD_DTUPDATE_DT
![Page 56: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/56.jpg)
56
Add Derived Variables
• Combined Columns– PERCENT_BUCKET_USED
• IN_BUCKET_CALLS / RATE_PLAN_BUCKET_MINUTES)
• Summarizations– Sphere of Influence
• UNIQUE_CALLED_NUMBER_CNT• UNIQUE_CALLING_NUMBER_CNT
Add Derived Variables5
![Page 57: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/57.jpg)
57
Add Derived Variables
• Features from Columns– CONTRACT_INDICATOR (Y/N Flag)
• Was the the date of snapshot for that subscriber between COMMIT_START_DATE and COMMIT_END_DATE
• Time Series
Add Derived Variables5
BLOCKED_USAGE_WORK
SUBSCRIBER_ID: NUMBER(22)USAGE_DT: DATE
BLOCKED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE
BLOCKED_USAGE
MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)
BLOCKED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE
VALUE_ADDED_SERVICES_BILLED
MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)VALUE_ADDED_SERVICES_KEY: NUMBER
VALUE_ADDED_SERVICES_AMT: NUMBER(22,2)UPDATE_DT: DATELOAD_DT: DATE
SUBSCRIBER_BILL_CYCLE_SNAPSHOT
SUBSCRIBER_ID: NUMBER(22)MONTH: NUMBER(8)
CLIENT_ACCOUNT_ID: NUMBER(22)CLIENT_SNAPSHOT_MONTH: NUMBER(8)SUBSCRIBER_CD: VARCHAR2(50)TIME_REMAINING_ON_CONTRACT: NUMBERCONTRACT_END_DT: DATECONTRACT_START_DT: DATECONTRACT_TERM: NUMBERCONTRACT_IND: VARCHAR2(1)SUBSCRIBER_STATUS_CD: VARCHAR2(50)ACTIVATION_CHANNEL_ORG_CD: VARCHAR2(50)ACTIVATION_REASON_CD: VARCHAR2(50)ACTIVATION_DT: DATEPOSTAL_CODE: VARCHAR2(20)PROVINCE_STATE_CD: VARCHAR2(50)LAST_NM: VARCHAR2(60)LANGUAGE_CD: VARCHAR2(50)BIRTHDATE: DATEGENDER: VARCHAR2(1)DRIVERS_LICENSE: VARCHAR2(20)FIRST_NM: VARCHAR2(50)SOCIAL_INSURANCE_NO: VARCHAR2(20)MOBILE_PHONE_NO: VARCHAR2(50)DEACTIVATION_REASON_CD: VARCHAR2(50)DEACTIVATION_DT: DATEREVENUE_BAND_CD: VARCHAR2(50)SEGMENT_CD: VARCHAR2(50)SUSPEND_REASON_CD: VARCHAR2(50)SUSPEND_DT: DATECURRENT_HANDSET_KEY: NUMBERFIRST_HANDSET_KEY: NUMBERUPDATE_DT: DATELOAD_DT: DATEAGE_OF_CURRENT_HANDSET: NUMBERPREPAID_IND: VARCHAR2(1)REVENUE_IND: VARCHAR2(1)RATE_PLAN_BUCKET_MINUTES: NUMBER(9,2)RATE_PLAN_GROUP_CD: VARCHAR2(50)RATE_PLAN_CD: VARCHAR2(50)SUBSCRIBER_TENURE_DAYS: NUMBERSUBSCRIPTION_ID: VARCHAR2(50)TECHNOLOGY_TYPE_CD: VARCHAR2(50)AGE: NUMBER(4,1)
PRODUCT_SKU_KD2
PRODUCT_SKU_KEY: NUMBER
PRODUCT_GROUP_DES: VARCHAR2(240)PRODUCT_GROUP_CD: VARCHAR2(50)PRODUCT_GROUP_ID: VARCHAR2(50)ENGLISH_PRODUCT_DES: VARCHAR2(240)PRODUCT_CD: VARCHAR2(50)REFURBISHED_HANDSET_IND: VARCHAR2(1)WEB_READY_IND: VARCHAR2(1)SIM_CARD_IND: VARCHAR2(1)PRODUCT_ID: VARCHAR2(50)UPDATE_DT: DATELOAD_DT: DATE
DROPPED_USAGE_WORK
SUBSCRIBER_ID: NUMBER(22)USAGE_DT: DATE
DROPPED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE
DROPPED_USAGE
MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)
DROPPED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE
BUCKET_USAGE
MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)
UPDATE_DT: DATELOAD_DT: DATEPERCENT_BUCKET_USED: NUMBER
BILLED_USAGE_3_MTH_AVG
MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)
TOTAL_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)LONG_DISTANCE_CALLS_MOU_3_MTH_: NUMBER(22,5)WEEKEND_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)EVENING_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)PEAK_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)UPDATE_DT: DATELOAD_DT: DATE
BILLED_USAGE
MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)
TOTAL_CALLS_MOU: NUMBER(22,5)TOTAL_CALLS_CNT: NUMBERLONG_DISTANCE_CALLS_MOU: NUMBER(22,5)LONG_DISTANCE_CALLS_CNT: NUMBERWEEKEND_CALLS_MOU: NUMBER(22,5)WEEKEND_CALLS_CNT: NUMBEREVENING_CALLS_MOU: NUMBER(22,5)EVENING_CALLS_CNT: NUMBERPEAK_CALLS_MOU: NUMBER(22,5)PEAK_CALLS_CNT: NUMBERUPDATE_DT: DATELOAD_DT: DATEOUT_BUCKET_CALLS_MOU: NUMBER(22,5)IN_BUCKET_CALLS_MOU: NUMBER(22,5)OUT_BUCKET_CALLS_CNT: NUMBERIN_BUCKET_CALLS_CNT: NUMBERROAMING_CALLS_MOU: NUMBER(22,5)ROAMING_CALLS_CNT: NUMBER
SUBSCRIBER_ID_KD2_TEMP
SUBSCRIBER_ID: NUMBER(22)
ACCOUNT_CYCLE_BILL_DAY: NUMBER(2)
![Page 58: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/58.jpg)
58
Prepare Model Set
Prepare Model Set6
FACTS_5
SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
FACTS_4
SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
FACTS_3
SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
FACTS_2
SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
FACTS_1
SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_FACTSUPDATE_DTLOAD_DT
BILLED_AMOUNTS
SUBSCRIBER_ID (FK)MONTH (FK)
LOTS_OF_BILLED_REVENUE_FACTSUPDATE_DTLOAD_DT
DROPPED_AND_BLOCKED_USAGE
DROPPED_CALLS_CNTUPDATE_DTLOAD_DT
EQUIPMENT_ACTIVITY_SNAPSHOT
SUBSCRIBER_ID (FK)MONTH (FK)
HANDSET_COUNTABLE_DATAUPDATE_DTLOAD_DT
PRODUCT_SKU
PRODUCT_SKU_KEY
PRODUCT_INFOUPDATE_DTLOAD_DT
SUBSCRIBER_BILL_CYCLE_SNAPSHOT
SUBSCRIBER_IDMONTH
CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT
VALUE_ADDED_SERVICES
VALUE_ADDED_SERVICES_KEY
VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT
VALUE_ADDED_SERVICES_BILLED
MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)
VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT
BLOCKED_USAGE
BLOCKED_CALLS_CNTUPDATE_DTLOAD_DT
CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOT
CLIENT_ACCOUNT_IDCLIENT_SNAPSHOT_MONTH
INFO_ON_THE_ACCOUNTACCOUNT_NO_OF_CANCELLED_SUBSACCOUNT_NO_OF_SUSPENDED_SUBSACCOUNT_NO_OF_ACTIVE_SUBSLOAD_DTUPDATE_DT
08/01/2002 1,735,436.26 281,032.21 388,970.91 0.00 516,911.31 462,667.55 749,330.77 6,371.95 500,183.42 08/14/2002
• Select Sample– Population– Data for given
modeling effort
• Denormalize completely
![Page 59: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/59.jpg)
59
Prepare Model Set
Prepare Model Set6
Denormalize completely
Example of Denormalization for Data Mining
Handset 1 Handset 2 Handset 3 Handset 40 1 0 0
BILLED_USAGE
MONTH (FK)SUBSCRIBER_ID (FK)
TOTAL_CALLS_MOUTOTAL_CALLS_CNTLONG_DISTANCE_CALLS_MOULONG_DISTANCE_CALLS_CNTWEEKEND_CALLS_MOUWEEKEND_CALLS_CNTEVENING_CALLS_MOUEVENING_CALLS_CNTPEAK_CALLS_MOUPEAK_CALLS_CNTUPDATE_DTLOAD_DTOUT_BUCKET_CALLS_MOUIN_BUCKET_CALLS_MOUOUT_BUCKET_CALLS_CNTIN_BUCKET_CALLS_CNTROAMING_CALLS_MOUROAMING_CALLS_CNT
BILLED_USAGE_3_MTH_AVG
MONTH (FK)SUBSCRIBER_ID (FK)
TOTAL_CALLS_MOU_3_MTH_AVGLONG_DISTANCE_CALLS_MOU_3_MTH_WEEKEND_CALLS_MOU_3_MTH_AVGEVENING_CALLS_MOU_3_MTH_AVGPEAK_CALLS_MOU_3_MTH_AVGUPDATE_DTLOAD_DT
BUCKET_USAGE
MONTH (FK)SUBSCRIBER_ID (FK)
UPDATE_DTLOAD_DTPERCENT_BUCKET_USED
PRODUCT_SKU
PRODUCT_SKU_KEY
PRODUCT_INFOUPDATE_DTLOAD_DT
SUBSCRIBER_BILL_CYCLE_SNAPSHOT
SUBSCRIBER_IDMONTH
CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT
VALUE_ADDED_SERVICES
VALUE_ADDED_SERVICES_KEY
VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT
VALUE_ADDED_SERVICES_BILLED
MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)
VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT
![Page 60: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/60.jpg)
60
Prepare Model Set
Prepare Model Set6
Example of Casting a a Test Set for Churn Modeling
• Sliding Windows Concept
O c t N o v D e c J a n F e b M a r A p r M a y J u n J u l
M o d e l S e t F e b 3 2 1 X PM o d e l S e t M a r 3 2 1 X PM o d e l S e t A p r 3 2 1 X P
S c o r e S e t 3 2 1 X P P
![Page 61: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/61.jpg)
61
Conduct Modeling
7
• Done by business team with help from Vendor (SAS)– Decision Trees and Neural Networks (Churn)– Clustering (Segmentation)
Decision Trees
Neural Networks
Clustering
![Page 62: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/62.jpg)
62
Various Marketing and CRM Activities
The Action
![Page 63: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/63.jpg)
63
Measuring
• Not there yet– Intend to compare actioned vs. non-actioned results– 50 % to be actioned
![Page 64: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0f5503460f949e5255/html5/thumbnails/64.jpg)
64
And the cycle continues ...
Business Problem
Transform Data
Act
Measure