how to recognize and fix data faults that can make or...
TRANSCRIPT
Cary S. Shaw & Associates, LLC
How to Recognize
and Fix Data Faults
That Can Make or Break Analytics
Data
Interpretation
Sampling
By Cary S. Shaw & Associates, LLC
For further information contact (203)505-3180
© Copyright Cary S. Shaw 2013
For dissemination, permission required from copyright holder.
Cary S. Shaw & Associates, LLC
Get the Data Quality Right
Business depends on data.
Whether you are:
Requestor
Developer
Producer
Interpreter
Analytics Consultant
Analytics Recipient
Cary S. Shaw & Associates, LLC
We will cover:
Qualify
Sniff, Define, Frequency Check
Interpret
Lurk, Miss, Population
Sample
Methods, Application
Summary
Tools, Lessons
Cary S. Shaw & Associates, LLC
How many products are in company’s
data base?
Marketing client sent request to IT
Detail result provided in Excel.
Marketing client relayed file: Answer: 1,000.
Cary S. Shaw & Associates, LLC
How many products are in company’s
data base?
Marketing client sent request to IT
Detail result provided in Excel.
Marketing client relayed file: Answer: 1,000.
What happened: SQL Navigator timed out.
Real answer: 3,646.
Cary S. Shaw & Associates, LLC
Spreadsheet data
received in this form.
Business Name Street City State Zip Phone
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx
xxxxxxxxxx xxxxxxxxxx xxxxx xx
Cary S. Shaw & Associates, LLC
Assume data is in error.
Search for errant values.
Frequency tables
Outliers - Confidence intervals
Outliers - One variable graphed/estimated against
another
Referential Integrity (cross table connections within an
integrated database)
Sniff test
Don’t dismiss data if it is different; only if it is wrong.
Cary S. Shaw & Associates, LLC
How unique are our customer phone
numbers?
Are we really reaching the person / business we intend?
Are we alienating by multiple calls?
Are we missing multiple clients?
Practical question: How frequently does the same phone
number appear on our customer contact file?
Cary S. Shaw & Associates, LLC
Exploring Dividend Yields
Found company with 7.2 % dividend yield
What’s going on?
Cary S. Shaw & Associates, LLC
Could a ‘9’ or ‘0’
mean something else?
Cary S. Shaw & Associates, LLC
How do we want to select business
vs. home prospects?
H = Home
B = Business
And how do we want to convert this to a numeric
factor?
Cary S. Shaw & Associates, LLC
Select B (business) or Reject H
(home)? Different impact.
Other values have impact
Is field up-to-date?
Is client (blank) different
from proprietary (blank)?
Is (blank) or misspelling a
proxy for other
characteristics?
How about Duns Industry
sector?
Business, Home Frequency
(blank) 98,947
B 5,032
b 159
Business 49
Busn 1
H 24,891
h 475
Home 63
Hume 1
Hme 11
Cary S. Shaw & Associates, LLC
Riddles
Product Usage a predictor of retention
Optimum student loan amount
Effect of sales channel on retention
Longevity dependency on location
Cary S. Shaw & Associates, LLC
Fitting a line: effect of product usage on retention
Large database
Cary S. Shaw & Associates, LLC
Effect of Product Usage (Line fitted to
data points) Effect of Usage on Cancel Rate??
0 100 200 300 400 500
Meter Usage
Can
cel R
ate
Cancel Rate
Linear (Cancel Rate)
Cancel Rate = f(Usage) = constant
Cary S. Shaw & Associates, LLC
Is ‘zero’ really ‘missing’? Impact
on bining and fitting. Effect of Usage on Cancel Rate??
0 100 200 300 400 500
Meter Usage
Can
cel R
ate
Cancel Rate
Linear (Cancel Rate)
Cary S. Shaw & Associates, LLC
Student loans and loan delinquency
Cary S. Shaw & Associates, LLC
Do small loans result in high
delinquencies? Dismissal factor? Effect of Loan Amount on Delinquency
0.0000%
1.0000%
2.0000%
3.0000%
4.0000%
5.0000%
6.0000%
7.0000%
2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000 24000 26000
Loan Amount
De
lin
qu
en
cy
Rate
.
Loans in Collection
Cary S. Shaw & Associates, LLC
1 out of 10 in the original
population
Cary S. Shaw & Associates, LLC
Looks like 1 out of 4 in the
remaining population
Cary S. Shaw & Associates, LLC
Watch the whole population of interest
(affects the denominator) Effect of Loan Amount on Delinquency
0.0000%
1.0000%
2.0000%
3.0000%
4.0000%
5.0000%
6.0000%
7.0000%
2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000 24000 26000
Loan Amount
De
lin
qu
en
cy
Rate
.
Loans in Collection
Original Loans in Collection
Cary S. Shaw & Associates, LLC
Where you live can affect your life span.
Connecticut ?
Cary S. Shaw & Associates, LLC
Where is Mortality Lowest? Mortality (Prob of Dying Next Yr) As % of U.S. Rate
Alaska............, 57.0
Connecticut......., 102.5
50.0
60.0
70.0
80.0
90.0
100.0
110.0
Reported Mortality Index
Ind
ex (
US
= 1
00)
Alaska............
Connecticut.......
Cary S. Shaw & Associates, LLC
Is Connecticut Mortality Above Avg? Connecticut Mortality (Prob of Dying Next Yr) As % of U.S. Rate
Reported Total
Under 1
1-4
5-14 15-24
25-34
35-44
45-54
55-64
65-74
75-84
85 years
50.0
60.0
70.0
80.0
90.0
100.0
110.0
Age Group
Ind
ex (
US
=100)
Connecticut.......
Cary S. Shaw & Associates, LLC
Compare with Alaska by Age Mortality (Prob of Dying Next Yr) As % of U.S. Rate
Under 1
1-4
5-14
15-24
25-34
35-44
45-54
55-64
65-7475-84
85 years
Under 1
1-4
5-14 15-24
25-34 35-4445-54 55-64
65-7475-84
85 years
0.0
20.0
40.0
60.0
80.0
100.0
120.0
140.0
160.0
180.0
200.0
Age Group
Ind
ex (
US
=1
00
)
Alaska............
Connecticut.......
Cary S. Shaw & Associates, LLC
Compare Observed with Age-Adjusted Mortality Rates by State as Percent of U.S.
40.0
60.0
80.0
100.0
120.0
140.0
160.0
180.0
50.0 60.0 70.0 80.0 90.0 100.0 110.0 120.0 130.0 140.0 150.0
Reported Mortality Index
Ag
e A
dju
ste
d M
ort
ality
In
dex
Series1
Linear (Series1)
Alaska
CT
Cary S. Shaw & Associates, LLC
Compare Rankings State Ranks Compared
0
13
26
39
52
0 13 26 39 52
Reported State Rank
Ad
jus
ted
Sta
te R
an
k
RankAdjusted
Linear (RankAdjusted)
Alaska
CT
Cary S. Shaw & Associates, LLC
For full table identifying mortality rate in each state,
contact Cary S. Shaw & Associates, LLC
Cary S. Shaw & Associates, LLC
Riddles
Product Usage a predictor of retention
Optimum student loan amount
Effect of sales channel on retention
Longevity dependency on location
Cary S. Shaw & Associates, LLC
Lessons
Separate cause and effect by chronology.
Find the initial populations under study.
Distinguish between missing and zero values.
Search for significant lurking variables.
Creatively test your own ideas.
Cary S. Shaw & Associates, LLC
Why Sample?
Examine large databases
Rich on detail, economical on resources
Check own analytics
Cross check IT
Cross check vendors
May enable GREATER accuracy because
Enables correction of representative points
Enables understanding of representative points
Cary S. Shaw & Associates, LLC
Which sample is most
representative?
Beginning Middle
Cary S. Shaw & Associates, LLC
Which sample is most
representative?
Beginning Middle Random
Cary S. Shaw & Associates, LLC
Which sample is most
representative?
Beginning Middle Random Stratified Random
Cary S. Shaw & Associates, LLC
Stratified Random is making sure
the sample is representative.
Method A:
Divide data into strata (categories, deciles, etc.)
Pick target number from each strata
Method B:
Sort file by factors known to be important (such as
income, age, even ID number)
Pick every n th record, starting with record n/2
Cary S. Shaw & Associates, LLC
What happens when you match
samples?
How big is the resulting sample?
Year 1 Year 2 Year 3
10% 10% 10% = 0.1%
Customer Product Sales$ Supplies A/R
10% 10% . 10% 10% 10%
= 0.001%
Cary S. Shaw & Associates, LLC
There’s a way to take samples that
match even as the data is changing
(don’t lose by inner joins)
Cary S. Shaw & Associates, LLC
Examples of Record IDs
Social Security Number
Credit Card Number
Account ID
Develop algorithm to transform Record ID into
Random Number.
Then select from each file based on Random Number
range.
Samples from different files will then match.
Cary S. Shaw & Associates, LLC
Example: Duns File Update
How did NAICS code change?
Jan NAICS = April NAICS
Jan NAICS <> April NAICS
No January Record
No April Record
Cary S. Shaw & Associates, LLC
Example: Duns File Update
How did NAICS code change?
Jan NAICS = April NAICS 1006
Jan NAICS <> April NAICS 0
No January Record 0
No April Record 0
Cary S. Shaw & Associates, LLC
Example: Duns File Update
How did NAICS code change?
Jan NAICS = April NAICS 829
Jan NAICS <> April NAICS 107
No January Record 64
No April Record 70
Cary S. Shaw & Associates, LLC
We covered:
Qualify
Sniff, Define, Frequency Check
Interpret Lurk, Miss, Population
Sample
Methods, Application
Summary Tools, Lessons
Cary S. Shaw & Associates, LLC
Lessons
Perform the sniff test.
Check number of records.
Do frequency counts (highest, lowest).
Look for missing values (distinguish from zeros and
duplicates).
Investigate lurking interactive variables.
Analyze what happened to original populations, not
final populations.
Examine representative samples (stratified preferred).
Assume errors and look for them. (Apply G.I.V.O.).
Cary S. Shaw & Associates, LLC
- FIN -
Avoid the pitfalls of found
data and reach for the Stars
For further information contact:
Cary S. Shaw & Associates, LLC
(203)505-3180