how to recognize and fix data faults that can make or...

44
Cary S. Shaw & Associates, LLC How to Recognize and Fix Data Faults That Can Make or Break Analytics Data Interpretation Sampling By Cary S. Shaw & Associates, LLC For further information contact (203)505-3180 [email protected] © Copyright Cary S. Shaw 2013 For dissemination, permission required from copyright holder.

Upload: others

Post on 25-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

How to Recognize

and Fix Data Faults

That Can Make or Break Analytics

Data

Interpretation

Sampling

By Cary S. Shaw & Associates, LLC

For further information contact (203)505-3180

[email protected]

© Copyright Cary S. Shaw 2013

For dissemination, permission required from copyright holder.

Page 2: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Get the Data Quality Right

Business depends on data.

Whether you are:

Requestor

Developer

Producer

Interpreter

Analytics Consultant

Analytics Recipient

Page 3: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

We will cover:

Qualify

Sniff, Define, Frequency Check

Interpret

Lurk, Miss, Population

Sample

Methods, Application

Summary

Tools, Lessons

Page 4: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

How many products are in company’s

data base?

Marketing client sent request to IT

Detail result provided in Excel.

Marketing client relayed file: Answer: 1,000.

Page 5: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

How many products are in company’s

data base?

Marketing client sent request to IT

Detail result provided in Excel.

Marketing client relayed file: Answer: 1,000.

What happened: SQL Navigator timed out.

Real answer: 3,646.

Page 6: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Spreadsheet data

received in this form.

Business Name Street City State Zip Phone

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx

Page 7: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Assume data is in error.

Search for errant values.

Frequency tables

Outliers - Confidence intervals

Outliers - One variable graphed/estimated against

another

Referential Integrity (cross table connections within an

integrated database)

Sniff test

Don’t dismiss data if it is different; only if it is wrong.

Page 8: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

How unique are our customer phone

numbers?

Are we really reaching the person / business we intend?

Are we alienating by multiple calls?

Are we missing multiple clients?

Practical question: How frequently does the same phone

number appear on our customer contact file?

Page 9: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Exploring Dividend Yields

Found company with 7.2 % dividend yield

What’s going on?

Page 10: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Could a ‘9’ or ‘0’

mean something else?

Page 11: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

How do we want to select business

vs. home prospects?

H = Home

B = Business

And how do we want to convert this to a numeric

factor?

Page 12: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Select B (business) or Reject H

(home)? Different impact.

Other values have impact

Is field up-to-date?

Is client (blank) different

from proprietary (blank)?

Is (blank) or misspelling a

proxy for other

characteristics?

How about Duns Industry

sector?

Business, Home Frequency

(blank) 98,947

B 5,032

b 159

Business 49

Busn 1

H 24,891

h 475

Home 63

Hume 1

Hme 11

Page 13: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Riddles

Product Usage a predictor of retention

Optimum student loan amount

Effect of sales channel on retention

Longevity dependency on location

Page 14: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Fitting a line: effect of product usage on retention

Large database

Page 15: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Effect of Product Usage (Line fitted to

data points) Effect of Usage on Cancel Rate??

0 100 200 300 400 500

Meter Usage

Can

cel R

ate

Cancel Rate

Linear (Cancel Rate)

Cancel Rate = f(Usage) = constant

Page 16: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Is ‘zero’ really ‘missing’? Impact

on bining and fitting. Effect of Usage on Cancel Rate??

0 100 200 300 400 500

Meter Usage

Can

cel R

ate

Cancel Rate

Linear (Cancel Rate)

Page 17: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Student loans and loan delinquency

Page 18: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Do small loans result in high

delinquencies? Dismissal factor? Effect of Loan Amount on Delinquency

0.0000%

1.0000%

2.0000%

3.0000%

4.0000%

5.0000%

6.0000%

7.0000%

2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000 24000 26000

Loan Amount

De

lin

qu

en

cy

Rate

.

Loans in Collection

Page 19: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

1 out of 10 in the original

population

Page 20: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Looks like 1 out of 4 in the

remaining population

Page 21: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Watch the whole population of interest

(affects the denominator) Effect of Loan Amount on Delinquency

0.0000%

1.0000%

2.0000%

3.0000%

4.0000%

5.0000%

6.0000%

7.0000%

2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000 24000 26000

Loan Amount

De

lin

qu

en

cy

Rate

.

Loans in Collection

Original Loans in Collection

Page 22: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Where you live can affect your life span.

Connecticut ?

Page 23: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Where is Mortality Lowest? Mortality (Prob of Dying Next Yr) As % of U.S. Rate

Alaska............, 57.0

Connecticut......., 102.5

50.0

60.0

70.0

80.0

90.0

100.0

110.0

Reported Mortality Index

Ind

ex (

US

= 1

00)

Alaska............

Connecticut.......

Page 24: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Is Connecticut Mortality Above Avg? Connecticut Mortality (Prob of Dying Next Yr) As % of U.S. Rate

Reported Total

Under 1

1-4

5-14 15-24

25-34

35-44

45-54

55-64

65-74

75-84

85 years

50.0

60.0

70.0

80.0

90.0

100.0

110.0

Age Group

Ind

ex (

US

=100)

Connecticut.......

Page 25: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Compare with Alaska by Age Mortality (Prob of Dying Next Yr) As % of U.S. Rate

Under 1

1-4

5-14

15-24

25-34

35-44

45-54

55-64

65-7475-84

85 years

Under 1

1-4

5-14 15-24

25-34 35-4445-54 55-64

65-7475-84

85 years

0.0

20.0

40.0

60.0

80.0

100.0

120.0

140.0

160.0

180.0

200.0

Age Group

Ind

ex (

US

=1

00

)

Alaska............

Connecticut.......

Page 26: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Compare Observed with Age-Adjusted Mortality Rates by State as Percent of U.S.

40.0

60.0

80.0

100.0

120.0

140.0

160.0

180.0

50.0 60.0 70.0 80.0 90.0 100.0 110.0 120.0 130.0 140.0 150.0

Reported Mortality Index

Ag

e A

dju

ste

d M

ort

ality

In

dex

Series1

Linear (Series1)

Alaska

CT

Page 27: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Compare Rankings State Ranks Compared

0

13

26

39

52

0 13 26 39 52

Reported State Rank

Ad

jus

ted

Sta

te R

an

k

RankAdjusted

Linear (RankAdjusted)

Alaska

CT

Page 28: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

For full table identifying mortality rate in each state,

contact Cary S. Shaw & Associates, LLC

Page 29: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Riddles

Product Usage a predictor of retention

Optimum student loan amount

Effect of sales channel on retention

Longevity dependency on location

Page 30: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Lessons

Separate cause and effect by chronology.

Find the initial populations under study.

Distinguish between missing and zero values.

Search for significant lurking variables.

Creatively test your own ideas.

Page 31: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Why Sample?

Examine large databases

Rich on detail, economical on resources

Check own analytics

Cross check IT

Cross check vendors

May enable GREATER accuracy because

Enables correction of representative points

Enables understanding of representative points

Page 32: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Which sample is most

representative?

Beginning Middle

Page 33: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Which sample is most

representative?

Beginning Middle Random

Page 34: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Which sample is most

representative?

Beginning Middle Random Stratified Random

Page 35: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Stratified Random is making sure

the sample is representative.

Method A:

Divide data into strata (categories, deciles, etc.)

Pick target number from each strata

Method B:

Sort file by factors known to be important (such as

income, age, even ID number)

Pick every n th record, starting with record n/2

Page 36: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

What happens when you match

samples?

How big is the resulting sample?

Year 1 Year 2 Year 3

10% 10% 10% = 0.1%

Customer Product Sales$ Supplies A/R

10% 10% . 10% 10% 10%

= 0.001%

Page 37: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

There’s a way to take samples that

match even as the data is changing

(don’t lose by inner joins)

Page 38: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Examples of Record IDs

Social Security Number

Credit Card Number

Account ID

Develop algorithm to transform Record ID into

Random Number.

Then select from each file based on Random Number

range.

Samples from different files will then match.

Page 39: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Example: Duns File Update

How did NAICS code change?

Jan NAICS = April NAICS

Jan NAICS <> April NAICS

No January Record

No April Record

Page 40: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Example: Duns File Update

How did NAICS code change?

Jan NAICS = April NAICS 1006

Jan NAICS <> April NAICS 0

No January Record 0

No April Record 0

Page 41: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Example: Duns File Update

How did NAICS code change?

Jan NAICS = April NAICS 829

Jan NAICS <> April NAICS 107

No January Record 64

No April Record 70

Page 42: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

We covered:

Qualify

Sniff, Define, Frequency Check

Interpret Lurk, Miss, Population

Sample

Methods, Application

Summary Tools, Lessons

Page 43: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

Lessons

Perform the sniff test.

Check number of records.

Do frequency counts (highest, lowest).

Look for missing values (distinguish from zeros and

duplicates).

Investigate lurking interactive variables.

Analyze what happened to original populations, not

final populations.

Examine representative samples (stratified preferred).

Assume errors and look for them. (Apply G.I.V.O.).

Page 44: How to Recognize and Fix Data Faults That Can Make or ...nymetro.chapter.informs.org/prac_cor_pubs/01-2013 Cary...Analytics Consultant Analytics Recipient Cary S. Shaw & Associates,

Cary S. Shaw & Associates, LLC

- FIN -

Avoid the pitfalls of found

data and reach for the Stars

For further information contact:

Cary S. Shaw & Associates, LLC

(203)505-3180

[email protected]