data preparation part 1: exploratory data analysis & data

49
1 Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS Predictive Modeling Seminar Louise Francis Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]

Upload: others

Post on 29-Oct-2021

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Preparation Part 1: Exploratory Data Analysis & Data

1

Data PreparationPart 1: Exploratory Data Analysis & Data Cleaning, Missing Data

CAS Predictive Modeling SeminarLouise Francis

Francis Analytics and Actuarial Data Mining, Inc.www.data-mines.com

[email protected]

Page 2: Data Preparation Part 1: Exploratory Data Analysis & Data

2

ObjectivesIntroduce data preparation and where it fits in in modeling processDiscuss Data QualityFocus on a key part of data preparation

Exploratory data analysisIdentify data glitches and errorsUnderstanding the dataIdentify possible transformations

What to do about missing dataProvide resources on data preparation

LF1

Page 3: Data Preparation Part 1: Exploratory Data Analysis & Data

Slide 2

LF1 Louise Francis, 9/29/2006

Page 4: Data Preparation Part 1: Exploratory Data Analysis & Data

3

CRISP-DM

Guidelines for data mining projectsGives overview of life cycle of data mining projectDefines different phases and activities that take place in phase

Page 5: Data Preparation Part 1: Exploratory Data Analysis & Data

4

Modelling Process

Modeling

Evaluation

Deployment

Data Preparation

Data Understanding

Business Understanding

00Complex Model Artifact

Page 6: Data Preparation Part 1: Exploratory Data Analysis & Data

5

Data Preprocessing

Page 7: Data Preparation Part 1: Exploratory Data Analysis & Data

6

Data Quality Problem

Page 8: Data Preparation Part 1: Exploratory Data Analysis & Data

7

Data Quality: A ProblemActuary reviewing a database

Page 9: Data Preparation Part 1: Exploratory Data Analysis & Data

8

May’s Law

Page 10: Data Preparation Part 1: Exploratory Data Analysis & Data

9

It’s Not Just Us

“In just about any organization, the state of information quality is at the same low level”

Olson, Data Quality

Page 11: Data Preparation Part 1: Exploratory Data Analysis & Data

10

Some Consequences of poor data quality

Affects quality (precision) of resultCan’t do modeling project because of data problemsIf errors not found – modeling blunder

Page 12: Data Preparation Part 1: Exploratory Data Analysis & Data

11

Data Exploration in Predictive Modeling

Page 13: Data Preparation Part 1: Exploratory Data Analysis & Data

12

Exploratory Data Analysis

Typically the first step in analyzing dataMakes heavy use of graphical techniquesAlso makes use of simple descriptive statisticsPurpose

Find outliers (and errors)Explore structure of the data

Page 14: Data Preparation Part 1: Exploratory Data Analysis & Data

13

Definition of EDA

Exploratory data analysis (EDA) is that part of statistical practiceconcerned with reviewing, communicating and using data where there is a low level of knowledge about its cause system.. Many EDAtechniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.

- www.wikipedia.org

Page 15: Data Preparation Part 1: Exploratory Data Analysis & Data

14

Example Data

Private passenger autoSome variables are:

AgeGenderMarital statusZip codeEarned premiumNumber of claimsIncurred lossesPaid losses

Page 16: Data Preparation Part 1: Exploratory Data Analysis & Data

15

Some Methods for Numeric Data

VisualHistogramsBox and Whisker PlotsStem and Leaf Plots

StatisticalDescriptive statisticsData spheres

Page 17: Data Preparation Part 1: Exploratory Data Analysis & Data

16

Histograms

Can do them in Microsoft Excel

Page 18: Data Preparation Part 1: Exploratory Data Analysis & Data

17

HistogramsFrequencies for Age Variable

Bin Frequency20 285325 370930 437235 436640 409745 358850 270755 183160 114065 61570 39775 27180 14885 8390 3295 12

More 5

Page 19: Data Preparation Part 1: Exploratory Data Analysis & Data

18

Histograms of Age VariableVarying Window Size

16 33 51 68 85Age

0

20

40

60

80

16 23 30 37 44 51 57 64 71 78 85

Age

0

5

10

15

20

16 24 31 39 47 54 62 70 77 85 Age0

10

20

30

40

Page 20: Data Preparation Part 1: Exploratory Data Analysis & Data

19

Formula for Window Width

13

3.5

standard deviationN=sample sizeh =window width

hN

σ

σ

=

=

Page 21: Data Preparation Part 1: Exploratory Data Analysis & Data

20

Example of Suspicious Value

600 900 1200 1500 1800

License Year

0

5,000

10,000

15,000

20,000

25,000Fr

eque

ncy

Page 22: Data Preparation Part 1: Exploratory Data Analysis & Data

21

Discrete-Numeric Data

0 50,000 100,000Paid Losses

0

10,000

20,000

30,000

40,000

Freq

uenc

y

Page 23: Data Preparation Part 1: Exploratory Data Analysis & Data

22

Filtered DataFilter out Unwanted Records

0 20,000 40,000 60,000 80,000 100,000Paid Losses

0

200

400

600

800

1,000

1,200

Freq

uenc

y

Page 24: Data Preparation Part 1: Exploratory Data Analysis & Data

23

Box Plot Basics:Five – Point Summary

Minimum1st quartileMedian2nd quartileMaximum

Page 25: Data Preparation Part 1: Exploratory Data Analysis & Data

24

Functions for five point summary

=min(data range)=quartile(data range1)=median(data range)=quartile(data range,3)=max(data range)

Page 26: Data Preparation Part 1: Exploratory Data Analysis & Data

25

Box and Whisker Plot

-20

0

20

40

60

80

Nor

mal

ly D

istri

bute

d A

ge

+1.5*midspread

median

Outliers

75th Percentile

25th Percentile

Page 27: Data Preparation Part 1: Exploratory Data Analysis & Data

26

Plot of Heavy Tailed DataPaid Losses

-10000

10000

30000

50000

70000

90000

110000

Paid

Los

s

Page 28: Data Preparation Part 1: Exploratory Data Analysis & Data

27

Heavy Tailed Data – Log Scale

101

102

103

104

105

106

107Pa

id L

oss

Page 29: Data Preparation Part 1: Exploratory Data Analysis & Data

28

Box and Whisker Example

16 1820 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 96

Age

0

200

400

600

800

1,000

Freq

uenc

y

Page 30: Data Preparation Part 1: Exploratory Data Analysis & Data

29

Descriptive StatisticsAnalysis ToolPak

Statistic Policyholder Age

Mean 36.9Standard Error 0.1Median 35.0Mode 32.0Standard Deviation 13.2Sample Variance 174.4Kurtosis 0.5Skewness 0.7Range 84Minimum 16Maximum 100Sum 1114357Count 30226Largest(2) 100Smallest(2) 16

Page 31: Data Preparation Part 1: Exploratory Data Analysis & Data

30

Descriptive Statistics

Claimant age has minimum and maximums that are impossible

N Minimum Maximum Mean Std.

Deviation

License Year 30,250 490 2,049 1,990 16.3

Valid N 30,250

Page 32: Data Preparation Part 1: Exploratory Data Analysis & Data

31

Data Spheres: The Mahalanobis Distance Statistic

1( ) ' ( ) is a vector of variables is a vector of means is a variance-covariance matrix

−= − −MD x μ Σ x μxμΣ

Page 33: Data Preparation Part 1: Exploratory Data Analysis & Data

32

Screening Many Variables at Once

Plot of Longitude and Latitude of zip codes in dataExamination of outliers indicated drivers in Ca and PR even though policies only in one mid-Atlantic state

1( ) ' ( )−= − −MD x μ Σ x μ

Page 34: Data Preparation Part 1: Exploratory Data Analysis & Data

33

Records With Unusual Values Flagged

Policy ID Mahalanobis

Depth Percentile of Mahalanobis Age

License Year

Number of Cars

Number of Drivers

Model Year

Incurred Loss

22244 59 100 27 1997 3 6 1994 4,4566159 60 100 22 2001 2 6 1993 0

22997 65 100 NA NA 2 1 1954 05412 61 100 17 2003 3 6 1994 0

30577 72 100 43 1979 3 1 1952 028319 8,490 100 30 490 1 1 1987 027815 55 100 44 1976 -1 0 1959 016158 24 100 82 1938 1 1 1989 61,1874908 25 100 56 1997 4 4 2003 35,697

28790 24 100 82 2039 1 1 1985 27,769

Page 35: Data Preparation Part 1: Exploratory Data Analysis & Data

34

Categorical Data: Data Cubes

Page 36: Data Preparation Part 1: Exploratory Data Analysis & Data

35

Categorical Data

Data CubesUsually frequency tablesSearch for missing values coded as blanks

Gender

Frequency Percent 5,054 14.3F 13,032 36.9M 17,198 48.7Total 35,284 100

Page 37: Data Preparation Part 1: Exploratory Data Analysis & Data

36

Categorical Data

Table highlights inconsistent coding of marital status

Marital Status

Frequency Percent 5,053 14.31 2,043 5.82 9,657 27.44 2 0D 4 0M 2,971 8.4S 15,554 44.1Total 35,284 100

Page 38: Data Preparation Part 1: Exploratory Data Analysis & Data

37

Missing Data

Page 39: Data Preparation Part 1: Exploratory Data Analysis & Data

38

Screening for Missing Data

BUSINESS

TYPE Gender Age License YearValid 35,284 35,284 30,242 30,250

N Missing 0 0 5,042 5,034

25 27.00 1,986.00

50 35.00 1,996.00Percentiles

75 45.00 2,000.00

Page 40: Data Preparation Part 1: Exploratory Data Analysis & Data

39

Blanks as Missing

Frequency Percent Valid PercentCumulative

Percent

5,054 14.3 14.3 14.3

F 13,032 36.9 36.9 51.3

M 17,198 48.7 48.7 100.0Valid

Total 35,284 100.0 100.0

Page 41: Data Preparation Part 1: Exploratory Data Analysis & Data

40

Types of Missing Values

Missing completely at randomMissing at randomInformative missing

Page 42: Data Preparation Part 1: Exploratory Data Analysis & Data

41

Methods for Missing Values

Drop record if any variable used in model is missingDrop variableData ImputationOther

CART, MARS use surrogate variablesExpectation Maximization

Page 43: Data Preparation Part 1: Exploratory Data Analysis & Data

42

Imputation

A method to “fill in” missing valueUse other variables (which have values) to predict value on missing variableInvolves building a model for variable with missing value

Y = f(x1,x2,…xn)

Page 44: Data Preparation Part 1: Exploratory Data Analysis & Data

43

Example: Age Variable

About 14% of records missing valuesImputation will be illustrated with simple regression model

Age = a+b1X1+b2X2…bnXn

Page 45: Data Preparation Part 1: Exploratory Data Analysis & Data

44

Model for Age

Tests of Between-Subjects Effects Dependent Variable: Age

Type III Sum of Squares df Mean Square F Sig. Source Corrected Model 3,218,216 24 134,092 1,971.2 0.000 Intercept 9,255 1 9,255 136.0 0.000 ClassCode 3,198,903 18 177,717 2,612.4 0.000 CoverageType 876 3 292 4.3 0.005 ModelYear 7,245 1 7,245 106.5 0.000 No of Vehicles 2,365 1 2,365 34.8 0.000 No of drivers 3,261 1 3,261 47.9 0.000 Error 2,055,243 30,212 68 Total 46,377,824 30,237 Corrected Total 5,273,459 30,236

Page 46: Data Preparation Part 1: Exploratory Data Analysis & Data

45

Missing Values

A problem for many traditional statistical models

Elimination of records missing on anything from analysis

Many data mining procedures have techniques built in for handling missing valuesIf too many records missing on a given variable, probably need to discard variable

Page 47: Data Preparation Part 1: Exploratory Data Analysis & Data

46

Metadata

Page 48: Data Preparation Part 1: Exploratory Data Analysis & Data

47

Metadata

Data about dataA reference that can be used in future modeling projects

Detailed description of the variables in the file, their meaning and permissible values

Marital Status Value Description 1 Married, data from source 1 2 Single, data from source 1 4 Divorced, data from source 1D Divorced, data from source 2M Married, data from source 2 S Single, data from source 2 Blank Marital status is missing

Page 49: Data Preparation Part 1: Exploratory Data Analysis & Data

48

Library for Getting Started

Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003Francis, L.A., “Dancing with Dirty Data: Methods for Exploring and Claeaning Data”, CAS Winter Forum, March 2005, www.casact.orgFind a comprehensive book for doing analysis in Excel such as: John Walkebach, Excel 2003 Formulas or Jospeh Schmuller, Statistical Analysis With Excel for DummiesIf you use R, get a book like: Fox, John, An R and S-PLUS Companion to Applied Regression, Sage Publications, 2002