file 498 doc 17 02dm datapreprocessing 2


Upload: mupa

Post on 07-Nov-2014

Category:

Technology



Page 1: File 498 Doc 17 02dm Datapreprocessing 2

Assistant Professor จิรัฎฐา ภูบุญอบ ([email protected], 08-9275-9797)

DATA PREPROCESSING


1. WHY DO WE NEED TO PREPROCESS THE DATA?

- Fields that are obsolete or redundant
- Missing values
- Outliers
- Data in a form not suitable for data mining models
- Values not consistent with policy or common sense


2. DATA CLEANING

Can You Find Any Problems in This Tiny Data Set?

Cust ID  Zip     Gender   Income    Age  Marital Status  Transaction Amount
1001     10048   M        75000     C    M               5000
1002     J2S7K7  F        -40000    40   W               4000
1003     90210   (blank)  10000000  45   S               7000
1004     6269    M        50000     0    S               1000
1005     55101   F        99999     30   D               3000
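One way to make the question concrete is to run a few sanity checks over the five records. The sketch below is only an illustration, not a method taken from the slides: the `find_problems` helper, the `INCOME_CEILING` threshold, and the individual rules (five-digit ZIP codes, non-negative income, 99999 treated as a suspected placeholder, positive numeric age) are all assumptions chosen for this example.

```python
# Hypothetical threshold above which an income is flagged for review.
INCOME_CEILING = 1_000_000

# The five records from the slide; None marks the blank Gender cell.
ROWS = [
    {"cust_id": 1001, "zip": "10048", "gender": "M", "income": 75000, "age": "C"},
    {"cust_id": 1002, "zip": "J2S7K7", "gender": "F", "income": -40000, "age": "40"},
    {"cust_id": 1003, "zip": "90210", "gender": None, "income": 10000000, "age": "45"},
    {"cust_id": 1004, "zip": "6269", "gender": "M", "income": 50000, "age": "0"},
    {"cust_id": 1005, "zip": "55101", "gender": "F", "income": 99999, "age": "30"},
]


def find_problems(row):
    """Return a list of issues found in one record (illustrative rules only)."""
    issues = []
    if not (row["zip"].isdigit() and len(row["zip"]) == 5):
        issues.append("zip is not a five-digit code")
    if row["gender"] is None:
        issues.append("gender is missing")
    if row["income"] < 0:
        issues.append("income is negative")
    elif row["income"] == 99999:
        issues.append("income may be a placeholder code")
    elif row["income"] > INCOME_CEILING:
        issues.append("income is a possible outlier")
    if not row["age"].isdigit():
        issues.append("age is not numeric")
    elif int(row["age"]) == 0:
        issues.append("age is zero")
    return issues


for row in ROWS:
    print(row["cust_id"], find_problems(row))
```

A flagged value is not necessarily wrong. J2S7K7, for instance, has the form of a Canadian postal code, so the right response may be to accept it rather than correct it.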


3. The three main types of problem data

- Errors
  - A recording error
  - A typing error
  - A transcription error
  - An inversion
  - A repetition
  - A deliberate error
- Outliers
- Missing observations


4. HANDLING MISSING DATA

Some of our field values are missing!


4. HANDLING MISSING DATA (cont’d)

Insightful Miner offers a choice of replacement values for missing data:

- Replace the missing value with some constant, specified by the analyst.
- Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).
- Replace the missing value with a value generated at random from the observed distribution of the variable.
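The three replacement strategies can be sketched in plain Python. This is not Insightful Miner's implementation: the function names are hypothetical, missing values are assumed to be represented by `None`, and the "random draw" is assumed to mean sampling with replacement from the values actually observed in the field.

```python
import random
from statistics import mean, mode


def fill_constant(values, constant):
    """Replace each missing value (None) with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]


def fill_mean_or_mode(values):
    """Use the mean for numeric fields and the mode for categorical fields."""
    observed = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def fill_random(values, rng=None):
    """Replace each missing value with a random draw from the observed values."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


# Hypothetical fields, used only to exercise the functions.
mpg = [14.0, None, 18.0, 22.0]
origin = ["US", None, "US", "Japan"]

print(fill_constant(mpg, 0.0))
print(fill_mean_or_mode(mpg))
print(fill_mean_or_mode(origin))
print(fill_random(mpg, random.Random(0)))
```

Mean or mode replacement tends to shrink the spread of the field, whereas random draws keep the spread closer to the observed one at the cost of adding noise.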


4. HANDLING MISSING DATA (cont’d)

Replacing missing field values with user-defined constants


4. HANDLING MISSING DATA (cont’d)

Replacing missing field values with means or modes


4. HANDLING MISSING DATA (cont’d)

Replacing missing field values with random draws from the distribution of the variable


5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS

Histogram of vehicle weights: can you find the outlier?
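The idea behind spotting an outlier in a histogram can be shown with a small text-based version. The weights below are invented for illustration and are not the data behind the slide; `bin_counts` and the 500-pound bin width are likewise assumptions of this sketch.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter((v // bin_width) * bin_width for v in values)


# Hypothetical vehicle weights in pounds, with one suspicious entry (192).
weights = [2130, 2264, 2542, 2795, 2930, 3102, 3278,
           3425, 3613, 3870, 4054, 4312, 192]

bin_width = 500
counts = bin_counts(weights, bin_width)

# Print every bin between the smallest and largest, including empty ones,
# so that an isolated bar far from the rest stands out.
for lower in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"{lower:>5}-{lower + bin_width:<5} {'#' * counts[lower]}")
```

The single bar in the lowest bin, separated from the main body by several empty bins, is the kind of pattern the histogram is meant to reveal.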


5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS

Scatter plot of mpg against weightlbs shows two outliers


6. DATA TRANSFORMATION

Histogram of time-to-60, with summary statistics


6. DATA TRANSFORMATION (cont’d)

Min–Max Normalization

$$X^* = \frac{X - \min(X)}{\max(X) - \min(X)}$$

For the field minimum (8 seconds):

$$X^* = \frac{8 - 8}{25 - 8} = 0$$

For the variable average (15.548 seconds):

$$X^* = \frac{15.548 - 8}{25 - 8} = 0.444$$

For the variable maximum (25 seconds):

$$X^* = \frac{25 - 8}{25 - 8} = 1.0$$

Min–max normalization values will range from zero to one.
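The three worked values for time-to-60 can be reproduced with a few lines of Python. The function name `min_max_normalize` is an assumption of this sketch; the minimum (8), average (15.548), and maximum (25) are the figures quoted on the slide.

```python
def min_max_normalize(x, x_min, x_max):
    """Rescale x to [0, 1] given the field minimum and maximum."""
    if x_max == x_min:
        raise ValueError("min and max are equal; normalization is undefined")
    return (x - x_min) / (x_max - x_min)


# Time-to-60 summary values from the slide, in seconds.
for x in (8, 15.548, 25):
    print(x, round(min_max_normalize(x, 8, 25), 3))
```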


6. DATA TRANSFORMATION (cont’d)

Z-Score Standardization

$$X^* = \frac{X - \operatorname{mean}(X)}{\operatorname{SD}(X)}$$

For the field minimum (8 seconds):

$$X^* = \frac{8 - 15.548}{2.911} = -2.593$$

For the variable average (15.548 seconds):

$$X^* = \frac{15.548 - 15.548}{2.911} = 0$$

For the variable maximum (25 seconds):

$$X^* = \frac{25 - 15.548}{2.911} = 3.247$$

Z-score standardization values will usually range between –4 and 4.
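The same check works for Z-score standardization. The function name `z_score` is an assumption of this sketch; the mean (15.548) and standard deviation (2.911) are the figures used on the slide.

```python
def z_score(x, mean, sd):
    """Express x as a number of standard deviations from the mean."""
    if sd == 0:
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (x - mean) / sd


# Time-to-60 summary values from the slide, in seconds.
for x in (8, 15.548, 25):
    print(x, round(z_score(x, 15.548, 2.911), 3))
```

Unlike min–max normalization, the result is not confined to a fixed interval: the maximum here maps to 3.247 rather than to 1.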


6. DATA TRANSFORMATION (cont’d)

Histogram of time-to-60 after Z-score standardization