file 498 doc 17 02dm datapreprocessing 2

Post on 07-Nov-2014

1.054 Views

Category:

Technology

7 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

1

ผู้��ช่�วยศาสตราจารย�จ�ร�ฎฐา ภู�บุ�ญอบุ (jiratta.p@msu.ac.th,

08-9275-9797)

DATA DATA PREPROCEPREPROCESSINGSSING

2

1. WHY DO WE NEED TO PREPROCESS THE DATA?

Fields that are obsolete or redundantMissing valuesOutliersData in a form not suitable for data mining modelsValues not consistent with policy or common sense

3

2. DATA CLEANING Can You Find Any Problems in This Tiny Data Set?

Cust ID Zip

Gender

Income

Age

Marital

Status

Transaction

Amount

1001

10048

M 75000

C M 5000

1002

J2S7K7

F -40000

40 W 4000

1003

90210

1000000

0

45 S 7000

1004

6269

M 50000

0 S 1000

1005

55101

F 99999

30 D 3000

4

3. The three main types of problem data Errors

A recording errorA typing errorA transcription errorAn inversionA repetitionA deliberate error

OutliersMissing observations

5

4. HANDLING MISSING DATA

Some of our field values are missing!

6

4. HANDLING MISSING DATA (cont’d)Insightful Miner offers a choice of

replacement values for missing data:

Replace the missing value with some constant, specified by the analyst.Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).Replace the missing values with a value generated at random from the variable distribution observed.

7

4. HANDLING MISSING DATA (cont’d) Replacing missing field values with user-defined constants

8

4. HANDLING MISSING DATA (cont’d) Replacing missing field values with means or modes

9

4. HANDLING MISSING DATA (cont’d) Replacing missing field values with random draws from the distribution of the variable

10

5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS

Histogram of vehicle weights: can you find the outlier?

11

5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS

Scatter plot of mpg against weightlbs shows two outliers

12

6. DATA TRANSFORMATION

Histogram of time-to-60, with summary statistics

13

6. DATA TRANSFORMATION (ต่�อ)

Min–Max Normalization

For the field minimum (8 seconds)

For the variable average (15.548 seconds)

For the variable maximum (25 seconds)

)min()max(

)min(*

XX

XXX

0825

88*

X

444.0825

8548.15*

X

0.1825

825*

X

min–max normalization

values will range from zero to one

14

6. DATA TRANSFORMATION (ต่�อ)

Z-Score Standardization

For the field minimum (8 seconds)

For the variable average (15.548 seconds)

For the variable maximum (25 seconds)

)(

)(*

XSD

XmeanXX

593.2911.2

548.158*

X

Z-score standardization values will usually range between –4

and 40

911.2

548.15548.15*

X

247.3911.2

548.1525*

X

15

6. DATA TRANSFORMATION (ต่�อ)Histogram of time-to-60 after Z-score standardization

top related