file 498 doc 17 02dm datapreprocessing 2
1
Assistant Professor จิรัฎฐา ภูบุญอบ ([email protected], 08-9275-9797)
DATA PREPROCESSING
2
1. WHY DO WE NEED TO PREPROCESS THE DATA?
Fields that are obsolete or redundant
Missing values
Outliers
Data in a form not suitable for data mining models
Values not consistent with policy or common sense
3
2. DATA CLEANING
Can You Find Any Problems in This Tiny Data Set?
Cust ID  Zip     Gender  Income    Age  Marital Status  Transaction Amount
1001     10048   M       75000     C    M               5000
1002     J2S7K7  F       -40000    40   W               4000
1003     90210           10000000  45   S               7000
1004     6269    M       50000     0    S               1000
1005     55101   F       99999     30   D               3000
4
3. The three main types of problem data
Errors
  A recording error
  A typing error
  A transcription error
  An inversion
  A repetition
  A deliberate error
Outliers
Missing observations
5
4. HANDLING MISSING DATA
Some of our field values are missing!
6
4. HANDLING MISSING DATA (cont’d)
Insightful Miner offers a choice of replacement values for missing data:
Replace the missing value with some constant, specified by the analyst.
Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).
Replace the missing values with a value generated at random from the variable distribution observed.
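The three replacement strategies can be sketched in plain Python. This is not how Insightful Miner implements them. The function names are hypothetical, missing values are assumed to be represented as `None`, and "a value generated at random from the variable distribution observed" is interpreted here as a random draw from the observed (non-missing) values.

```python
import random
from statistics import mean, mode


def fill_constant(values, constant):
    """Replace each missing value (None) with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]


def fill_mean_or_mode(values):
    """Replace missing values with the mean (numeric field) or the mode (categorical field)."""
    observed = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def fill_random(values, rng):
    """Replace each missing value with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


# Made-up illustrative fields, not data from the slides.
weights = [10, None, 20, 30]
colors = ["red", None, "red", "blue"]

print(fill_constant(weights, 0))       # constant chosen by the analyst
print(fill_mean_or_mode(weights))      # numeric field: mean of 10, 20, 30
print(fill_mean_or_mode(colors))       # categorical field: most frequent value
print(fill_random(weights, random.Random(0)))  # seeded for repeatability
```

Mean or mode replacement is simple but shrinks the spread of the field, while random draws preserve the spread at the cost of adding noise, so the choice depends on what the downstream model is sensitive to.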
7
4. HANDLING MISSING DATA (cont’d) Replacing missing field values with user-defined constants
8
4. HANDLING MISSING DATA (cont’d) Replacing missing field values with means or modes
9
4. HANDLING MISSING DATA (cont’d) Replacing missing field values with random draws from the distribution of the variable
10
5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Histogram of vehicle weights: can you find the outlier?
11
5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Scatter plot of mpg against weightlbs shows two outliers
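The idea behind the histogram check can be sketched without a plotting library: bin the values and look for a bin that is separated from the rest by empty bins. The `bin_counts` helper and the weights below are made up for illustration and are not the data behind the slides' plots.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical vehicle weights in pounds; 192 plays the role of the suspicious value.
weights_lbs = [192, 2100, 2300, 2500, 2700, 2900, 3100, 3300, 3500, 4200]

counts = bin_counts(weights_lbs, 8)
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```

In this made-up sample the first bin holds a single value and is followed by two empty bins, which is the kind of visual gap that makes an outlier stand out in a histogram.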
12
6. DATA TRANSFORMATION
Histogram of time-to-60, with summary statistics
13
6. DATA TRANSFORMATION (cont’d)
Min–Max Normalization
$$X^* = \frac{X - \min(X)}{\max(X) - \min(X)}$$
For the field minimum (8 seconds):
$$X^* = \frac{8 - 8}{25 - 8} = 0$$
For the variable average (15.548 seconds):
$$X^* = \frac{15.548 - 8}{25 - 8} = 0.444$$
For the variable maximum (25 seconds):
$$X^* = \frac{25 - 8}{25 - 8} = 1.0$$
Min–max normalization values will range from zero to one.
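The three worked values for time-to-60 can be reproduced with a short function. This is a minimal sketch: the function name is hypothetical, and the minimum (8), mean (15.548), and maximum (25) are the summary statistics quoted on the slide.

```python
def min_max_normalize(x, x_min, x_max):
    """Rescale x to the range [0, 1] using the field's minimum and maximum."""
    return (x - x_min) / (x_max - x_min)


# Summary statistics for time-to-60 taken from the slide (seconds).
T_MIN, T_MEAN, T_MAX = 8, 15.548, 25

for label, x in [("minimum", T_MIN), ("mean", T_MEAN), ("maximum", T_MAX)]:
    print(label, round(min_max_normalize(x, T_MIN, T_MAX), 3))
```

Note that the result depends entirely on the observed minimum and maximum, so a single extreme value compresses all other values into a narrow part of the zero-to-one range.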
14
6. DATA TRANSFORMATION (cont’d)
Z-Score Standardization
$$X^* = \frac{X - \operatorname{mean}(X)}{\operatorname{SD}(X)}$$
For the field minimum (8 seconds):
$$X^* = \frac{8 - 15.548}{2.911} = -2.593$$
For the variable average (15.548 seconds):
$$X^* = \frac{15.548 - 15.548}{2.911} = 0$$
For the variable maximum (25 seconds):
$$X^* = \frac{25 - 15.548}{2.911} = 3.247$$
Z-score standardization values will usually range between –4 and 4.
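The same three time-to-60 values can be standardized with a short function. This is a minimal sketch: the function name is hypothetical, and the mean (15.548) and standard deviation (2.911) are the summary statistics quoted on the slide rather than values recomputed from the data.

```python
def z_score(x, field_mean, field_sd):
    """Express x as the number of standard deviations above or below the mean."""
    return (x - field_mean) / field_sd


# Summary statistics for time-to-60 taken from the slide (seconds).
T_MEAN, T_SD = 15.548, 2.911

for label, x in [("minimum", 8), ("mean", T_MEAN), ("maximum", 25)]:
    print(label, round(z_score(x, T_MEAN, T_SD), 3))
```

Unlike min–max normalization, the standardized values are not bounded: the mean always maps to zero, and values far from the mean can fall outside the usual range.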
15
6. DATA TRANSFORMATION (cont’d)
Histogram of time-to-60 after Z-score standardization