Ch2 Data Preprocessing part3
Dr. Bernard Chen Ph.D., University of Central Arkansas, Fall 2009

TRANSCRIPT

Page 1:

Ch2 Data Preprocessing part3

Dr. Bernard Chen Ph.D., University of Central Arkansas

Fall 2009

Page 2:

Knowledge Discovery (KDD) Process

Data mining: core of the knowledge discovery process

[Figure: KDD process pipeline — Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Page 3:

Forms of Data Preprocessing

Page 4:

Data Transformation

Data transformation – the data are transformed or consolidated into forms appropriate for mining

Page 5:

Data Transformation

Data transformation can involve the following:

Smoothing: remove noise from the data (techniques include binning, regression, and clustering)

Aggregation

Generalization

Normalization

Attribute construction

Page 6:

Normalization

Min-max normalization

Z-score normalization

Normalization by decimal scaling

Page 7:

Min-max normalization

Min-max normalization: maps a value v of attribute A onto the new range [new_min_A, new_max_A]:

v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
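The formula above can be sketched in a few lines of Python (a minimal illustration, using the slide's income example):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from the observed range [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the slide: $73,600 in [$12,000, $98,000] -> [0.0, 1.0]
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```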

Page 8:

Z-score normalization

Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

v' = (v − μ_A) / σ_A

Ex. Let μ_A = 54,000, σ_A = 16,000. Then

(73,600 − 54,000) / 16,000 = 1.225
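The same computation as a small Python sketch, reusing the slide's numbers:

```python
def z_score_normalize(v, mean_a, std_a):
    """Standardize v using the attribute's mean and standard deviation."""
    return (v - mean_a) / std_a

# Slide example: mean = 54,000, std = 16,000
print(round(z_score_normalize(73600, 54000, 16000), 3))  # 1.225
```

Unlike min-max normalization, this does not require knowing the attribute's minimum and maximum, which helps when outliers dominate the range.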

Page 9:

Decimal normalization

Normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

Suppose the recorded values of A range from −986 to 917. The maximum absolute value is 986, so j = 3; each value is divided by 1,000 (e.g., −986 becomes −0.986).
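A minimal sketch of decimal scaling in Python, searching for the smallest j as defined above:

```python
def decimal_scaling(values):
    """Normalize by v' = v / 10^j, where j is the smallest integer
    such that max(|v'|) < 1. Returns j and the scaled values."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

# Slide example: values range from -986 to 917
j, scaled = decimal_scaling([-986, 917])
print(j)       # 3
print(scaled)  # [-0.986, 0.917]
```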

Page 10:

Data Reduction

Why data reduction?

A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to run on the complete data set

Page 11:

Data Reduction

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

Page 12:

Data Reduction

Data reduction strategies:

Data cube aggregation

Attribute subset selection

Dimensionality reduction (e.g., remove unimportant attributes)

Numerosity reduction (e.g., fit data into models)

Discretization and concept hierarchy generation

Page 13:

Data cube aggregation

Page 14:

Data cube aggregation

Multiple levels of aggregation in data cubes further reduce the size of the data to deal with

Reference the appropriate level: use the smallest representation that is enough to solve the task
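The idea of rolling a cube up to a coarser level can be sketched with a toy fact table (the data below is illustrative, not from the slides):

```python
from collections import defaultdict

# Toy fact table: (year, quarter, sales). Rolling up from quarterly to
# yearly totals replaces several tuples per year with a single one.
quarterly = [(2008, "Q1", 224), (2008, "Q2", 408),
             (2009, "Q1", 350), (2009, "Q2", 300)]

yearly = defaultdict(int)
for year, _quarter, sales in quarterly:
    yearly[year] += sales  # aggregate away the quarter dimension

print(dict(yearly))  # {2008: 632, 2009: 650}
```

A query about annual sales can then run against the smaller, pre-aggregated representation instead of the full detail data.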

Page 15:

Attribute subset selection / Dimensionality reduction

Feature selection (i.e., attribute subset selection):

Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features

Reduces the number of patterns, making them easier to understand

Page 16:

Attribute subset selection / Dimensionality reduction

Heuristic methods (due to the exponential number of choices):

Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward elimination

Decision-tree induction
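Step-wise forward selection can be sketched generically: start from the empty set and greedily add the attribute that most improves some subset-quality score. The score below is a hypothetical stand-in (simple attribute weights) for a real measure such as how well the subset preserves the class distribution:

```python
def forward_selection(attributes, score, k):
    """Greedy step-wise forward selection: repeatedly add the attribute
    that most improves `score` (a function of an attribute subset),
    stopping after k attributes."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical weights pretending A4, A1, A6 are the informative attributes,
# matching the reduced set on the next slide.
weights = {"A1": 2, "A2": 0, "A3": 0, "A4": 3, "A5": 0, "A6": 1}
score = lambda subset: sum(weights[a] for a in subset)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score, 3))
# ['A4', 'A1', 'A6']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.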

Page 17:

Attribute subset selection / Dimensionality reduction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

              A4?
            /      \
          A1?      A6?
         /   \    /    \
  Class 1 Class 2 Class 1 Class 2

=> Reduced attribute set: {A1, A4, A6}

Page 18:

Numerosity reduction

Reduce data volume by choosing alternative, smaller forms of data representation

Major families: histograms, clustering, sampling

Page 19:

Data Reduction Method: Histograms

[Histogram: counts (0–40) of values in buckets ranging from 10,000 to 100,000]

Page 20:

Data Reduction Method: Histograms

Divide data into buckets and store the average (or sum) for each bucket

Partitioning rules:

Equal-width: equal bucket range

Equal-frequency (or equal-depth): equal number of values per bucket

V-optimal: the histogram with the least variance (histogram variance is a weighted sum of the original values that each bucket represents)

MaxDiff: set bucket boundaries between pairs of adjacent values having the β−1 largest differences
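Equal-width partitioning can be sketched as follows (the prices below are illustrative); equal-frequency would instead sort the values and cut them into buckets holding the same number of values:

```python
def equal_width_buckets(values, n_buckets):
    """Partition values into n_buckets of equal range; store only
    (lower_bound, upper_bound, count) per bucket, not the raw values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[lo + i * width, lo + (i + 1) * width, 0]
               for i in range(n_buckets)]
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp max value
        buckets[i][2] += 1
    return buckets

prices = [10000, 12000, 15000, 30000, 31000, 55000, 90000, 99000]
for lo_b, hi_b, count in equal_width_buckets(prices, 3):
    print(f"[{lo_b:.0f}, {hi_b:.0f}]: {count}")
```

The reduced representation (three boundary/count triples) stands in for the eight original values.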

Page 21:

Data Reduction Method: Clustering

Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth in Chapter 7

Page 22:

Data Reduction Method: Sampling

Sampling: obtaining a small sample s to represent the whole data set N

Simple random sample without replacement

Simple random sample with replacement

Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, then a simple random sample of s clusters can be obtained, where s < M

Stratified sample
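The two simple random sampling variants can be sketched with Python's standard library (the data set and sample size are illustrative):

```python
import random

random.seed(42)  # fixed seed so the demo is reproducible
data = list(range(1, 101))  # the "whole data set", N = 100 tuples

# SRSWOR: simple random sample without replacement -> all tuples distinct
srswor = random.sample(data, 10)

# SRSWR: simple random sample with replacement -> duplicates are possible
srswr = [random.choice(data) for _ in range(10)]

print(len(srswor), len(set(srswor)))  # 10 10 (no repeats by construction)
print(len(srswr))                     # 10 (repeats allowed)
```

A stratified sample would instead split `data` into groups (strata) first and draw a simple random sample from each, preserving the proportions of skewed groups.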

Page 23:

Sampling: with or without Replacement

[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

Page 24:

Sampling: Cluster or Stratified Sampling

[Figure: raw data vs. the cluster/stratified sample drawn from it]