Introduction to Data Mining
Ch. 2 Data Preprocessing

Heon Gyu Lee ([email protected])
DB/Bioinfo. Lab., http://dblab.chungbuk.ac.kr/~hglee
http://dblab.chungbuk.ac.kr
Chungbuk National University
Why Data Preprocessing?

Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=“ ”
– noisy: containing errors or outliers
  e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
  e.g., Age=“42”, Birthday=“03/07/1997”
  e.g., was rating “1, 2, 3”, now rating “A, B, C”
  e.g., discrepancy between duplicate records
What is Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature

A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
(Rows are objects; columns are attributes.)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes

There are different types of attributes
– Nominal
  Examples: ID numbers, eye color, zip codes
– Ordinal
  Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
  Examples: temperature, length, time, counts
Discrete and Continuous Attributes

Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes

Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables
Data Quality

What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
Noise

Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone, and “snow” on a television screen

[Figure: two sine waves; the same two sine waves plus noise]
Outliers

Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Missing Values

Reasons for missing values
– Information is not collected
  (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
  (e.g., annual income is not applicable to children)

Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
Duplicate Data

Data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources

Examples:
– Same person with multiple email addresses

Data cleaning
– Process of dealing with duplicate data issues
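Exact-duplicate handling can be as simple as grouping records on a normalized key. The sketch below keys on a lowercased, stripped name; the records and addresses are invented, and real entity resolution generally needs fuzzier matching than this.

```python
# Minimal sketch of duplicate removal: group records by a normalized key
# (lowercased name here, a hypothetical choice) and keep the first record
# seen for each key.
def dedupe(records, key=lambda r: r["name"].strip().lower()):
    seen = {}
    for r in records:
        seen.setdefault(key(r), r)   # first record wins per key
    return list(seen.values())

records = [
    {"name": "Ann Lee",  "email": "a1@example.com"},
    {"name": "ann lee ", "email": "a2@example.com"},  # same person, second address
    {"name": "Bo Kim",   "email": "b@example.com"},
]
unique = dedupe(records)
```

Keeping the first record per key is arbitrary; a real pipeline would instead merge the conflicting field values.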
Major Tasks in Data Preprocessing

Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
– Integration of multiple databases, data cubes, or files

Data transformation
– Normalization and aggregation

Data reduction
– Obtains a reduced representation in volume but produces the same or similar analytical results

Data discretization
– Part of data reduction, with particular importance for numerical data
Forms of Data Preprocessing

Importance
– “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
– “Data cleaning is the number one problem in data warehousing”—DCI survey

Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
Data Cleaning
Data Cleaning : How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually

Fill it in automatically with
– a global constant: e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based, such as a Bayesian formula or regression
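The “attribute mean” and the smarter “class-conditional mean” strategies can be sketched in a few lines; the rows below are hypothetical (income, class) pairs with None marking a missing value.

```python
# Two of the automatic fill-in strategies listed above, on made-up data:
# replace a missing value with the overall attribute mean, or with the
# mean over samples of the same class.
from statistics import mean

rows = [  # (income, class); None marks a missing value
    (30.0, "low"), (34.0, "low"), (None, "low"),
    (80.0, "high"), (84.0, "high"), (None, "high"),
]

known = [v for v, _ in rows if v is not None]
overall = mean(known)                         # global attribute mean

by_class = {}
for v, c in rows:
    if v is not None:
        by_class.setdefault(c, []).append(v)
class_mean = {c: mean(vs) for c, vs in by_class.items()}

filled_global = [(v if v is not None else overall, c) for v, c in rows]
filled_smart = [(v if v is not None else class_mean[c], c) for v, c in rows]
```

The class-conditional fill (32 for “low”, 82 for “high”) stays close to similar samples, while the global mean (57) lands between the two groups.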
Data Cleaning : How to Handle Noisy Data?

Binning
– first sort data and partition it into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.

Regression
– smooth by fitting the data to regression functions

Clustering
– detect and remove outliers

Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with possible outliers)
Data Cleaning : Binning Methods

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
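The worked binning example can be reproduced directly in code. This sketch assumes three equal-frequency bins and rounds bin means to whole dollars, matching the slide's numbers.

```python
# Equal-frequency binning with two smoothing strategies, on the slide's
# price data.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

def equal_frequency_bins(values, n_bins):
    # Assumes len(values) is divisible by n_bins, as in the example.
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

bins = equal_frequency_bins(prices, 3)

# Smooth by bin means (rounded, as on the slide)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smooth by bin boundaries: each value moves to the nearer boundary
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]
```

Smoothing by means flattens each bin to one value; smoothing by boundaries keeps the bin's extremes and snaps interior values to whichever extreme is closer.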
Data Cleaning : Regression

[Figure: scatter of (x, y) points with a fitted regression line y = x + 1; a noisy observation (X1, Y1) is smoothed to its fitted value Y1′]
Data Cleaning : Cluster Analysis
Data Integration

Data integration:
– Combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources

Entity identification problem:
– Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts
– For the same real-world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales
Data Integration : Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases
– Object identification: The same attribute or object may have different names in different databases
– Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant attributes may be detected by correlation analysis

Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies, and improve mining speed and quality
Data Integration : Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson’s product-moment coefficient):

  r_{A,B} = Σ(A - Ā)(B - B̄) / ((n-1) σ_A σ_B) = (Σ(AB) - n·Ā·B̄) / ((n-1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-products.

If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.

r_{A,B} = 0: A and B are uncorrelated (no linear relationship); r_{A,B} < 0: negatively correlated
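The coefficient can be computed in plain Python on made-up data. Note that the (n-1) factor in the formula cancels against the (n-1) inside the sample standard deviations, so it never appears explicitly below.

```python
# Pearson's product-moment correlation coefficient, as defined above.
from math import sqrt

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)   # the (n-1) factors cancel

A = [1.0, 2.0, 3.0, 4.0]
B = [2.0, 4.1, 5.9, 8.0]   # roughly 2*A: strongly positively correlated
r = pearson_r(A, B)
```

For B exactly equal to 2·A the coefficient is 1; the small perturbations above pull it just below 1.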
Data Integration : Correlation Analysis (Categorical Data)

Χ² (chi-square) test:

  χ² = Σ (Observed - Expected)² / Expected

The larger the χ² value, the more likely the variables are related

The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count

Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

Χ² calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

  χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the group
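The same χ² value can be computed from the observed counts alone: the expected count for each cell follows from the row and column margins, e.g., 450 · 300 / 1500 = 90.

```python
# Chi-square statistic for the slide's contingency table. Expected
# counts are derived from the row/column marginal totals.
observed = [[250, 200],    # like science fiction
            [50, 1000]]    # does not like science fiction

row_sums = [sum(row) for row in observed]          # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]    # [300, 1200]
total = sum(row_sums)                              # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / total      # expected count
        chi2 += (o - e) ** 2 / e
```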
Data Transformation

Smoothing: remove noise from data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling

Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation : Normalization

Min-max normalization: to [new_min_A, new_max_A]

  v′ = (v - min_A) / (max_A - min_A) · (new_max_A - new_min_A) + new_min_A

– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  (73,600 - 12,000) / (98,000 - 12,000) · (1.0 - 0) + 0 = 0.716

Z-score normalization (μ: mean, σ: standard deviation):

  v′ = (v - μ_A) / σ_A

– Ex. Let μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

  v′ = v / 10^j,  where j is the smallest integer such that max(|v′|) < 1
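The three normalizations, applied to the slide's income example; the decimal-scaling input (986) is made up to show j being chosen.

```python
# Min-max, z-score, and decimal-scaling normalization, as defined above.
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scale(v, max_abs):
    # Find the smallest j such that max(|v'|) < 1, then divide by 10^j.
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

mm = min_max(73_600, 12_000, 98_000)    # income example -> ~0.716
z = z_score(73_600, 54_000, 16_000)     # income example -> 1.225
d = decimal_scale(986, 986)             # j = 3, so 986 -> 0.986
```

Min-max requires knowing the attribute's range in advance; z-score is preferred when the range is unknown or outliers dominate the min/max.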
Data Reduction Strategies

Why data reduction?
– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction
– Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results

Data reduction strategies
– Aggregation
– Sampling
– Dimensionality reduction
– Feature subset selection
– Feature creation
– Discretization and binarization
– Attribute transformation
Data Reduction : Aggregation

Combining two or more attributes (or objects) into a single attribute (or object)

Purpose
– Data reduction
  Reduce the number of attributes or objects
– Change of scale
  Cities aggregated into regions, states, countries, etc.
– More “stable” data
  Aggregated data tends to have less variability
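The "less variability" point can be illustrated with made-up monthly precipitation for three years: individual months vary widely with the seasons, but the yearly averages barely differ.

```python
# Aggregation reduces variability: the standard deviation of yearly
# average precipitation is far smaller than that of the raw monthly
# values. The data below is invented for illustration.
from statistics import stdev

monthly = [
    [10, 20, 80, 120, 150, 90, 40, 30, 25, 15, 10, 8],   # year 1
    [12, 18, 78, 125, 148, 92, 38, 32, 24, 16, 11, 7],   # year 2
    [11, 22, 82, 118, 152, 88, 42, 28, 26, 14, 9, 9],    # year 3
]

all_months = [v for year in monthly for v in year]
yearly_means = [sum(year) / 12 for year in monthly]

monthly_sd = stdev(all_months)     # large: seasonal swings dominate
yearly_sd = stdev(yearly_means)    # small: averaging cancels them out
```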
Data Reduction : Aggregation

[Figure: Variation of Precipitation in Australia: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
Data Reduction : Sampling

Sampling is the main technique employed for data selection
– It is often used for both the preliminary investigation of the data and the final data analysis

Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming

Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming
Data Reduction : Types of Sampling

Simple random sampling
– There is an equal probability of selecting any particular item

Sampling without replacement
– As each item is selected, it is removed from the population

Sampling with replacement
– Objects are not removed from the population as they are selected for the sample
– In sampling with replacement, the same object can be picked more than once
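With the standard library, the two schemes look like this; the population and seed below are arbitrary choices for reproducibility.

```python
# Simple random sampling without and with replacement.
import random

population = list(range(100))
rng = random.Random(42)   # seeded so the sketch is reproducible

# Without replacement: each item can appear at most once in the sample.
without = rng.sample(population, 10)

# With replacement: each draw is independent, so repeats are possible.
with_repl = [rng.choice(population) for _ in range(10)]
```

`random.choices(population, k=10)` is an equivalent one-call form of the with-replacement loop.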
Data Reduction : Dimensionality Reduction

Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction : PCA

The goal is to find a projection that captures the largest amount of variation in the data

[Figure: 2-D point cloud in the (x1, x2) plane with principal axis e]
Dimensionality Reduction : PCA

Find the eigenvectors of the covariance matrix
The eigenvectors define the new space

[Figure: the same point cloud with eigenvector e as the new axis]
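For 2-D data, the two steps above fit in a few lines of plain Python: build the covariance matrix, get its eigenvalues in closed form (which exists for a symmetric 2x2 matrix), and measure how much variation the first component captures. The points below are made up and lie roughly along a line.

```python
# From-scratch 2-D PCA sketch: covariance matrix, eigenvalues, and the
# fraction of total variance along the first principal component.
from math import sqrt

data = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix, in closed form
tr, det = sxx + syy, sxx * syy - sxy ** 2
lam1 = (tr + sqrt(tr * tr - 4 * det)) / 2   # largest eigenvalue
lam2 = (tr - sqrt(tr * tr - 4 * det)) / 2

# Fraction of total variance captured by the first component
explained = lam1 / (lam1 + lam2)
```

Because the points are nearly collinear, the first component captures almost all the variance; in higher dimensions one would use a library eigendecomposition instead of the closed form.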
Data Reduction : Feature Subset Selection

Another way to reduce the dimensionality of data

Redundant features
– duplicate much or all of the information contained in one or more other attributes
– Example: purchase price of a product and the amount of sales tax paid

Irrelevant features
– contain no information that is useful for the data mining task at hand
– Example: students’ ID is often irrelevant to the task of predicting students’ GPA
Data Reduction : Feature Subset Selection

Techniques:
– Brute-force approach:
  Try all possible feature subsets as input to the data mining algorithm
– Filter approaches:
  Features are selected before the data mining algorithm is run
– Wrapper approaches:
  Use the data mining algorithm as a black box to find the best subset of attributes
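The brute-force idea can be sketched with a stand-in scorer: the feature names and scoring function below are hypothetical, playing the role of "run the mining algorithm on this subset and measure quality".

```python
# Brute-force feature subset selection: enumerate every non-empty
# subset and keep the one the (black-box) scorer likes best.
from itertools import combinations

features = ["income", "age", "zip", "student_id"]

def score(subset):
    # Hypothetical black-box scorer: rewards the two genuinely useful
    # features and penalizes each irrelevant/redundant one.
    useful = {"income", "age"}
    return 10 * len(useful & set(subset)) - len(set(subset) - useful)

best = max(
    (set(c) for r in range(1, len(features) + 1)
     for c in combinations(features, r)),
    key=score,
)
```

With d features this enumerates 2^d - 1 subsets, which is exactly why filter and wrapper heuristics exist for realistic d.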
Data Reduction : Feature Creation

Create new attributes that can capture the important information in a data set much more efficiently than the original attributes

Three general methodologies:
– Feature extraction
  domain-specific
– Mapping data to a new space
– Feature construction
  combining features
Data Reduction : Mapping Data to a New Space

Fourier transform
Wavelet transform

[Figure: two sine waves; the same two sine waves plus noise; their frequency spectra]
Data Reduction : Discretization Using Class Labels

Entropy-based approach

[Figure: discretization into 3 categories for both x and y vs. 5 categories for both x and y]
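A minimal entropy-based split on a single attribute can be written directly from the definition: try each midpoint between consecutive values and keep the cut that minimizes the weighted class entropy of the two resulting intervals. The labeled points below are made up and separate cleanly around x = 3.5.

```python
# One step of entropy-based discretization: choose the cut point that
# minimizes the weighted class entropy of the two intervals it creates.
from math import log2

points = [(1.0, "a"), (2.0, "a"), (3.0, "a"),
          (4.0, "b"), (5.0, "b"), (6.0, "b")]

def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * log2(p)
    return ent

def best_split(pts):
    pts = sorted(pts)
    xs = [x for x, _ in pts]
    best_cut, best_ent = None, float("inf")
    for i in range(1, len(pts)):
        cut = (xs[i - 1] + xs[i]) / 2      # midpoint between neighbors
        left = [c for x, c in pts if x <= cut]
        right = [c for x, c in pts if x > cut]
        w = (len(left) * entropy(left)
             + len(right) * entropy(right)) / len(pts)
        if w < best_ent:
            best_cut, best_ent = cut, w
    return best_cut, best_ent

cut, ent = best_split(points)
```

Recursive application of this split (with a stopping criterion such as MDL) yields a full multi-interval discretization.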
Data Reduction : Discretization Without Using Class Labels

[Figure: original data discretized by equal interval width, equal frequency, and K-means]
Data Reduction : Attribute Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and normalization