
Data Mining Techniques and Applications, 1st edition
Hongbo Du

ISBN 978-1-84480-891-5 © 2010 Cengage Learning

Chapter Three

Data, pre-processing and exploration



Chapter Overview

• Data, data types and operations
• Properties of various data sets
• Data source and data warehouse
• Issues of data quality
• Data pre-processing operations
• Data summary and visualisation
• Online analytic processing (OLAP)
• Data exploration and visualisation in Weka



Data, Data Types and Operations

• Data object and attributes
  – Data object or instance: an individual, independent recording of a real-life object or event
  – Characterised by its recorded values on a fixed set of features or attributes
  – Feature or attribute: a specific property or characteristic of the data object
  – Measurement: assigning a valid value to an attribute according to an appropriate measurement scale
  – Collection: collecting measurement results or recorded values




• Data object and attributes (cont’d)
  – An example:
    123, “John Smith”, “03/02/1990”, 20, “male”, 1.82, 78

    Value          Attribute     How obtained
    123            ID number     collected
    “John Smith”   Name          collected
    “03/02/1990”   Birthday      collected
    20             Age           calculated
    “male”         Gender        collected
    1.82           Body height   measured
    78             Body weight   measured



Data, Data Types and Operations

• Data object and attributes (cont’d)
  – Measurement and measurement errors
    • Precision: the closeness of repeated measurements to one another, represented by the standard deviation of the measurements, e.g. repeated measurements of body temperature
    • Bias: a systematic deviation of the measurements from the quantity being measured; it can only be detected when an external reference is available, e.g. bias in a weighing instrument
    • Accuracy: the closeness of the measurement to the true value, indicated by the number of significant digits used in the measurement, e.g. measuring money in pounds vs. pence
  – Collection errors
    • Incorrect data recording at the point of entry, e.g. “Hongpo Do” entered for “Hongbo Du”
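The difference between precision and bias can be illustrated with a minimal Python sketch. The reference weight and the readings below are hypothetical values, not taken from the chapter; precision is estimated as the sample standard deviation of repeated readings, and bias as the offset of their mean from the external reference.

```python
import statistics

# Hypothetical example: five repeated readings of a 70.0 kg reference weight.
reference = 70.0
readings = [70.4, 70.6, 70.5, 70.3, 70.7]

# Precision: how close the readings are to one another (sample standard deviation).
precision = statistics.stdev(readings)

# Bias: systematic offset of the readings from the external reference value.
bias = statistics.mean(readings) - reference

print(f"precision (standard deviation): {precision:.3f}")
print(f"bias: {bias:+.3f}")
```

In this made-up case the instrument is fairly precise (small spread) but biased (it reads about half a kilogram too high), which is only visible because a reference value is available.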




• Attribute domain types and operations
  – Categorical/Qualitative types
    • Nominal, e.g. Gender (M, F)
      – A set of names: no concept of order or difference
      – Operators applicable: =, ≠
      – 1:1 transformation permissible, e.g. ID: 11 → e901
    • Ordinal, e.g. Grade (A, B, C, D, E)
      – A set of names: with order but no concept of difference
      – Operators applicable: =, ≠, <, >, ≤, ≥
      – Order-preserving transformation permitted, e.g. Grade: A → First, B → Second, C → Third, D → Pass, E → Bare Pass




• Attribute domain types and operations
  – Numeric/Quantitative types
    • Interval, e.g. Temperature in °C
      – A set of numeric values: both order and difference exist
      – Operators applicable: =, ≠, <, >, ≤, ≥, +, -
      – e.g. temperature (°F and °C), calendar year
      – Transformation new = a*old + b permitted, e.g. °F ↔ °C
    • Ratio, e.g. Length
      – A set of numeric values: order, difference and ratio exist
      – The set has an absolute zero
      – Operators applicable: =, ≠, <, >, ≤, ≥, +, -, ×, ÷
      – Transformation new = a*old permitted, e.g. metres ↔ feet
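Why ratios are meaningful on a ratio scale but not on an interval scale can be checked with a small Python sketch. The function names and sample values are illustrative assumptions; the conversion constants are the standard Celsius-to-Fahrenheit and metre-to-foot factors.

```python
def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new = a*old + b
    return 9 / 5 * c + 32

def metres_to_feet(m):
    # Ratio-scale transformation: new = a*old
    return 3.28084 * m

c1, c2 = 10.0, 20.0
ratio_c = c2 / c1                                                # 2.0 in Celsius
ratio_f = celsius_to_fahrenheit(c2) / celsius_to_fahrenheit(c1)  # 68/50 in Fahrenheit

m1, m2 = 2.0, 4.0
ratio_m = m2 / m1                                                # 2.0 in metres
ratio_ft = metres_to_feet(m2) / metres_to_feet(m1)               # still 2.0 in feet

print(ratio_c, ratio_f)   # the ratio changes with the unit
print(ratio_m, ratio_ft)  # the ratio survives the change of unit
```

"Twice as hot" therefore depends on the unit chosen, whereas "twice as long" does not, which is why × and ÷ are only listed for ratio attributes.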



Data Sets

• Various forms
  – Table of records
    • Relational table
    • Join of relational tables
    • Numerical spreadsheet (data matrix)
    • Boolean strings (document-term matrix)
  – Ordered data
    • Time series and temporal sequence
    • Data sequence
    • Spatial data
  – Graph-based data
  – Non record-based data



Data Sets

• Various forms (illustrated)

Relational Table

  Age Group    Own Car  Income Band  Class
  young        yes      low          risky
  young        no       low          risky
  middle aged  yes      middle       risky
  middle aged  no       high         safe
  middle aged  yes      low          risky
  young        yes      high         risky
  middle aged  no       low          safe
  retired      yes      middle       safe
  retired      no       middle       safe
  retired      yes      high         safe

Transaction Database

  TID  Items
  100  apple, beer, newspaper
  200  apple, beef, beer, newspaper, potato
  300  beef, potato
  400  beef, noodles
  500  beef, potato

Data Matrix
  [Figure: data matrix]

Web Structure
  [Figure: linked web pages: Page1 (link1, link2), Page2 (link3), Page3 (xxxx, yyyy), Page4 (www, zzzz)]

Data Sequence
  GGTTCCGCCTTCAGCCCCGCGCCCGCAGGG…

Spatial Data
  [Figure: spatial data]



Data Sets

• Properties
  – Type: file structure, e.g. ARFF for Weka, DAT for See5
  – Size: measured in terms of the total number of records or the total number of bytes, e.g. small (MB), medium (GB) and large (TB)
  – Dimensionality: number of attributes
  – Sparsity:
    • Values are skewed to some extreme or sub-ranges
    • Asymmetric values (some are more important than others)
  – Resolution:
    • Right level of data details
    • Related to the intended purpose



Data Sets

• Properties (example insurance data set)
  – Type: ARFF
  – Size: 14722 records
  – Dimensionality: 7
  – Asymmetric: Y/N
  – Skewed?
  – Resolution: detailed



Data Source and Data Warehouse

• Sources of data
  – Local data source available
  – Local operational systems from different departments
  – Third-party external data source
  – Enterprise/Organisational data warehouse
    • An organisational database for decision making
    • A central data repository separate from operational systems
    • Enforcing organisation-wide data consistency and integration
    • Providing data details as well as data summarisation
    • Providing data values as well as meta-data
    • Equipped with data analysis and reporting tools
    • As a data source for data mining



Data Source and Data Warehouse

• Star schema for data warehouse
  – Central fact table
  – Dimension tables
  – Limited use of join operations

  Example schema (Supply is the central table; Part, Supplier and Project surround it):
    Part(p#, pname, weight, colour)
    Supplier(s#, sname, city, status)
    Project(pj#, jname, status, date)
    Supply(s#, p#, pj#, qty)
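A star schema of this shape can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the rows are invented for illustration, and the key columns are renamed to s_no, p_no and pj_no because "#" is not valid in an unquoted SQLite identifier. The query joins the fact table to two of its dimension tables and aggregates the quantity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Part(p_no INTEGER PRIMARY KEY, pname TEXT, weight REAL, colour TEXT);
CREATE TABLE Supplier(s_no INTEGER PRIMARY KEY, sname TEXT, city TEXT, status TEXT);
CREATE TABLE Project(pj_no INTEGER PRIMARY KEY, jname TEXT, status TEXT, date TEXT);
-- Fact table: one row per supply event, referencing each dimension by key.
CREATE TABLE Supply(s_no INTEGER, p_no INTEGER, pj_no INTEGER, qty INTEGER);

-- Hypothetical rows, for illustration only.
INSERT INTO Part VALUES (1, 'bolt', 0.10, 'red'), (2, 'nut', 0.05, 'blue');
INSERT INTO Supplier VALUES (1, 'Acme', 'London', 'active'), (2, 'Best', 'Paris', 'active');
INSERT INTO Project VALUES (1, 'Bridge', 'open', '2006-04-04');
INSERT INTO Supply VALUES (1, 1, 1, 100), (1, 2, 1, 50), (2, 1, 1, 30);
""")

# Each dimension is exactly one join away from the fact table.
rows = conn.execute("""
SELECT su.city, pa.pname, SUM(f.qty)
FROM Supply AS f
JOIN Supplier AS su ON su.s_no = f.s_no
JOIN Part AS pa ON pa.p_no = f.p_no
GROUP BY su.city, pa.pname
ORDER BY su.city, pa.pname
""").fetchall()

for row in rows:
    print(row)
```

Because every dimension table links directly to the fact table, no query needs a chain of joins through intermediate tables, which is the sense in which join use is limited.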



Issues of Data Quality

• Main quality indicators
  – Accuracy: data recorded with sufficient precision and little bias
  – Correctness: data recorded without errors or spurious objects
  – Completeness: whether any parts of the data records are missing
  – Consistency: compliance with established rules and constraints
  – Redundancy: unnecessary duplicates

• Use the indicators to quantify the quality of a data set
• Improve the quality if possible



Issues of Data Quality

• Some examples
  – Accuracy and correctness with the road accident reports in Exercise 1.3(c)
  – Completeness with the UK family expenditure surveys in Exercise 1.3(a)
  – Incompleteness introduced by data integration using the outer join operation
  – Consistency in questionnaires, e.g. eating fruit and veg:
      Q1: “give the fruit&veg portion consumed yesterday”: 2
      Q2: “give the fruit&veg portion consumed today”: 3
      Q3: “do you eat more today than yesterday?”: No
  – Redundancy in a local company’s database of 40,000 records about 15,000 client companies
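The questionnaire example amounts to a rule that can be checked mechanically. The sketch below is one possible encoding, with a hypothetical function name and field names: the yes/no answer to Q3 should agree with the comparison of the portions reported in Q1 and Q2.

```python
def is_consistent(portions_yesterday, portions_today, ate_more_today):
    """Q3 must agree with the portions reported in Q1 and Q2."""
    return (portions_today > portions_yesterday) == ate_more_today

# The response from the slide: Q1 = 2, Q2 = 3, Q3 = "No".
response = {"Q1": 2, "Q2": 3, "Q3": False}
consistent = is_consistent(response["Q1"], response["Q2"], response["Q3"])

print(consistent)  # False: 3 > 2, yet the respondent answered "No"
```

A rule like this flags the record as inconsistent but cannot say which of the three answers is wrong; that still needs a decision during data cleaning.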



Issues of Data Quality

• Why is quality important?
  – “Garbage in, garbage out!”
  – Total data quality control requires a cultural change (compare with total product quality control)
  – For data mining, tackling the quality issue at the data source cannot always be expected. Instead:
    • Clean the data as much as possible
    • Develop and use more tolerant mining solutions
  – Data quality is relevant to the intended purpose of data mining, e.g. do spelling errors in student names really matter when only the increase or decrease of student numbers in particular subject areas over the years is of interest?



Data Pre-processing

• Overview
  – Purpose: speedy, cost-effective and high-quality outcomes of data mining
  – Pre-processing tasks (not all are independent of each other)
    • Data aggregation
    • Data sampling
    • Dimension reduction
    • Feature selection
    • Feature creation
    • Discretisation/binarisation
    • Variable transformation
    • Dealing with missing values



Data Pre-processing

• Data aggregation
  – What: to summarise low-level data details into a higher-level data abstraction
  – Why: to reduce the time of mining, to rescale data values, and to discover more stable patterns
  – How:
    • By generalisation using a given concept hierarchy
    • By applying aggregate functions (e.g. count, sum, average)
    • By dropping some attributes

  Original transactions:

    TID    Date        Item     Store       Price  Clubcard#
    ……     ……          ……       ……          ……     ……
    32144  06/06/2006  milk     Buckingham   1.99  1111
    11122  04/04/2006  watch    Buckingham  25.99  1011
    11122  04/04/2006  battery  Buckingham   3.99  1011
    11123  04/04/2006  beer     Buckingham   9.99  1022
    22244  04/04/2006  beer     MK           6.99  1022
    22244  04/04/2006  nappies  MK          10.89  1022
    23311  05/04/2006  beer     MK           6.99  1011
    ……     ……          ……       ……          ……     ……

  Aggregated by date and store:

    Date        Store       AveragePrice
    ……          ……          ……
    06/06/2006  Buckingham   1.99
    04/04/2006  Buckingham  13.32
    04/04/2006  MK           8.94
    05/04/2006  MK           6.99
    ……          ……          ……

  Aggregated by club card:

    Number of Items  Total Price  Clubcard#
    ……               ……           ……
    1                 1.99        1111
    3                36.97        1011
    2                27.87        1022
    ……               ……           ……
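The two aggregated tables can be reproduced from the seven visible transaction rows with a short Python sketch using only the standard library. One assumption is made: "Number of Items" is read as the number of distinct items per club card, which is the reading that matches the value 2 shown for card 1022.

```python
from collections import defaultdict

# The seven visible rows: (TID, date, item, store, price, clubcard).
transactions = [
    (32144, "06/06/2006", "milk", "Buckingham", 1.99, 1111),
    (11122, "04/04/2006", "watch", "Buckingham", 25.99, 1011),
    (11122, "04/04/2006", "battery", "Buckingham", 3.99, 1011),
    (11123, "04/04/2006", "beer", "Buckingham", 9.99, 1022),
    (22244, "04/04/2006", "beer", "MK", 6.99, 1022),
    (22244, "04/04/2006", "nappies", "MK", 10.89, 1022),
    (23311, "05/04/2006", "beer", "MK", 6.99, 1011),
]

# Aggregate 1: average price per (date, store); TID, item and card are dropped.
prices = defaultdict(list)
for _tid, date, _item, store, price, _card in transactions:
    prices[(date, store)].append(price)
average_price = {key: round(sum(v) / len(v), 2) for key, v in prices.items()}

# Aggregate 2: distinct item count and total price per club card.
items = defaultdict(set)
totals = defaultdict(float)
for _tid, _date, item, _store, price, card in transactions:
    items[card].add(item)
    totals[card] += price
per_card = {card: (len(items[card]), round(totals[card], 2)) for card in totals}

print(average_price)
print(per_card)
```

Both aggregates combine the techniques listed above: an aggregate function (average, count, sum) is applied per group, and the attributes that no longer make sense at the coarser level are dropped.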



Data Pre-processing

• Data sampling
  – What: selecting a subset of the given data set
  – Why: to make it possible to use sophisticated mining algorithms within a time limit
  – Caution: the sample must be representative of the original data set
  – How:
    • Random sampling
    • Stratified sampling
    • Progressive sampling
    • With or without replacement

  [Figure: a sampling method selects a subset from the data population]
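Random and stratified sampling can be sketched with Python's random module. The function names, the fixed seeds and the toy population of 20 "risky" and 80 "safe" records are assumptions made for illustration; progressive sampling is not shown.

```python
import random
from collections import defaultdict

def random_sample(records, n, seed=0):
    """Simple random sample without replacement."""
    return random.Random(seed).sample(records, n)

def random_sample_with_replacement(records, n, seed=0):
    """Simple random sample with replacement (a record may be drawn twice)."""
    return random.Random(seed).choices(records, k=n)

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical population: 20 risky and 80 safe customers.
population = [("risky", i) for i in range(20)] + [("safe", i) for i in range(80)]
sample = stratified_sample(population, key=lambda r: r[0], fraction=0.1)

print(len(sample), sum(1 for r in sample if r[0] == "risky"))
```

Stratifying on the class keeps the 20:80 proportion in the sample, whereas a simple random sample of ten records could by chance contain no risky customers at all.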



Data Pre-processing

• Feature selection
  – What: reducing dimensionality by selecting a subset of attributes
  – Purposes:
    • To remove or reduce redundant features
    • To remove irrelevant features that carry no useful information for the mining task
  – How:
    • Manually, with common sense and domain knowledge
    • Letting the mining solution select suitable features (the embedded approach)
    • Filter and wrapper approaches

  [Figure: feature selection flowchart with the stages: attributes, subset selection, one subset, evaluation, stopping criterion, selected subset, validate with mining task (ok / not ok)]
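A filter approach scores each feature independently of any mining algorithm and keeps the best-scoring ones. The sketch below is one simple instance, not the chapter's own method: it ranks numeric features by the absolute Pearson correlation with a numeric target. The feature names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def filter_select(features, target, k):
    """Keep the k features with the largest |correlation| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: predict body weight from three candidate features.
features = {
    "height": [1.60, 1.70, 1.80, 1.90],
    "shoe_size": [38, 41, 43, 46],
    "noise": [5, 1, 4, 2],
}
weight = [55, 65, 78, 90]

selected = filter_select(features, weight, k=2)
print(selected)
```

This filter removes the irrelevant feature but not the redundant one: height and shoe size are both kept although they carry much the same information. Detecting that requires looking at features jointly, as the wrapper approach does.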



Data Pre-processing

• Data dimension reduction
  – What: to reduce the redundancy implied among attributes, e.g. are all 9600 dimensions of a 120x80 pixel image necessary?
  – Curse of dimensionality: as dimensionality increases
    • Data become more diverse, and any patterns become less significant and more peculiar
    • The processing time may increase substantially
  – Why: to reduce redundancy and the effects of the curse
  – How:
    • Linear algebra techniques
      – Principal component analysis (PCA)
      – Independent component analysis (ICA)
      – Singular value decomposition (SVD)
    • Feature selection (as described before)



Data Pre-processing

• Feature creation
  – What: to create a new set of features from the original features
  – Purpose: in the new feature space, meaningful and relevant patterns can be extracted more easily. The number of features may also be reduced.
  – How:
    • Using feature extraction methods to extract new features from the existing ones, e.g. extracting colour, texture and shape from an image of pixel values
    • Mapping data to a new space, e.g. wavelet transformation of the pixel values of images to a frequency domain
    • Constructing new features from the existing ones using domain knowledge, e.g. using transaction dates to construct a new feature, customer tenure, that indicates the loyalty of the customer to the company



Data Pre-processing

• Data discretisation
  – What: to convert continuous attribute values to discrete categorical values
  – The purposes:
    • Requirement for some data mining solutions
    • Better data mining results (not always)
  – How:
    1. Deciding how many categories to have and where the split points should be
    2. Mapping values to categories

  [Figure: a value range divided at split points t1, t2, t3, t4; step 1 determines the number and locations of the split points, step 2 maps the values within each sub-range to a category label]
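The two steps can be sketched in Python for the two simplest unsupervised choices of split points, equal width and equal depth (equal frequency). The function names and the sample ages are illustrative assumptions.

```python
def equal_width_splits(values, k):
    """Step 1 (equal width): k intervals of the same length over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / k
    return [low + i * width for i in range(1, k)]

def equal_depth_splits(values, k):
    """Step 1 (equal depth): k intervals holding roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def to_category(value, splits):
    """Step 2: the category index is the number of split points at or below the value."""
    return sum(value >= s for s in splits)

ages = [18, 19, 21, 22, 25, 30, 41, 58, 63]

width_splits = equal_width_splits(ages, 3)
depth_splits = equal_depth_splits(ages, 3)

print(width_splits, [to_category(a, width_splits) for a in ages])
print(depth_splits, [to_category(a, depth_splits) for a in ages])
```

On skewed data such as these ages, equal width puts most values into the first interval, while equal depth gives three values to each interval at the cost of intervals with very different widths.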



Data Pre-processing

• Data discretisation (cont’d)
  – Discretisation methods:
    • Unsupervised: without regard to the outcome of a specific attribute, normally used for clustering and association rule mining, e.g. equal width, equal depth, clustering
    • Supervised: with respect to the outcome of the class attribute, normally used for classification
      – Simple methods: sorting according to the class attribute, and then discretising the attribute values for each class
      – Sophisticated methods: the discretisation of the attribute values purifies the outcome of the class, e.g. using entropy to measure the degree of purity and deciding the split points recursively, similar to decision tree induction
      – Merging methods: merging small intervals into a larger one until a stopping criterion is met
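One step of an entropy-based supervised method can be sketched as follows: among all candidate split points, choose the one that leaves the two resulting intervals with the lowest weighted class entropy. The function names and the sample data are assumptions; a complete method would apply the step recursively to each interval until a stopping criterion is met.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of labels; 0 means the interval is pure."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split point, weighted entropy) of the purest binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        candidate = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint of neighbours
        if best is None or weighted < best[1]:
            best = (candidate, weighted)
    return best

# Hypothetical data: age with a class attribute.
ages = [22, 25, 30, 41, 58, 63]
classes = ["risky", "risky", "risky", "safe", "safe", "safe"]

split, impurity = best_split(ages, classes)
print(split)
```

Here the split at 35.5 separates the two classes completely, so both intervals are pure and no further splitting is needed.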



Data Pre-processing

• Data binarisation
  – What: to convert discrete categorical values to binary (Boolean) attribute values
  – The purpose: the same as for discretisation
  – How:
    • Convert the m categorical values to integers in [0, m-1]
    • Convert each integer to a binary number of n bits, where n = ⌈log2 m⌉
    • Alternatively, use m asymmetric binary variables, one for each of the m values
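The three encodings can be produced side by side with a small Python sketch. The function name is an assumption, and the grade values are borrowed from the ordinal example earlier in the chapter.

```python
from math import ceil, log2

def binarise(categories):
    """Map each category to (integer code, n-bit binary code, m one-per-value flags)."""
    m = len(categories)
    n_bits = max(1, ceil(log2(m)))  # n = ceil(log2 m), at least one bit
    table = {}
    for code, value in enumerate(categories):             # integer in [0, m-1]
        bits = [int(b) for b in format(code, f"0{n_bits}b")]
        flags = [int(i == code) for i in range(m)]         # exactly one flag is 1
        table[value] = (code, bits, flags)
    return table

grades = ["A", "B", "C", "D", "E"]
table = binarise(grades)

for grade, encoding in table.items():
    print(grade, encoding)
```

The n-bit code is compact (3 bits for 5 grades) but its individual bits have no meaning of their own, whereas each of the m flags answers a direct question such as "is the grade D?", which is the form association rule mining usually expects.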



Data Pre-processing

• Variable transformation
  – What: to transform all values of an attribute to other values
  – The purposes:
    • Remove the effect of outlier values
    • Make the resulting data visualisation more interpretable
    • Make the values more comparable
  – How:
    • Transformation using a function, e.g. log(x)
    • Standardisation/normalisation, e.g. division-by-range
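Both kinds of transformation can be sketched in a few lines of Python. The income values are hypothetical, and "division-by-range" is interpreted here as (x - min) / (max - min), which is one common reading and not necessarily the exact formula used in the book.

```python
from math import log10

def log_transform(values):
    """Compress large values; requires strictly positive inputs."""
    return [log10(v) for v in values]

def divide_by_range(values):
    """Rescale to [0, 1] using (x - min) / (max - min)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Hypothetical incomes with one extreme value.
incomes = [10, 100, 1000, 100000]

logged = log_transform(incomes)
scaled = divide_by_range(incomes)

print(logged)
print(scaled)
```

The log transformation pulls the outlier towards the other values, while division-by-range only rescales: the outlier still pushes the remaining values close to 0.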



Data Pre-processing

• Handling missing values
  – What: to treat attributes with null values
  – The purposes:
    • Improve data quality
    • Better mining results
  – How:
    • Elimination (may not always be possible)
    • Using a sensible default, e.g. Spending Amount is set to 0
    • By data imputation
      – Average, median or mode of the whole data population
      – Average, median or mode of the nearest neighbours
    • Postponing the handling and making the mining methods adaptive to missing values
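Imputation from the whole data population can be sketched with Python's statistics module. Missing values are represented as None, and the function name and sample values are assumptions; imputation from the nearest neighbours is not shown.

```python
from statistics import mean, median, mode

def impute(values, strategy=mean):
    """Replace None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical attributes with one missing value each.
heights = [1.82, None, 1.70, 1.76]
genders = ["male", None, "female", "male"]

print(impute(heights))            # numeric attribute: mean of observed values
print(impute(heights, median))    # numeric attribute: median
print(impute(genders, mode))      # categorical attribute: most frequent value
```

The mean and median apply only to numeric attributes; for a categorical attribute such as gender, the mode is the only one of the three that is defined.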



Data Exploration

• Exploring data before mining
  – Knowing the data is essential for successful data mining
  – Purposes:
    • Better understanding of the characteristics of the data
    • Better decisions over data pre-processing tasks
    • Even being able to discover some hidden patterns
  – Categories of data exploration techniques
    • Summary statistics: using a small set of descriptors to describe the characteristics of a large data set
    • Data visualisation: using graphical or tabular forms to reveal hidden data patterns
    • Online Analytic Processing (OLAP)
  – Data exploration and exploratory data analysis (EDA)



Data Exploration

• Summary statistics
  – Frequency and mode for categorical attributes:
    • Frequency of a value
    • Mode: the most frequently occurring value
  – Percentiles for ordinal or continuous attributes:
    • Given an attribute x and an integer p (0 ≤ p ≤ 100), the percentile x_p is a value of x such that p% of the observed values of x are less than x_p
  – Mean and median for continuous attributes:
    • Mean and median
    • The median is a better indication of the “average” when the data distribution is skewed or outliers are present
  – Trimmed mean and median (after trimming the top and bottom p%)
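The effect of an outlier on these statistics can be seen in a short Python sketch. The salary figures are hypothetical. The percentile function uses the nearest-rank convention, which is one of several conventions in use and differs slightly at the boundaries from the "less than" wording above.

```python
from math import ceil
from statistics import mean, median

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def trimmed_mean(values, p):
    """Mean after removing the top and bottom p% of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * p / 100)
    return mean(ordered[k:len(ordered) - k])

# Hypothetical salaries (in thousands) with one outlier.
salaries = [20, 22, 23, 25, 26, 28, 30, 31, 33, 200]

print(mean(salaries))              # pulled up by the outlier
print(median(salaries))            # barely affected
print(trimmed_mean(salaries, 10))  # outlier removed before averaging
print(percentile(salaries, 50))
```

The mean is well above nine of the ten salaries, while the median and the 10% trimmed mean both stay close to the bulk of the data.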



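Data Exploration

• Summary statistics (cont’d)
  – Measures of spread:
    • Range: $\mathrm{range}(x) = \max(x) - \min(x)$
    • Variance ($\sigma^2$): $\sigma_x^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2$
    • Standard deviation ($\sigma$): $\sigma_x = \sqrt{\frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2}$
    • Absolute average deviation (AAD): $\mathrm{AAD}(x) = \frac{1}{m}\sum_{i=1}^{m}|x_i - \bar{x}|$
  – Multivariate summary statistics
    • Mean vector: $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$
    • Matrix of covariance: $\mathrm{covariance}(x, y) = \frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})$
    • Correlation: $\mathrm{correlation}(x, y) = \frac{\mathrm{covariance}(x, y)}{\sigma_x \sigma_y}$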
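The measures of spread and the bivariate statistics translate directly into Python. This is a sketch under the usual textbook definitions, assuming the sample form with divisor m - 1 for variance and covariance and divisor m for AAD; the two sample attributes are made up.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def value_range(xs):
    return max(xs) - min(xs)

def variance(xs):
    """Sample variance with divisor m - 1."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    return sqrt(variance(xs))

def aad(xs):
    """Absolute average deviation with divisor m."""
    x_bar = mean(xs)
    return sum(abs(x - x_bar) for x in xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with divisor m - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Hypothetical attributes: y is exactly twice x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(value_range(x), variance(x), aad(x))
print(covariance(x, y), correlation(x, y))
```

Covariance depends on the units of both attributes, whereas correlation is scaled by the two standard deviations and always lies between -1 and 1; for an exact linear relationship such as this one it is 1 up to rounding.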



Data Exploration

• Data visualisation
  – Rationale: human eyes are good at spotting patterns, particularly visual patterns
  – Major ways of visualising data
    • Tabular form
    • Graphical form
    • Points and links
  – The visual representation must be related to the data types of the attributes
  – Visualise the data as well as all its implicit relationships
  – The visualisation must be comprehensible
  – The visualisation of data must tell the truth



Data Exploration

• Data visualisation techniques
  [Figures: Pie Chart, Bar Chart, Stem & Leaf Plot, Scatter Plot, Parallel Dimension Chart, Star Dimension Chart]



Data Exploration

• Online analytic processing (OLAP)
  – Interactive reporting tool
  – Treating a data set as a multidimensional hypercube
  – Fast operation and fast result delivery
  – A typical OLAP query:
    “For each product, find its market share in its category today minus its market share in its category in 1994”
  – Result of the OLAP query:

    Products       Market Share Today  Market Share in 1994  Difference
    Dell 17"       17%                 10%                    7%
    HP 15"         83%                 90%                   -7%
    Intel MotherB  56%                 93%                  -37%
    …              …                   …                     …



Data Exploration

• OLAP: Multidimensional hypercube

  [Figure: a cube with the dimensions Month (Jan, Feb, March, …, Dec), Branch (Buckingham, Milton Keynes, Northampton) and Year (1998, 1999, 2000); the cell for March, Milton Keynes, 1999 holds Total Customer = 5 and the Customer Names]

  Underlying table:

    Branch Name    Customer Name  Month  Year
    Buckingham     Helen Miles    April  2000
    Buckingham     Mary Laughton  April  1999
    ……             ……             ….     ….
    Milton Keynes  Alen Young     Feb    2000
    Milton Keynes  Susan Young    April  2000
    ……             ……             ….     ….
    Northampton    Frank Sinatra  April  1998
    ……             ……             ….     ….



Data Exploration

• OLAP: Hierarchies

  Concept hierarchy from month to season:

    winter: January, February, March
    spring: April, May, June
    summer: July, August, September
    autumn: October, November, December

  [Figure: the same cube shown at two levels of the hierarchy: by season (winter, spring, summer, autumn) and by month (Jan, Feb, March, …, Dec), each against Branch (Buckingham, Milton Keynes, Northampton) and Year (1998, 1999, 2000)]



Data Exploration

• OLAP: Operations
  – Pivoting
    • Selecting attributes to define the cube
    • Visually rotating the cube to show a face
  – Slicing and dicing
    • Selecting a part of a cube
    • Visually slicing a segment of a cube along a dimension
  – Rolling-up
    • Moving up along a hierarchy
  – Drilling-down
    • Moving down along a hierarchy
  – Performing aggregate functions while rolling-up or drilling-down
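Slicing and rolling-up can be sketched in Python on the five customer records visible in the hypercube example, using the month-to-season hierarchy from the chapter. A real OLAP tool would precompute and index these aggregates; the dictionary-based cube here is only an illustration, and the variable names are assumptions.

```python
from collections import Counter

# Month-to-season hierarchy as given in the chapter (only the months used here).
SEASON = {"Jan": "winter", "Feb": "winter", "March": "winter",
          "April": "spring", "May": "spring", "June": "spring"}

# The five visible records: (branch, customer, month, year).
records = [
    ("Buckingham", "Helen Miles", "April", 2000),
    ("Buckingham", "Mary Laughton", "April", 1999),
    ("Milton Keynes", "Alen Young", "Feb", 2000),
    ("Milton Keynes", "Susan Young", "April", 2000),
    ("Northampton", "Frank Sinatra", "April", 1998),
]

# Base cube: customer count per (branch, month, year) cell.
cube = Counter((branch, month, year) for branch, _name, month, year in records)

# Slice: fix one dimension (year = 2000) and keep the remaining two.
slice_2000 = {(b, m): n for (b, m, y), n in cube.items() if y == 2000}

# Roll-up along the hierarchy: month -> season, re-aggregating the counts.
by_season = Counter()
for (branch, month, year), n in cube.items():
    by_season[(branch, SEASON[month], year)] += n

# Roll-up further: remove the time-of-year dimension altogether.
by_branch_year = Counter()
for (branch, _season, year), n in by_season.items():
    by_branch_year[(branch, year)] += n

print(slice_2000)
print(by_season)
print(by_branch_year)
```

Drilling-down is the reverse direction: starting from by_branch_year and returning to the finer by_season or month-level cells, which is only possible because the detailed data are still available.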



Data Exploration in Weka Explorer

• ARFF file format
  – Schema section
    • Data set name
    • Categorical attribute names and values
    • Numeric attribute names and types
  – Data section
    • One data record per line
    • Values separated by “,”
    • “?” represents an unknown value
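The layout of an ARFF file can be made concrete with a small Python sketch that writes one. The relation name, attribute names and rows are hypothetical, and the writer handles only simple unquoted values; values containing spaces or commas would need quoting.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: a schema section followed by a data section."""
    lines = [f"@relation {relation}", ""]
    for name, domain in attributes:
        if isinstance(domain, list):
            # Categorical attribute: list the permitted values in braces.
            lines.append(f"@attribute {name} {{{','.join(domain)}}}")
        else:
            # Numeric attribute: give the type name.
            lines.append(f"@attribute {name} {domain}")
    lines += ["", "@data"]
    for row in rows:
        # One record per line, comma-separated, "?" for an unknown value.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines)

attributes = [
    ("age", "numeric"),
    ("own_car", ["yes", "no"]),
    ("class", ["risky", "safe"]),
]
rows = [
    (25, "yes", "risky"),
    (None, "no", "safe"),   # age unknown
]

text = to_arff("customers", attributes, rows)
print(text)
```

The printed text can be saved with the .arff extension and opened in the Weka Explorer.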



Data Exploration in Weka Explorer

• Glance of an opened data set
  [Figure: Weka Explorer screenshot showing summary statistics and a visualisation of the value distribution]

Data Exploration in Weka Explorer

• Visualisation in Weka (limited)

Data Exploration in Weka Explorer

• Filters for pre-processing
  – Many filters
  – Supervised/unsupervised
  – Attribute/instance
  – Choose a filter, then set its parameters in the command line



Chapter Summary

• The domain types determine the validity of operations applied.

• Transformation from one domain to another must preserve the domain characteristics.

• Data sets can be of various forms and from different sources.

• Data warehouse serves as a data source for data mining.

• Data quality is relevant to the intended application purpose.

• Data pre-processing operations are essential for good mining.

• Knowing the data is important for good data mining.

• Understanding of data is achieved via exploring, summarising and visualising data.

• OLAP serves as a data exploration and summarisation tool.



References

Read Chapter 3 of Data Mining Techniques and Applications

Useful further references
• Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapters 2 and 3
