DATA MINING from data to information
Ronald Westra, Dep. Mathematics, Knowledge Engineering, Maastricht University
PART 1
Introduction
All information on the mathematics part of the course is available at:
http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm
Data mining - a definition
"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data
in order to discover meaningful patterns and rules."
(Berry & Linoff, 1997, 2000)
DATA MINING
Course Description:
In this course the student will be made familiar with the main topics in Data Mining, and its important role in current Computer Science. In this course we’ll mainly focus on algorithms, methods, and techniques for the representation and analysis of data and information.
DATA MINING
Course Objectives:
To get a broad understanding of data mining and knowledge discovery in databases.
To understand major research issues and techniques in this new area and conduct research.
To be able to apply data mining tools to practical problems.
LECTURE 1: Introduction
1. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996), From Data Mining to Knowledge Discovery in Databases: http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf
2. Hand, D., Mannila, H., and Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA, USA
More information on ELEUM and at: http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm
LECTURE 1: Introduction
What is Data Mining?
• data → information → knowledge
• patterns, structures, models
The use of Data Mining
• increasingly large databases, up to terabytes (TB)
• N datapoints and K components (fields) per datapoint
• not accessible for fast inspection
• incomplete data, noise, faulty design
• different numerical formats, alphanumerical and semantic fields
• necessity to automate the analysis
LECTURE 1: Introduction
Applications
• astronomical databases
• marketing/investment
• telecommunication
• industrial
• biomedical/genetics
LECTURE 1: Introduction
Historical Context
• in mathematical statistics the term has a negative connotation:
• danger of overfitting and erroneous generalisation
LECTURE 1: Introduction
Data Mining Subdisciplines
• Databases
• Statistics
• Knowledge Based Systems
• High-performance computing
• Data visualization
• Pattern recognition
• Machine learning
LECTURE 1: Introduction
Data Mining methods
• Clustering
• Classification (off- & on-line)
• (Auto)-regression
• Visualisation techniques: optimal projections and PCA (principal component analysis)
• Discriminant analysis
• Decomposition
• Parametric modelling
• Non-parametric modelling
LECTURE 1: Introduction
Data Mining essentials
• model representation
• model evaluation
• search/optimisation
Data Mining algorithms
• Decision trees/Rules
• Nonlinear regression and classification
• Example-based methods
• AI-tools: NN, GA, ...
LECTURE 1: Introduction
Data Mining and Mathematical Statistics
• when Statistics and when DM?
• is DM a sort of Mathematical Statistics?
Data Mining and AI
• AI is instrumental in finding knowledge in large chunks of data
Mathematical Principles in Data Mining
Part I: Exploring Data Space
* Understanding and Visualizing Data Space
Provide tools to understand the basic structure in databases. This is done by probing and analysing metric structure in data-space, comprehensively visualizing data, and analysing global data structure by e.g. Principal Components Analysis and Multidimensional Scaling.
* Data Analysis and Uncertainty
Show the fundamental role of uncertainty in Data Mining. Understand the difference between uncertainty originating from statistical variation in the sensing process, and from imprecision in the semantical modelling. Provide frameworks and tools for modelling uncertainty: especially the frequentist and subjective/conditional frameworks.
Mathematical Principles in Data Mining
PART II: Finding Structure in Data Space
* Data Mining Algorithms & Scoring Functions
Provide a measure for fitting models and patterns to data. This enables the selection between competing models. Data Mining Algorithms are discussed in the parallel course.
* Searching for Models and Patterns in Data Space
Describe the computational methods used for model and pattern fitting in data mining algorithms. Most emphasis is on search and optimisation methods, which are required to find the best fit between the model or pattern and the data. Special attention is devoted to parameter estimation under missing data using the maximum-likelihood EM algorithm.
Mathematical Principles in Data Mining
PART III: Mathematical Modelling of Data Space
* Descriptive Models for Data Space
Present descriptive models in the context of Data Mining. Describe specific techniques and algorithms for fitting descriptive models to data. Main emphasis here is on probabilistic models.
* Clustering in Data Space
Discuss the role of data clustering within Data Mining. Show how clustering relates to classification and search. Present a variety of paradigms for clustering data.
EXAMPLES
* Astronomical Databases
* Phylogenetic trees from DNA-analysis
Example 1: Phylogenetic Trees
The last decade has witnessed a major and historic leap in biology and all related disciplines. The date of this event can be set almost exactly to November 1999, when the Human Genome Project (HGP) was declared completed. The HGP resulted in (almost) the entire human genome, consisting of about $3.3 \times 10^{9}$ base pairs (bp) of code, constituting all of the approximately 35,000 human genes. Since then the genomes of many more animal and plant species have become available. For our purposes, we can consider the human genome as a huge database, consisting of a single string of $3.3 \times 10^{9}$ characters from the set {C, G, A, T}.
Example 1: Phylogenetic Trees
This data constitutes the human ‘source code’. From this data, in principle, all ‘hardware’ characteristics, such as physiological and psychological features, can be deduced. In this block we will concentrate on another aspect that is hidden in this information: phylogenetic relations between species. The famous evolutionary biologist Dobzhansky once remarked that ‘Nothing in biology makes sense except in the light of evolution’. This most certainly applies to the genome. Hidden in the data is the evolutionary history of the species. By systematically comparing several species with varying degrees of relatedness, we can reconstruct this evolutionary history. For instance, consider a species that lived at a certain time in Earth's history. It will be marked by a set of genes, each with a specific code (or rather, a statistical variation around the average).
Example 1: Phylogenetic Trees
If this species is for some reason distributed over a variety of non-connected areas (e.g. islands, oases, mountainous regions), animals of the species will not be able to mate at random. In the course of time, due to the accumulation of random mutations, the genomes of the separated groups will increasingly differ. This will result in the origin of sub-species, and eventually new species. Comparing the genomes of the new species will shed light on the evolutionary history, in that we can:
• draw a phylogenetic tree of the sub-species leading to the ‘founder’-species;
• estimate, given the rate of mutation, how long ago the founder-species lived;
• reconstruct the most probable genome of the founder-species.
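The comparison of genomes and the dating of the founder species can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the toy sequences, the mutation rate, the function names `p_distance` and `divergence_time`, and the naive molecular-clock model in which both lineages accumulate substitutions independently at the same constant rate. Real phylogenetic methods correct for multiple substitutions at the same site and work on aligned sequences of realistic length.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def divergence_time(distance, rate_per_site_per_year):
    """Naive molecular-clock estimate of the time since the common ancestor.

    Both lineages mutate independently, so distance ~ 2 * rate * time.
    """
    return distance / (2 * rate_per_site_per_year)


# Hypothetical aligned fragments from two sub-species (not real data).
species_a = "ACGTACGTAC"
species_b = "ACGTTCGTAA"
assumed_rate = 1e-8  # substitutions per site per year (illustrative value)

d = p_distance(species_a, species_b)
print("p-distance:", d)
print("estimated years since founder species:", divergence_time(d, assumed_rate))
```

Computing such a distance for every pair of species gives a distance matrix, which is the usual input for distance-based tree construction.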
Example 2: data mining in astronomy
DATA AS SETS OF MEASUREMENTS AND OBSERVATIONS
Data Mining Lecture II
[Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth]
LECTURE 2: DATA AS SETS OF MEASUREMENTS AND OBSERVATIONS
Readings:
• Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth.
2.1 Types of Data
2.2 Sampling
1. (Re)sampling
2. Oversampling/undersampling, sampling artefacts
3. Bootstrap and jackknife methods
2.3 Measures for Similarity and Dissimilarity
1. Phenomenological
2. Dissimilarity coefficient
3. Metric in Data Space based on distance measure
Types of data
Sampling :
– the process of collecting new (empirical) data
Resampling :
– selecting data from a larger already existing collection
Sampling
–Oversampling
–Undersampling
–Sampling artefacts (aliasing, Nyquist frequency)
Sampling artefacts (aliasing, Nyquist frequency)
Moiré fringes
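Aliasing can be made concrete with a small Python sketch. The sampling rate and signal frequencies below are assumptions chosen for illustration, and `sample_sine` is a hypothetical helper. A 9 Hz sine sampled at 10 Hz lies above the Nyquist frequency of 5 Hz, and its samples coincide with those of a 1 Hz sine of opposite sign.

```python
import math


def sample_sine(freq_hz, rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at the times t = k / rate for k = 0 .. n_samples - 1."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n_samples)]


rate = 10.0         # sampling rate in Hz (assumed for illustration)
nyquist = rate / 2  # highest frequency that can be represented without aliasing

fast = sample_sine(9.0, rate, 20)   # 9 Hz signal, above the Nyquist frequency
slow = sample_sine(-1.0, rate, 20)  # 1 Hz signal with inverted sign

# The two sample sequences agree up to rounding: 9 Hz is aliased to 1 Hz.
max_diff = max(abs(a - b) for a, b in zip(fast, slow))
print(f"Nyquist frequency: {nyquist} Hz, largest sample difference: {max_diff:.2e}")
```

From the samples alone the two signals cannot be told apart, which is why a signal must be sampled at more than twice its highest frequency.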
Resampling
Resampling is any of a variety of methods for doing one of the following:
– Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (= jackknife) or drawing randomly with replacement from a set of data points (= bootstrapping)
– Exchanging labels on data points when performing significance tests (permutation test, also called exact test, randomization test, or re-randomization test)
– Validating models by using random subsets (bootstrap, cross validation)
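The label-exchange idea behind the permutation test can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the test statistic (absolute difference in group means), the number of permutations, the function name `permutation_test` and the example data are all choices made here, not taken from the course material.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding
            at_least_as_extreme += 1
    # Counting the observed labelling itself keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups.
p_value = permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print("estimated p-value:", p_value)
```

If the labels carry no information, shuffled labellings produce differences as large as the observed one fairly often, and the p-value is large.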
Bootstrap & jackknife methods
These methods belong to inferential statistics, which accounts for randomness and uncertainty in the observations. The inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modelling of relationships (regression).
Bootstrap method
Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
"Bootstrap" means that resampling one available sample gives rise to many others, reminiscent of pulling yourself up by your bootstraps.
cross-validation: verify replicability of results
Jackknife: detect outliers
Bootstrap: inferential statistics
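Both resampling schemes can be sketched in a few lines of Python. This is a minimal illustration, assuming we want the standard error of an estimator such as the mean. The function names `bootstrap_se` and `jackknife_se`, the number of resamples and the data values are hypothetical.

```python
import math
import random
import statistics


def bootstrap_se(sample, estimator, n_resamples=2000, seed=0):
    """Standard error from resamples drawn with replacement from the sample."""
    rng = random.Random(seed)
    sample = list(sample)
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


def jackknife_se(sample, estimator):
    """Standard error from the n leave-one-out subsets of the sample."""
    sample = list(sample)
    n = len(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    centre = statistics.fmean(leave_one_out)
    return math.sqrt((n - 1) / n * sum((e - centre) ** 2 for e in leave_one_out))


# Hypothetical measurements.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
print("bootstrap SE of the mean:", bootstrap_se(data, statistics.fmean))
print("jackknife SE of the mean:", jackknife_se(data, statistics.fmean))
print("textbook SE of the mean: ", statistics.stdev(data) / math.sqrt(len(data)))
```

For the mean, the jackknife reproduces the textbook formula exactly, while the bootstrap value varies slightly with the random resamples. The same two functions accept other estimators, for example `statistics.median`.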
2.3 Measures for Similarity and Dissimilarity
1. Phenomenological
2. Dissimilarity coefficient
3. Metric in Data Space based on distance measure
2.4 Distance Measure and Metric
1. Euclidean distance
2. Metric
3. Commensurability
4. Normalisation
5. Weighted Distances
6. Sample covariance
7. Sample correlation coefficient
8. Mahalanobis distance
9. Normalised distance and Cluster Separation (see supplementary text)
10. Generalised Minkowski
2.4 Distance Measure and Metric
1. Euclidean distance
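The formula on this slide did not survive in the text. For reference, the standard definition of the Euclidean distance between two data points x and y, assuming K components per data point as in the introduction, is:

```latex
d_E(x, y) = \sqrt{\sum_{k=1}^{K} (x_k - y_k)^2} = \sqrt{(x - y)^{T} (x - y)}
```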
2.4 Distance Measure and Metric
2. Generalized p-norm
Generalized Norm / Metric
Minkowski Metric
Minkowski Metric
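Assuming these slides refer to the usual p-norm family, the Minkowski distance of order p between x and y is the p-th root of the sum of |x_k - y_k|^p over all components. The Python sketch below is an illustration under that assumption. The function name `minkowski_distance` is hypothetical, and the special cases p = 1 (Manhattan), p = 2 (Euclidean) and p = infinity (maximum or Chebyshev distance) are standard results, not taken from the slides.

```python
import math


def minkowski_distance(x, y, p=2.0):
    """Minkowski (p-norm) distance between two equally long numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of components")
    if p < 1:
        raise ValueError("p must be >= 1, otherwise the triangle inequality fails")
    if math.isinf(p):
        # Limit p -> infinity: the largest coordinate difference dominates.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


origin, point = (0.0, 0.0), (3.0, 4.0)
for order in (1, 2, math.inf):
    print(f"p = {order}: {minkowski_distance(origin, point, order)}")
```

The choice of p determines how strongly a single large coordinate difference dominates the distance.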
Generalized Minkowski Metric
The data space already has a structure.
This structure is represented by the correlations and is given by the covariance matrix G.
The Minkowski norm of a vector x is:
$$\|x\|^{2} = x^{T} G x$$
2.4 Distance Measure and Metric
1. Euclidean distance
2. Metric
3. Commensurability
4. Normalisation
5. Weighted Distances
6. Sample covariance
7. Sample correlation coefficient
8. Mahalanobis distance
9. Normalised distance and Cluster Separation (see supplementary text)
10. Generalised Minkowski
2.4 Distance Measure and Metric
Mahalanobis distance
2.4 Distance Measure and Metric
8. Mahalanobis distance
The Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936.
It is based on correlations between variables by which different patterns can be identified and analysed. It is a useful way of determining similarity of an unknown sample set to a known one.
It differs from Euclidean distance in that it takes into account the correlations of the data set.
2.4 Distance Measure and Metric
8. Mahalanobis distance
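The Mahalanobis distance from a group of values with mean $\mu = (\mu_1, \ldots, \mu_K)^{T}$ and covariance matrix Σ for a multivariate vector $x = (x_1, \ldots, x_K)^{T}$ is defined as:
$$D_M(x) = \sqrt{(x - \mu)^{T} \Sigma^{-1} (x - \mu)}$$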
2.4 Distance Measure and Metric
8. Mahalanobis distance
The Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with covariance matrix Σ:
$$d(x, y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)}$$
2.4 Distance Measure and Metric
8. Mahalanobis distance
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, the resulting measure is called the normalized Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{K} \frac{(x_i - y_i)^{2}}{\sigma_i^{2}}}$$
where σi is the standard deviation of the xi over the sample set.
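The definitions above can be checked numerically with a short, dependency-free Python sketch. It is an illustration rather than a reference implementation: the helper `solve_linear` (Gaussian elimination with partial pivoting), the function name `mahalanobis` and the example covariance matrices are assumptions made here. In practice a numerical library would normally be used instead.

```python
import math


def solve_linear(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    z = [0.0] * n
    for row in reversed(range(n)):  # back substitution
        tail = sum(m[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (m[row][n] - tail) / m[row][row]
    return z


def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x - y)^T cov^{-1} (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve_linear(cov, diff)  # z = cov^{-1} (x - y), without forming the inverse
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


identity = [[1.0, 0.0], [0.0, 1.0]]
diagonal = [[4.0, 0.0], [0.0, 1.0]]    # variances 4 and 1, no correlation
correlated = [[2.0, 1.0], [1.0, 2.0]]  # positively correlated components

print("identity covariance:  ", mahalanobis([3.0, 4.0], [0.0, 0.0], identity))
print("diagonal covariance:  ", mahalanobis([2.0, 0.0], [0.0, 0.0], diagonal))
print("correlated covariance:", mahalanobis([1.0, 1.0], [0.0, 0.0], correlated))
```

With the identity matrix the result equals the Euclidean distance, and with a diagonal matrix each coordinate difference is divided by its standard deviation, as stated above.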
2.4 Distance measures and Metric
8. Mahalanobis distance
2.4 Distance measures and Metric
8. Mahalanobis distance
2.4 Distance measures and Metric
8. Mahalanobis distance
2.5 Distortions in Data Sets
1. Outliers
2. Variance
3. Sampling effects
2.6 Pre-processing data with mathematical transformations
2.7 Data Quality
• Data quality of individual measurements [GIGO]
• Data quality of Data collections