Introduction to data mining
G. Marcou+
+Laboratoire d'infochimie, Université de Strasbourg, 4, rue Blaise Pascal, 67000 Strasbourg
Motivation of data mining
Automatically discover useful information in large data repositories.
Extract patterns from experience.
Predict outcome of future observations.
Learning: given a set of tasks, a performance measure on those tasks, and experience, performance on the set of tasks increases as experience increases.
Organisation of data
Datasets are organized as instances/attributes.
Instances (synonyms: data points, entries, samples, ...)
Attributes (synonyms: factors, variables, measures, ...)
Nature of data
Attributes can be:
Numeric, e.g. atom counts: O=1, Cl=4, N=6, S=3
Nominal, e.g. molecule names: (1-methyl)(1,1,1-tributyl)azanium, tetrahexylammonium
Continuous, e.g. molecular surface
Categorical, e.g. phase state: solid, amorphous, liquid, gas, ionized
Ordered, e.g. intestinal absorption: not absorbed, mildly absorbed, completely absorbed
Ranges, e.g. spectral domains: UV, visible, IR
Hierarchical, e.g. EC numbers:
EC 1. Oxidoreductases; EC 2. Transferases; EC 3. Hydrolases; EC 4. Lyases; EC 5. Isomerases; EC 6. Ligases
EC 6.1 Forming Carbon-Oxygen Bonds; EC 6.2 Forming Carbon-Sulfur Bonds; EC 6.3 Forming Carbon-Nitrogen Bonds; EC 6.4 Forming Carbon-Carbon Bonds; EC 6.5 Forming Phosphoric Ester Bonds; EC 6.6 Forming Nitrogen-Metal Bonds
Nature of learning
Unsupervised learning: clustering, rules
Supervised learning: classification, regression
Other: reinforcement learning, first-order logic (e.g. ∀x, ∃y: xRy)
A concept is the target function to be learned.
A concept is learned from: attribute-value data, relations, sequences, spatial data
Concept in data mining
[Figure: two databases, DB1 and DB2, each listing instances (Instance 1, Instance 2, Instance 3, ...)]
Machine Learning and Statistics
Statistician's point of view (deduction):
Datasets are the expression of underlying probability distributions.
Datasets validate or invalidate prior hypotheses.
Data miner's point of view (induction):
Any hypothesis compatible with the dataset is useful.
Search for all hypotheses compatible with the dataset.
Validation in Data Mining
Validation means that a model is built on a training set of data and then applied on a test set of data.
Success and failure on the test set must be estimated.
The estimate is supposed to be representative of any new situation.
Every model must be validated.
Training/Test
Split the dataset in two parts:
One part is the training set
The other is the test set
Bootstrapping
Draw N instances with replacement from the dataset
Create a training set with these instances
Use the dataset as the test set
Cross-Validation
Split the dataset in N subsets
Use each subset as a test set while all others form a training set
Scrambling
Reassign at random the classes to the instances.
Success and failure are estimated on the scrambled data.
The goal is to estimate how good a success measurement can be obtained by pure chance.
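The four validation schemes above can be sketched as follows. This is a minimal illustration: the function names are my own, and plain index lists stand in for real instances.

```python
# Sketches of the four validation schemes, operating on a list of
# instance indices 0..N-1 (stand-ins for the actual instances).
import random

def train_test_split(indices, test_fraction=0.3, seed=0):
    """Training/Test: split the dataset in two parts."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # training set, test set

def bootstrap(indices, seed=0):
    """Bootstrapping: draw N instances with replacement as the training
    set; the whole dataset serves as the test set."""
    rng = random.Random(seed)
    training = [rng.choice(indices) for _ in indices]
    return training, indices

def cross_validation_folds(indices, n_folds=5):
    """Cross-validation: split into N subsets; each subset is a test set
    while all the others form the training set."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        yield training, test

def scramble(classes, seed=0):
    """Scrambling: reassign the classes to the instances at random."""
    rng = random.Random(seed)
    scrambled = classes[:]
    rng.shuffle(scrambled)
    return scrambled
```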
Clustering
Search for an internal organization of the data
Optimizes relations between instances relative to an objective function.
Typical objective functions: separation, coherence, density, contiguity, concept
Cluster Evaluation
Essential, because any dataset can be clustered but not every clustering is meaningful.
Evaluation can be: unsupervised, supervised, or relative.
Unsupervised Cluster evaluation
Cohesion and separation
Silhouette
Proximity matrix and cophenetic correlation
Clustering tendency: for p sampled instances, compare the nearest-neighbor distances (NND) between instances (ωi) with the NND from p random points to the instances (ui). This is the Hopkins statistic, H = Σ ui / (Σ ui + Σ ωi).
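The clustering-tendency measure can be sketched as below. This is a minimal illustration assuming points are plain lists of floats; the function name `hopkins` and the uniform sampling box (the data's bounding box) are my own choices.

```python
# Sketch of a clustering-tendency (Hopkins-style) statistic: compare
# nearest-neighbour distances of p real instances (w_i) with those of
# p uniformly random points (u_i) drawn in the data's bounding box.
import random, math

def hopkins(data, p=10, seed=0):
    rng = random.Random(seed)
    dims = len(data[0])
    lo = [min(x[d] for x in data) for d in range(dims)]
    hi = [max(x[d] for x in data) for d in range(dims)]

    def nnd(point, others):
        # nearest-neighbour distance from point to the given set
        return min(math.dist(point, o) for o in others)

    sample = rng.sample(data, p)
    w = [nnd(s, [x for x in data if x is not s]) for s in sample]
    rand_pts = [[rng.uniform(lo[d], hi[d]) for d in range(dims)]
                for _ in range(p)]
    u = [nnd(r, data) for r in rand_pts]
    # H near 1: clustered; near 0.5: random; near 0: regularly spaced
    return sum(u) / (sum(u) + sum(w))
```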
Supervised cluster evaluation
For cluster i and class j, with Ni the number of members of cluster i, Nj the number of instances of class j, and Nij the number of instances of class j in cluster i:
pij = Nij / Ni
Precision(i,j) = Nij / Ni, the fraction of cluster i belonging to class j
Recall(i,j) = Nij / Nj, the fraction of class j recovered by cluster i
The slide's example, Precision(3,1) and Recall(3,1), compares Cluster 3 against Class 1.
Relative analysis
Compare two clusterings.
Supervised cluster analysis is a special case of relative analysis: the reference clustering is the set of classes.
Count instance pairs by agreement between the two clusterings:
N11: number of instance pairs in the same cluster for both clusterings
N00: number of instance pairs in different clusters for both clusterings
N10: number of instance pairs in the same cluster for the first clustering and in different clusters for the second
N01: number of instance pairs in different clusters for the first clustering and in the same cluster for the second
Rand statistic: (N11 + N00) / (N00 + N01 + N10 + N11)
Jaccard statistic: N11 / (N01 + N10 + N11)
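Counting the four pair categories and forming both statistics can be sketched as follows (function names are my own; clusterings are represented as lists of cluster labels, one per instance):

```python
# Relative analysis: count instance pairs by agreement between two
# clusterings, then form the Rand and Jaccard statistics.
from itertools import combinations

def pair_counts(clustering1, clustering2):
    n00 = n01 = n10 = n11 = 0
    for i, j in combinations(range(len(clustering1)), 2):
        same1 = clustering1[i] == clustering1[j]
        same2 = clustering2[i] == clustering2[j]
        if same1 and same2:
            n11 += 1          # same cluster in both clusterings
        elif same1:
            n10 += 1          # together in the first only
        elif same2:
            n01 += 1          # together in the second only
        else:
            n00 += 1          # different clusters in both
    return n00, n01, n10, n11

def rand_statistic(c1, c2):
    n00, n01, n10, n11 = pair_counts(c1, c2)
    return (n11 + n00) / (n00 + n01 + n10 + n11)

def jaccard_statistic(c1, c2):
    n00, n01, n10, n11 = pair_counts(c1, c2)
    return n11 / (n01 + n10 + n11)
```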
A simple clustering algorithm: k-mean
1. Select k points as centroids
2. Form k clusters: each point is assigned to its closest centroid
3. Reset each centroid to the (geometric) center of its cluster
4. Repeat from step 2 until no change is observed
5. Repeat from step 1 until stable average clusters are obtained
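The k-means steps above can be sketched as follows. This is a minimal illustration assuming points are lists of floats; initial centroids are drawn from the dataset, and step 5 (restarting) is left to the caller.

```python
# k-means sketch following steps 1-4 above.
import random, math

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 1. select k points as centroids
    assignment = None
    for _ in range(max_iter):
        # 2. form k clusters: assign each point to its closest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:       # 4. stop when no change is observed
            break
        assignment = new_assignment
        # 3. reset each centroid to the center of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids
```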
Classification
Definition: assign one or several objects to predefined categories.
The target function maps a set of attributes x to a set of classes y.
Learning scheme: supervised, attribute-value learning.
Goal: predict the outcome of future observations.
Probabilities basics
Conditional probability: the probability of realization of event A knowing that B has occurred, P(A|B) = P(A ∩ B) / P(B).
Independence of random events: A and B are independent when P(A ∩ B) = P(A) × P(B).
The Bayes equation for independent events xi: P(A|x1,…,xn) = P(A) × Πi P(xi|A) / P(x1,…,xn)
Statistical approach to classification
Estimate the probability of an instance {x1, x2} being of Class 1 or Class 2.
The probability that an instance {x1, x2, …} belongs to class A is difficult to estimate directly: poor statistics.
Consider the Bayes equation, with the naive assumption that {x1, x2, …} are independent:
P(A|x1,x2,…) = P(A) × Πi P(xi|A) / P(x1,x2,…)
posterior probability = prior probability × likelihood / evidence
The prior probability, the evidence and the likelihood have better estimates: good statistics.
This is the Naive Bayes assumption.
The Naive Bayes Classifier
1. Estimate the prior probability, P(A), for each class.
2. Estimate the likelihood, P(x|A), of each attribute for each class.
3. For a new instance, estimate the Bayes score for each class: Score(A) = P(A) × Πi P(xi|A)
4. Assign the instance to the class which possesses the highest score.
The value of C can be optimized.
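The four steps above can be sketched as a small classifier over nominal attributes. This is an illustration, not the slide's exact formulation: the class name is my own, and the +1 (Laplace) smoothing in the likelihood is an added assumption to avoid zero probabilities.

```python
# Naive Bayes sketch for nominal attributes, following steps 1-4.
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, instances, classes):
        n = len(classes)
        self.class_counts = Counter(classes)
        # 1. prior probability P(A) for each class
        self.prior = {a: c / n for a, c in self.class_counts.items()}
        # 2. likelihood counts of each attribute value for each class
        self.counts = defaultdict(Counter)   # (attr index, class) -> value counts
        for x, a in zip(instances, classes):
            for i, v in enumerate(x):
                self.counts[(i, a)][v] += 1
        return self

    def score(self, x, a):
        # 3. Bayes score P(A) * prod_i P(x_i|A); the evidence is omitted
        # since it is identical for every class
        s = self.prior[a]
        for i, v in enumerate(x):
            c = self.counts[(i, a)]
            # Laplace-smoothed likelihood (an added assumption)
            s *= (c[v] + 1) / (self.class_counts[a] + len(c) + 1)
        return s

    def predict(self, x):
        # 4. assign the instance to the class with the highest score
        return max(self.prior, key=lambda a: self.score(x, a))
```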
Success and failure: for N instances and a given classifier, define for each class i:
NTP(i), true positives: number of instances of class i correctly classified.
NFP(i), false positives: number of instances incorrectly assigned to class i.
NTN(i), true negatives: number of instances of other classes correctly classified.
NFN(i), false negatives: number of instances of class i incorrectly assigned to other classes.
Confusion matrix: for N instances, K classes and a classifier, Nij is the number of instances of class i classified as class j.
Class1 Class2 … ClassK
Class1 N11 N12 … N1K
Class2 N21 N22 … N2K
… … … … …
ClassK NK1 NK2 … NKK
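The per-class counts NTP(i), NFP(i), NTN(i) and NFN(i) defined above can be read directly off the confusion matrix; a sketch (the function name is my own):

```python
# Derive per-class TP/FP/TN/FN from a K x K confusion matrix conf,
# where conf[i][j] counts instances of class i classified as class j.
def per_class_counts(conf):
    n_total = sum(sum(row) for row in conf)
    counts = []
    for i in range(len(conf)):
        tp = conf[i][i]                        # class i correctly classified
        fn = sum(conf[i]) - tp                 # class i sent to other classes
        fp = sum(row[i] for row in conf) - tp  # other classes assigned to i
        tn = n_total - tp - fn - fp            # everything else
        counts.append({"TP": tp, "FP": fp, "TN": tn, "FN": fn})
    return counts
```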
Classification Evaluation
Global measures of success: estimated over all classes.
Local measures of success: estimated for each class.
Ranking success evaluation: the Receiver Operating Characteristic (ROC) curve plots recall against 1 - specificity; the Area Under the Curve (ROC AUC) summarizes it in a single number.
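The ROC AUC can be sketched for a binary ranking as below: sort instances by decreasing score, trace recall (true positive rate) against 1 - specificity (false positive rate), and integrate with trapezoids. The function name is my own, and score ties are not handled specially in this minimal version.

```python
# ROC AUC sketch for binary labels (1 = positive, 0 = negative).
def roc_auc(scores, labels):
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2   # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
    return auc
```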
Losses and risks: errors on different class predictions have different costs.
What does it cost to mistakenly assign an instance of one class to another?
Key quantities: the normalized expected cost and the probability cost function.
Class1 Class2 … ClassK
Class1 0 C12 … C1K
Class2 C21 0 … C2K
… … … … …
ClassK CK1 CK2 … 0
Cost matrix Cij (in general an asymmetric matrix)
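One simple way to combine a confusion matrix with the cost matrix above is a per-instance average misclassification cost. Note this is my own illustration, not the cost-curve's normalized expected cost itself: it just weights each confusion count Nij by its cost Cij and divides by N.

```python
# Average misclassification cost per instance, given a K x K confusion
# matrix conf[i][j] and a K x K cost matrix cost[i][j] (zero diagonal).
def expected_cost(conf, cost):
    n = sum(sum(row) for row in conf)
    total = sum(conf[i][j] * cost[i][j]
                for i in range(len(conf))
                for j in range(len(conf)))
    return total / n
```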
Cost curve [figure]: the normalized expected cost is plotted against the probability cost function. The plot shows the ideal classifier, the accept-all and reject-all classifiers, an actual classifier (placed using its NFP and NTP), and the worst classifier.
Conclusion
Data mining extracts useful information from datasets.
Clustering: unsupervised; provides information about the organization of the data.
Classification: supervised; builds models in order to predict the outcome of future observations.
Multi-Linear Regression
Fit a line y = ax + b: the coefficients a and b are chosen to minimize the Sum of Squared Errors (SSE).
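Minimizing the SSE for y = ax + b has a closed-form least-squares solution, sketched below (function names are my own):

```python
# Least-squares fit of y = a*x + b minimizing the Sum of Squared Errors.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope: covariance of x,y divided by variance of x
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x      # intercept passes through the means
    return a, b

def sse(xs, ys, a, b):
    # Sum of Squared Errors of the fitted line
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
```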