CLASSIFICATION / DISCRIMINATION

LECTURE 15
What is Discrimination or Classification?
• Consider an example where we have two populations P1 and P2, distributed as N(μ1, Σ1) and N(μ2, Σ2) respectively.
• A new observation x is made, and it is known to come from one of these two populations.
• The task of a discriminant function is to determine a “rule” for deciding which of the two populations x is most likely to have come from.
• How we come up with a rule is what we need to study.
Supervised Learning
• In computer science this is known as SUPERVISED learning.
• Essentially, we know the class labels ahead of time.
• What we need to do is find a RULE, using features in the data, that DISCRIMINATES effectively between the classes.
• Then, if we have a new observation with its features, we can correctly classify it.
Example 1
• Suppose you are a doctor considering two different anesthetics for a patient.
• You have some information about the patient: gender, age, and some medical-history variables.
• So what we need is a data set where we have patient information and whether or not each anesthetic was SAFE for that patient.
• What you want to do is, USING the available variables, build a MODEL or RULE that says whether anesthetic A or B is better for the patient.
• Then use this rule to decide whether to give the new patient A or B.
Example 2: Turkey Thief
• There was a legal case in Kansas where a turkey farmer accused his neighbor of stealing turkeys from the farm.
• When the neighbor was arrested and the police looked in his freezer, there were multiple frozen turkeys there.
• The accused claimed these were WILD turkeys that he had caught.
• A statistician was called in to give evidence, as there are some biological differences between domestic and wild turkeys.
• A biologist measured the bones and other body characteristics of domestic and WILD turkeys, and the statistician built a DISCRIMINANT function.
• They used the classification function to see whether the turkeys in the freezer fell into the WILD or DOMESTIC class.
• THEY ALL fell in the DOMESTIC classification!
The Idea
• USING knowledge of the classes we build the FUNCTION.
• We want to minimize misclassification error.
• Question: should we use ALL the data to build the MODEL? If we do, we really do not have a good way to estimate the misclassification probabilities.
• Generally, separate training and testing sets are used.
Some common Statistical Rules
• Suppose we want to classify between two multivariate normal populations: P1 with parameters μ1 and Σ1, and P2 with parameters μ2 and Σ2.
• Suppose a new observation vector x is known to come from P1 or P2.
• There are various statistical rules that allow us to PREDICT which population x most likely came from.
1. Likelihood Rule
Choose P1 if L(x; μ1, Σ1) > L(x; μ2, Σ2), else choose P2.
Here, x is the observation vector.
This is a natural mathematical rule and is reasonable under the assumption of normality.
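As a minimal sketch of this rule in R, assuming the mvtnorm add-on package and made-up parameter values (mu1, Sigma1, etc. are purely illustrative):

```r
# Likelihood rule sketch: compare the two multivariate normal densities.
library(mvtnorm)

mu1 <- c(0, 0);  Sigma1 <- diag(2)   # illustrative parameters of P1
mu2 <- c(2, 2);  Sigma2 <- diag(2)   # illustrative parameters of P2
x   <- c(0.5, 1.0)                   # new observation vector

L1 <- dmvnorm(x, mean = mu1, sigma = Sigma1)  # likelihood under P1
L2 <- dmvnorm(x, mean = mu2, sigma = Sigma2)  # likelihood under P2
ifelse(L1 > L2, "P1", "P2")                   # choose the more likely population
```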
2. Linear Discriminant Function (LDA) Rule
Choose P1 if b’x – k > 0 and P2 otherwise.
Here b = Σ⁻¹(μ1 − μ2) and k = (1/2)(μ1 − μ2)′Σ⁻¹(μ1 + μ2). The function b′x is called the linear discriminant function.
This assumes equal covariance matrices: Σ1 = Σ2 = Σ.
It’s a single linear function of x that summarizes all the information in x.
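A minimal sketch of this rule in R, with illustrative values for μ1, μ2, and the common Σ:

```r
# LDA rule sketch: compute b and k, then classify by the sign of b'x - k.
mu1 <- c(0, 0); mu2 <- c(2, 2)                 # illustrative means
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)       # illustrative common covariance

b <- solve(Sigma, mu1 - mu2)                   # b = Sigma^{-1} (mu1 - mu2)
k <- 0.5 * sum(b * (mu1 + mu2))                # k = (1/2)(mu1 - mu2)' Sigma^{-1} (mu1 + mu2)

x <- c(0.5, 1.0)                               # new observation
ifelse(sum(b * x) - k > 0, "P1", "P2")         # choose P1 if b'x - k > 0
```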
3. Mahalanobis Distance Rule
Choose P1 if d1 < d2,
where di = (x − μi)′Σ⁻¹(x − μi) for i = 1, 2.
The function di measures how far x is from μi, taking the variance-covariance structure into account.
This assumes equal covariance matrices: Σ1 = Σ2 = Σ.
Under normality and equal covariance, the likelihood rule is equivalent to this rule.
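A minimal sketch using the built-in mahalanobis() function, which returns exactly this squared distance; the parameter values are again illustrative:

```r
# Mahalanobis distance rule sketch.
mu1 <- c(0, 0); mu2 <- c(2, 2)                 # illustrative means
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)       # illustrative common covariance
x <- c(0.5, 1.0)                               # new observation

d1 <- mahalanobis(x, center = mu1, cov = Sigma)  # (x - mu1)' Sigma^{-1} (x - mu1)
d2 <- mahalanobis(x, center = mu2, cov = Sigma)  # (x - mu2)' Sigma^{-1} (x - mu2)
ifelse(d1 < d2, "P1", "P2")                      # choose the closer population
```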
4. Posterior probability rule
Choose P1 if P(P1|x) > P(P2|x), where P(Pi|x) = exp[(−1/2)di] / { exp[(−1/2)d1] + exp[(−1/2)d2] }.
• Also assumes equal covariance.
• Not a true probability, since {x from P1} is not a random event: the observation belongs to either P1 or P2.
• Gives an idea of how confident we are in our effort to discriminate.
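Continuing the Mahalanobis sketch above (reusing the squared distances d1 and d2 computed there), the posterior-style quantities are:

```r
# Posterior probability rule sketch, built from the squared distances d1, d2.
post1 <- exp(-0.5 * d1) / (exp(-0.5 * d1) + exp(-0.5 * d2))
c(P1 = post1, P2 = 1 - post1)   # choose the population with the larger value
```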
Caveats
Generally μi and Σi are not known, and we use sample values.
Under equal covariance all 4 rules are equivalent in terms of discrimination between groups.
Also in general we have more than 2 populations to discriminate the observations into.
Sample Discriminant Rules
• Since we never know the parameters μ1, μ2, Σ1, Σ2, we use sample estimates (generally the MLE-type estimates below) and form discriminant rules as given before.
• μ1 and μ2 are estimated by the sample means x̄1 and x̄2, and Σ1 and Σ2 by the sample covariance matrices S1 and S2.
• Under equal covariance we use the pooled estimate:
  S_pooled = [ (N1 − 1) S1 + (N2 − 1) S2 ] / (N1 + N2 − 2)
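A minimal sketch of these estimates in R, with simulated data standing in for real samples from P1 and P2:

```r
# Sample estimates sketch: X1 and X2 are (observations x variables) matrices.
set.seed(1)
X1 <- matrix(rnorm(50 * 2, mean = 0), ncol = 2)   # N1 = 50 simulated observations
X2 <- matrix(rnorm(40 * 2, mean = 2), ncol = 2)   # N2 = 40 simulated observations

xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)      # estimates of mu1, mu2
S1 <- cov(X1); S2 <- cov(X2)                      # estimates of Sigma1, Sigma2
N1 <- nrow(X1); N2 <- nrow(X2)

S_pooled <- ((N1 - 1) * S1 + (N2 - 1) * S2) / (N1 + N2 - 2)
```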
Estimating Probability of Misclassification
• 1. Re-substitution Estimates:
Apply the discriminant function to the same data used to develop the rule and see how well it discriminates.
This USES the SAME data to build and validate the model, so the error estimate tends to be optimistic.
Holdout Data:
Keep part of the data out of the part used to construct the rule, then apply the rule to the held-out part and see how well it performs.
Problem: if you don’t have a lot of samples, it’s not the most efficient use of the data for building the model.
Cross Validation:
Remove one observation at a time from the set, construct the rule from the remaining observations, and predict the removed one; do this for the first observation, then the second, the third, and so on.
Summarize the misclassification of each data point in a summary (confusion) matrix.
Also called jack-knifing.
• Obviously, a rule that classifies correctly a HIGHER proportion of the time is preferred. A minimal sketch is given below.
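A leave-one-out sketch in R, using lda() from the MASS package (introduced later in these slides) on the built-in iris data, purely for illustration:

```r
# Leave-one-out cross-validation sketch with lda() on the iris data.
library(MASS)

n <- nrow(iris)
pred <- character(n)
for (i in 1:n) {
  fit <- lda(Species ~ ., data = iris[-i, ])      # rule built WITHOUT observation i
  pred[i] <- as.character(predict(fit, iris[i, ])$class)
}
table(predicted = pred, true = iris$Species)      # summary (confusion) matrix
mean(pred != iris$Species)                        # estimated misclassification rate
```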
The Issue for Microarrays (MA)
• Often it is known in advance WHERE the samples come from and what conditions they have been exposed to.
• In fact we are often interested in gene expression profiles to distinguish between different conditions or classes.
• In the past, schemes such as voting were used to assess class membership in MAs.
• MANY, MANY methods are available, but the general consensus is that a few of them have robust performance, e.g. the Linear Discriminant Function (LDA) and k-Nearest Neighbors (k-NN).
Cost Function and Prior Probabilities
• When there are only two populations, all four rules discussed earlier have the property that the probability of misclassifying from P1 into P2 is the same as from P2 into P1.
• This is NOT generally a good idea, especially in our anesthetic example. The idea is: if you are going to have to err, err on the side of caution.
• Hence we need to take into account the COST of misclassification.
Some Math Details
• Define U = b’x-k from LDA.• U=( 1- 2m m )’S-1x - .5 ( 1- 2m m )’S-1 ( 1+ 2m m )• Under Normality and equal variance, • if x comes from P1, U ~ N(d,d) • and if x comes from P2, U ~ N(-d,d)
• Where d =( 1- 2m m )’S-1 ( 1- 2m m )
• And our Rule for LDA is P1 if U > 0 and p2 otherwise.• To make it asymmetric you can use a rule U > u where we can pick the
probability of misclassifying into one of the populations at most a fixed number say alpha.
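For instance, using the distribution of U under P2 given above (a sketch, assuming the N(−d/2, d) form; the value of d is illustrative):

```r
# Choose the cutoff u so that P(U > u | P2) = alpha, i.e. the chance of
# misclassifying a P2 observation into P1 is held at alpha.
d <- 4          # illustrative value of (mu1 - mu2)' Sigma^{-1} (mu1 - mu2)
alpha <- 0.05
u <- -d/2 + qnorm(1 - alpha) * sqrt(d)   # standardize U under P2 and invert
u
```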
A General Rule
• Define the cost function C(i|j): the cost of misclassifying an observation from Pj into Pi.
• Define the prior probability pi for the i-th group.
• Average cost of misclassification (two groups):
  p1 C(2|1) P(2|1) + p2 C(1|2) P(1|2)
• Bayes rule: choose P1 if p1 f(x; θ1) C(2|1) > p2 f(x; θ2) C(1|2).
• Observe that if p1 = p2 and C(2|1) = C(1|2), this reduces to the likelihood rule.
• Under normality and equal covariance it reduces to: choose P1 if d1* < d2*,
  where di* = (1/2)(x − μi)′Σ⁻¹(x − μi) − log(pi · C(j|i)), j ≠ i.
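A minimal sketch of this cost- and prior-adjusted rule as stated above; the means, covariance, priors, and costs below are made-up values for illustration:

```r
# Cost/prior-adjusted classification sketch.
mu1 <- c(0, 0); mu2 <- c(2, 2)
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
x <- c(0.5, 1.0)
p1 <- 0.7; p2 <- 0.3            # illustrative prior probabilities
C21 <- 5; C12 <- 1              # illustrative costs C(2|1) and C(1|2)

d1s <- 0.5 * mahalanobis(x, mu1, Sigma) - log(p1 * C21)   # d1*
d2s <- 0.5 * mahalanobis(x, mu2, Sigma) - log(p2 * C12)   # d2*
ifelse(d1s < d2s, "P1", "P2")   # choose P1 if d1* < d2*
```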
Probabilistic Classification Theory (PCT)
• Most classification methods can be described as special implementations of Bayes classifiers. The decision rule for classifying x into one of the classes P1, …, Pk depends upon:
  – prior information about the class frequencies p1, …, pk;
  – information about how class membership affects the gene expression profiles xi (i = 1, …, n);
  – misclassification costs C(j|i) of classifying an observation that belongs to class Pi into Pj.
• Our aim is to find a classification rule R that minimizes the expected classification cost.
PCT II: Bayes Rule
• Recall the cost of misclassification is given by:
  C(j|i) = 0 if i = j,
  C(j|i) = Ci if i ≠ j (generally Ci is set to 1).
• Result: with these costs, the classification rule that minimizes the expected misclassification cost is the one that maximizes the posterior probability:
  R(x) = arg max_c P(C = c | x) = arg max_c P(x | C = c) pc
• This is called the Bayes rule.
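A minimal k-class sketch of this rule, assuming the mvtnorm package and illustrative means, priors, and a common covariance:

```r
# Bayes rule sketch for k classes with unit costs: pick the class with the
# largest log(prior) + log(likelihood).
library(mvtnorm)

mus    <- list(c(0, 0), c(2, 2), c(0, 3))   # illustrative class means
Sigma  <- diag(2)                           # illustrative common covariance
priors <- c(0.5, 0.3, 0.2)                  # illustrative class frequencies p1..pk
x      <- c(1, 1)                           # observation to classify

scores <- sapply(seq_along(mus), function(k)
  log(priors[k]) + dmvnorm(x, mean = mus[[k]], sigma = Sigma, log = TRUE))
which.max(scores)                           # R(x): the chosen class
```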
PCT III: Prior Information
• Hence the idea is: IF we know the probabilities of class membership pc and the conditional probability of the data given the class, P(x|C), we can find the optimal classification rule.
• In general it is VERY difficult to KNOW the prior information about class membership.
• To find P(x|C), the likelihood of the data, we often use the normal distribution (log-transforming gene expression so it is approximately normal). This is done on the training set.
Steps in Discriminant Analysis in MA
• Selection of features
• Model fitting
• Model validation
Selection of Features
Selecting a set of genes. We do not want all the genes, since using all of them tends to over-fit the data and also causes singular covariance matrices.
How to select genes (gene filtering):
– Use ONLY differentially expressed genes, found using an ANOVA-type model: xi = α·C(xi) + ei.
– Look at multiple genes or gene groups: do PCA on all the genes. Not very efficient.
– Partial least squares (PLS): finds orthogonal linear combinations that maximize Cov(Xl, y).
– Do PCA and then rank the components by the ratio of between-class to within-class variance.
– Other methods: projection pursuit, etc.
Most common: differential expression filtering or PLS.
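A sketch of the most common approach, a per-gene differential-expression filter, with a simulated expression matrix standing in for real microarray data:

```r
# Gene filtering sketch: one t-test per gene, keep the top genes by p-value.
set.seed(1)
X <- matrix(rnorm(20 * 1000), nrow = 20)   # 20 samples x 1000 simulated "genes"
y <- factor(rep(c("A", "B"), each = 10))   # two classes

pvals <- apply(X, 2, function(g) t.test(g ~ y)$p.value)
top_genes <- order(pvals)[1:10]            # indices of the 10 most significant genes
```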
MODEL FITTING
• Commonly used:• LDA• K Nearest Neighbor
• Other related• DLDA (Diagonal LDA)• RDA (Regularized DA) (there is a R package for this) • PAM (Prediction Analysis for Microarrays) (there is a R package
for this)• FDA (Flexible DA)
Validation
• See how well the classifiers classify the observations into the different classes.
• Most commonly used method: leave-one-out cross-validation.
• Though test data sets (holdout samples) and re-substitution are still used.
Linear Discriminant Analysis (LDA)
• An easy, useful method.
• Has been found to be robust in MA.
• Idea: the main assumption is that the class densities can be written as multivariate normals.
• In R one uses lda in the MASS library.
• Hence:
  – P(x | C = k) = MVN(μk, Σ), with a common covariance matrix Σ.
  – Maximize: P(C = k | x) = P(x | C = k) pk / Σ_j P(x | C = j) pj
  – If the feature set is known then this is fairly straightforward; otherwise one has to use some technique (forward, backward, or step-wise selection) for feature selection.
K-Nearest Neighbor (kNN)
• Assumption: samples with almost the same features should belong to the same class. In other words, given a set of genes (g1, …, gm) known to be important for class membership, the kNN classifier assigns an unclassified sample to the class prevalent among the k samples whose expression values for the m genes are closest to those of the sample of interest.
• Typically the profile for sample j is compared to the other profiles using Euclidean distance (however, any other distance, such as Manhattan or correlation, can be useful as well).
• The aim of kNN is to estimate the posterior probability P(C(X) = j | X = x) of a profile belonging to a class directly.
• For a particular k, it estimates this probability as the relative fraction of samples that belong to class j among the k samples with the most similar profiles.
• It is essentially a non-linear classifier and may have VERY irregular decision boundaries.
lda example from R
> Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
+                    Sp = rep(c("s","c","v"), rep(50,3)))
> train <- sample(1:150, 75)
> table(Iris$Sp[train])

 c  s  v
27 24 24
> ## your answer may differ
> ##  c  s  v
> ## 22 23 30
Running lda
> z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
> predict(z, Iris[-train, ])$class
 [1] s s s s s s s s s s s s s s s s s s s s s s s s s s c c c c c c c c c c c c
[39] c c c c c c c c c c c v v v v c v v v v v v v v v v v c v v c v v v v v v
Levels: c s v
Contd…
> (z1 <- update(z, . ~ . - Petal.W.))
Call:
lda(Sp ~ Sepal.L. + Sepal.W. + Petal.L., data = Iris, prior = c(1,
    1, 1)/3, subset = train)

Prior probabilities of groups:
        c         s         v
0.3333333 0.3333333 0.3333333
Contd…
Group means:
   Sepal.L. Sepal.W. Petal.L.
c  5.955556 2.781481 4.359259
s  5.008333 3.450000 1.429167
v  6.637500 2.983333 5.629167

Coefficients of linear discriminants:
                LD1         LD2
Sepal.L.  0.9045765 -0.07677002
Sepal.W.  0.7347963  2.58009411
Petal.L. -3.1529282  0.37700694

Proportion of trace:
   LD1    LD2
0.9939 0.0061
knn
> library(class)
> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
> knn(train, test, cl, k = 3, prob=TRUE)
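A natural follow-up, not in the original output, is to compare the predictions with the known test labels (which here equal cl) and to inspect the vote proportions that knn() stores in its "prob" attribute:

```r
# Hypothetical follow-up to the slide's example.
pred <- knn(train, test, cl, k = 3, prob = TRUE)
table(predicted = pred, true = cl)   # confusion matrix for the holdout half
attr(pred, "prob")[1:5]              # fraction of the k neighbours voting for the winner
```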