
Page 1

Supervised Machine Learning

Lecture 7: Computational and Statistical Aspects of Microarray Analysis
June 23, 2005, Bressanone, Italy

Page 2

Common Types of Objectives

• Class Comparison
– Identify genes differentially expressed among predefined classes, such as diagnostic or prognostic groups.
• Class Prediction
– Develop a multi-gene predictor of class for a sample using its gene expression profile.
• Class Discovery
– Discover clusters among specimens or among genes.

Page 3

What is the task?

• Given the gene expression profile, predict the class.
• Mathematical representation: find a function f that maps x to {1,…,K}.
• How do we do this?

Page 4

Possibilities

• Have an expert tell us what genes to look at for being over/under expressed?
– Then we do not really need the microarray.
• Use clustering algorithms?
– Not appropriate for this task…

Page 5

Clustering is not a good tool

Page 6

Simulated data with 4 clusters (specimens 1-10, 11-20, 21-30, 31-40).
A: 450 relevant genes plus 450 “noise” genes. B: 450 relevant genes.

Page 7

Problem with clustering

• Noisy genes will ruin it for the rest.
• How do we know which genes to use?
• We are ignoring useful information in our prototype data: we know the classes!

Page 8

Train an algorithm

• A powerful approach is to train a classification algorithm on the data we collected and propose its use in the future.
• This has worked successfully in many areas: zip-code reading, voice recognition, etc.

Page 9

Clustering is not a good tool

Page 10

Using multiple genes

• How do we combine information from various genes to help us form our discriminant function f?
• There are many methods out there… three examples are LDA, kNN, and SVM.
• Weighted gene voting and PAM were developed for microarrays (but they are just versions of DLDA).

Page 11

Weighted Gene Voting is DLDA

With equal priors, DLDA is (the predicted class is the one minimizing $\delta_k(x)$):

$$\delta_k(x) = \sum_{g=1}^{G} \frac{(x_g - \bar{x}_{kg})^2}{\hat{\sigma}_g^2}$$

With two classes we select class 1 if

$$\sum_{g=1}^{G} \frac{\bar{x}_{1g} - \bar{x}_{2g}}{\hat{\sigma}_g^2}\left(x_g - \frac{\bar{x}_{1g} + \bar{x}_{2g}}{2}\right) \geq 0$$

This can be written as

$$\sum_{g=1}^{G} a_g (x_g - b_g) \geq 0$$

with

$$a_g = \frac{\bar{x}_{1g} - \bar{x}_{2g}}{\hat{\sigma}_g^2}, \qquad b_g = \frac{\bar{x}_{1g} + \bar{x}_{2g}}{2}$$

Weighted Gene Voting simply uses

$$a_g = \frac{\bar{x}_{1g} - \bar{x}_{2g}}{\hat{\sigma}_{1g} + \hat{\sigma}_{2g}}$$

Notice that the units and scale of the sum are then wrong!
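To make the difference concrete, here is a minimal sketch (Python/NumPy; the function and variable names are hypothetical, not from the slides) of the two weighting schemes for the two-class rule above:

```python
import numpy as np

def two_class_weights(X, y):
    """Compute (a_dlda, a_voting, b) for the rule: class 1 if sum_g a_g * (x_g - b_g) >= 0.

    X: (n_samples, n_genes) expression matrix; y: labels in {1, 2}.
    """
    X1, X2 = X[y == 1], X[y == 2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled per-gene variance: DLDA's diagonal-covariance assumption
    pooled_var = (((X1 - m1) ** 2).sum(axis=0) + ((X2 - m2) ** 2).sum(axis=0)) / (len(X) - 2)
    a_dlda = (m1 - m2) / pooled_var                              # DLDA weights
    a_voting = (m1 - m2) / (X1.std(axis=0) + X2.std(axis=0))     # weighted gene voting weights
    b = (m1 + m2) / 2.0
    return a_dlda, a_voting, b

def predict_class(a, b, x_new):
    """Apply the linear rule to a new expression profile x_new."""
    return 1 if np.sum(a * (x_new - b)) >= 0 else 2
```

The only difference between the two rules is the denominator of a_g: a variance for DLDA versus a sum of standard deviations for weighted gene voting, which is exactly the mismatch in units and scale noted above.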

Page 12

kNN

• Another simple and useful method is k nearest neighbors.
• It is very simple.

Page 13

Example

Page 14

Too many genes

• A problem with most existing approaches: they were not developed for p >> n.
• A simple way around this is to filter genes first: pick genes that, marginally, appear to have good predictive power.
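One common marginal filter (a sketch, not a procedure prescribed by the slides) ranks genes by a two-sample t-statistic computed on the training samples and keeps the top-ranked ones:

```python
import numpy as np

def top_genes_by_t(X, y, n_keep=50):
    """Rank genes by absolute two-sample t-statistic (Welch form) and keep n_keep.

    X: (n_samples, n_genes) training expression matrix; y: labels in {1, 2}.
    Returns the column indices of the selected genes.
    """
    X1, X2 = X[y == 1], X[y == 2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    v1, v2 = X1.var(axis=0, ddof=1), X2.var(axis=0, ddof=1)
    t = (m1 - m2) / np.sqrt(v1 / len(X1) + v2 / len(X2))
    return np.argsort(-np.abs(t))[:n_keep]
```

As the later slides stress, this selection step must be rerun inside every cross-validation fold, not once on the full data set.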

Page 15

Beware of over-fitting

• With p >> n you can always find a prediction algorithm that predicts perfectly on the training set.
• Also, many algorithms can be made too flexible. An example is kNN with k = 1.

Page 16

Example

Page 17

Split-Sample Evaluation

• Training set
– Used to select features, select the model type, and determine parameters and cut-off thresholds.
• Test set
– Withheld until a single model is fully specified using the training set.
– The fully specified model is applied to the expression profiles in the test set to predict class labels.
– The number of errors is counted.

Note: also called cross-validation.

Page 18

Important

• You have to apply the entire algorithm, from scratch, on the training set.
• This includes the choice of feature genes, and in some cases normalization!

Page 19

Example

[Figure: proportion of simulated data sets (0.00-1.00) versus the number of misclassifications (0-20), for three strategies: no cross-validation (resubstitution method), cross-validation after gene selection, and cross-validation prior to gene selection.]

Page 20

Keeping yourself honest

• Cross-validation (CV).
• Try the algorithm out on reshuffled data.
• Try it out on completely random data.

Page 21

Conclusions

• Clustering algorithms are not appropriate.
• Do not reinvent the wheel! Many methods are available… but they need feature selection (PAM does it all in one step!).
• Use cross-validation to assess performance.
• Be suspicious of new complicated methods: simple methods are already too complicated.

Page 22

Class Prediction Model

• Given a sample with an expression profile vector x of log-ratios or log signals and unknown class, predict which class the sample belongs to.
• The class prediction model is a function f which maps from the set of vectors x to the set of class labels {1, 2} (if there are two classes).
• f generally utilizes only some of the components of x (i.e., only some of the genes).
• Specifying the model f involves specifying some parameters (e.g., regression coefficients) by fitting the model to the data (learning from the data).

Page 23

Do Not Confuse Statistical Methods Appropriate for Class Comparison with Those Appropriate for Class Prediction

• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy.
• Demonstrating goodness of fit of a model to the data used to develop it is not a demonstration of predictive accuracy.
• Statisticians are used to inference, not prediction.
• Most statistical methods were not developed for p >> n prediction problems.
• But some are…

Page 24

Components of Class Prediction

• Feature (gene) selection
– Which genes will be included in the model.
• Selecting the model type
– e.g., DLDA, nearest-neighbor, …
• Fitting the parameters (regression coefficients) of the model.
• Assessing the model.

Page 25

Feature Selection

• A key component of supervised analysis.
• Use the genes that are univariately differentially expressed among the classes at a significance level α (e.g., 0.01).
– The α level is selected to control the number of genes in the model, not to control the false discovery rate.
– Methods for class prediction are different from those for class comparison: the accuracy of the significance test used for feature selection is not of major importance, since identifying differentially expressed genes is not the ultimate objective here.
– For survival prediction, use the genes with significant univariate Cox proportional hazards regression coefficients.

Page 26

Feature Selection

• A small subset of genes which together give the most accurate predictions.
• Many published complex methods for selecting combinations of genes do not appear to have been properly evaluated.

Page 27

Linear Methods

• Many popular methods are linear.
• Linear means that the prediction rule is based on a linear combination of the components of x:
– f(x) = Ax
• Some examples follow.

Page 28

Linear Classifiers for Two Classes

• Fisher linear discriminant analysis (weights based on an assumed multivariate normal distribution of the expression vector in each class, with a common covariance matrix).
• Diagonal linear discriminant analysis (DLDA) additionally assumes the features are uncorrelated.
• The compound covariate predictor and Golub's weighted voting method are variants of DLDA.

Page 29

Linear Classifiers for Two Classes

• Support vector machines with an inner-product kernel are linear classifiers with weights determined to minimize errors.
• Perceptrons with principal components as input are linear classifiers with no well-defined criterion for defining the weights.

Page 30

Nearest Neighbor Classifier

• To classify a sample in the validation set as being in outcome class 1 or outcome class 2, determine which sample in the training set its gene expression profile is most similar to.
– The similarity measure used is based on genes selected as being univariately differentially expressed between the classes.
– Correlation similarity or Euclidean distance is generally used.
• Classify the sample as being in the same class as its nearest neighbor in the training set.
• For a fixed neighborhood size, this turns out to be linear as well.

Page 31

Advantages of Simple Linear Classifiers

• They do not over-fit the data.
– They incorporate the influence of multiple variables without attempting to select the best small subset of variables.
– They do not attempt to model the multivariate interactions among the predictors and the outcome.

Page 32

• When p >> n, a linear classifier can almost always be found which fits the data perfectly.
• Why consider more complex models?
• The full set of linear models is too rich, and selecting a linear model to minimize training errors does not lead to generalizable results when p >> n.

Page 33

Myth

• That complex classification algorithms such as neural networks perform better than simpler methods for class prediction.

Page 34

• Artificial intelligence sells to journal reviewers and peers who cannot distinguish hype from substance when it comes to microarray data analysis.
• Comparative studies have shown that simpler methods work as well or better for microarray problems, because the number of candidate predictors exceeds the number of samples by orders of magnitude.
• Dudoit, Fridlyand and Speed, JASA 2001.

Page 35

• Fitting complex classifiers to training data results in unstable models unless the training dataset is huge.
• For unstable models, the training-set error rate is strongly downward biased as an estimate of the generalization error rate.
• For unstable models, the cross-validated error rate is a highly variable estimate of the generalization error rate.
• Stability can be improved by:
– restricting attention to simpler models;
– including a penalty for complexity in the fitting criterion;
– using a fitting criterion that incorporates robustness to changes in the data.

Page 36

Evaluating a Classifier

• “Prediction is difficult, especially the future.” – Niels Bohr
• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data.

Page 37

Leave-One-Out Cross-Validation

• Leave-one-out cross-validation simulates the process of separately developing a model on one set of data and predicting for a test set of data not used in developing the model.

Page 38

[Figure: the full data set (specimens × log-expression ratios) is split into a training set and a test set.]

Non-cross-validated prediction
1. The prediction rule is built using the full data set.
2. The rule is applied to each specimen for class prediction.

Cross-validated prediction (leave-one-out method)
1. The full data set is divided into training and test sets (the test set contains 1 specimen).
2. The prediction rule is built from scratch using the training set.
3. The rule is applied to the specimen in the test set for class prediction.
4. The process is repeated until each specimen has appeared once in the test set.

Page 39

Common mistake

• Performing cross-validation of the model only after the genes to be used in the model have been selected.

Page 40

• Cross-validation is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation.
• With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that gene selection must be repeated for each leave-one-out training set.
• The cross-validated estimate of misclassification error applies to the model-building process, not to the particular model or the particular set of genes used in the model.
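A minimal sketch of what "from scratch for each leave-one-out training set" means in code; select_genes, fit, and predict are placeholders for whatever feature filter and classifier are used (for example, the t-statistic filter and DLDA rule sketched earlier):

```python
import numpy as np

def loocv_error(X, y, select_genes, fit, predict):
    """Leave-one-out CV in which feature selection is redone inside every fold.

    select_genes(X, y) -> gene indices; fit(X, y) -> model; predict(model, x) -> label.
    Returns the cross-validated misclassification rate.
    """
    n = len(y)
    errors = 0
    for i in range(n):
        train = np.arange(n) != i                      # leave specimen i out
        genes = select_genes(X[train], y[train])       # selection sees the training fold only
        model = fit(X[train][:, genes], y[train])
        errors += int(predict(model, X[i, genes]) != y[i])
    return errors / n
```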

Page 41

Prediction on Simulated Null Data

Generation of gene expression profiles
• 20 specimens (P_i is the expression profile for specimen i).
• Log-ratio measurements on 6000 genes.
• P_i ~ MVN(0, I_6000).
• Can we distinguish between the first 10 specimens (Class 1) and the last 10 (Class 2)?

Prediction method
• Compound covariate prediction.
• Compound covariate built from the log-ratios of the 10 most differentially expressed genes.

Page 42

Incomplete (incorrect) Cross-Validation

• Publications are using all the data to select genes and then cross-validating only the parameter-estimation component of model development.
– This is highly biased.
– Many published complex methods make strong claims based on incorrect cross-validation.
• Frequently seen in complex feature-set selection algorithms.
• Some software encourages inappropriate cross-validation.

Page 43

Incomplete (incorrect) Cross-Validation

• Let M(b, D) denote a classification model developed on a set of data D, where the model is of a particular type that is parameterized by a scalar b.
• Use cross-validation to estimate the classification error of M(b, D) for a grid of values of b; call this Err(b).
• Select the value b* that minimizes Err(b).
• Caution: Err(b*) is a biased estimate of the prediction error of M(b*, D).
• This error is made in some commonly used methods.

Page 44

Permutation Distribution of the Cross-validated Misclassification Rate of a Multivariate Classifier

• Randomly permute the class labels and repeat the entire cross-validation.
• Re-do this for all (or 1000) random permutations of the class labels.
• The permutation p-value is the fraction of random permutations that gave as few misclassifications as in the real data.
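A sketch of this procedure, reusing the hypothetical loocv_error helper from above (cv_error is any function mapping (X, y) to a cross-validated error rate):

```python
import numpy as np

def permutation_p_value(X, y, cv_error, n_perm=1000, seed=0):
    """Fraction of label permutations whose cross-validated error is as small as
    (or smaller than) the error obtained with the real class labels."""
    rng = np.random.default_rng(seed)
    observed = cv_error(X, y)
    as_good = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)              # randomly permute the class labels
        if cv_error(X, y_perm) <= observed:      # as few misclassifications as the real data
            as_good += 1
    return as_good / n_perm
```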

Page 45

Invalid Criticisms of Cross-Validation

• “You can always find a set of features that will provide perfect prediction for the training and test sets.”
– For complex models, there may be many sets of features that provide zero training errors.
– A modeling strategy that either selects among those sets or aggregates among those models will have a generalization error which is validly estimated by cross-validation.

Page 46

Potential Sources of Bias in Estimation of Error Rates

• Confounding by sample handling or assay effects
– Design the evaluation carefully.
• Failure to incorporate important sources of future variability
– Assay drift.
• Change in the distribution of un-modeled variables
– In split-sample validation, split samples by institution.

Page 47

More details…

Page 48

[Figure legend] y: phenotype (black vs white); x: expression (position of the point); c: class indicators.

Page 49

Predictive Modeling

Goal: learn a mapping y = f(x; θ).

Need:
1. A model structure
2. A score function
3. An optimization strategy

Categorical y ∈ {c1,…,cm}: classification.
Real-valued y: regression.

Note: we usually assume {c1,…,cm} are mutually exclusive and exhaustive.

Page 50

Error Rate

• In general we write the expected prediction error as

$$\mathrm{EPE} = E_X \, E_{Y\mid X}\big[\, L(Y, f(X)) \mid X \,\big]$$

• with L the loss function (usually quadratic).
• When we are predicting classes, the loss function is a matrix that assigns a loss to every pair (truth, prediction).
• With only two classes this reduces to something quite simple. What is it?
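For reference, a short worked reduction under the usual 0-1 loss (this loss choice is an assumption; the slide leaves the answer to the audience):

$$\mathrm{EPE}
  = E_X\, E_{Y\mid X}\big[\mathbf{1}\{Y \neq f(X)\}\mid X\big]
  = E_X\big[\,P(Y \neq f(X)\mid X)\,\big]
  = P\big(Y \neq f(X)\big),$$

i.e. the expected prediction error is simply the misclassification probability.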

Page 51

Probabilistic Classification

Let p(c_k) = probability that a randomly chosen object comes from class c_k.

Objects from c_k have feature density p(x | c_k, θ_k) (e.g., MVN).

Then: p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)

Bayes error rate:

$$p_B^* = \int \big(1 - \max_k p(c_k \mid x)\big)\, p(x)\, dx$$

• This is the best possible error rate: a lower bound on the error rate of any classifier.

Page 52

Bayes error rate about 6%

Page 53

Classifier Types

Discrimination: direct mapping from x to {c1,…,cm}
– e.g., perceptron, SVM, CART

Regression: model p(c_k | x)
– e.g., logistic regression, CART

Class-conditional: model p(x | c_k, θ_k)
– e.g., “Bayesian classifiers”, LDA

Page 54

Linear Discriminant Analysis

K classes, X an n × p data matrix.

p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)

We could model each class density as multivariate normal:

$$p(x \mid c_k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left\{-\tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right\}$$

LDA assumes Σ_k ≡ Σ for all k. Then:

$$\log\frac{p(c_k \mid x)}{p(c_l \mid x)} = \log\frac{p(c_k)}{p(c_l)} - \tfrac{1}{2}(\mu_k+\mu_l)^T \Sigma^{-1}(\mu_k-\mu_l) + x^T \Sigma^{-1}(\mu_k-\mu_l)$$

This is linear in x.

Page 55

Linear Discriminant Analysis (cont.)

It follows that the classifier should predict argmax_k δ_k(x), where

$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log p(c_k)$$

is the “linear discriminant function”.

If we don't assume the Σ_k's are identical, we get Quadratic DA:

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \log p(c_k)$$

Page 56

Linear Discriminant Analysis (cont.)

We can estimate the LDA parameters via maximum likelihood:

$$\hat{\mu}_k = \sum_{i \in k} x_i / N_k$$

$$\hat{p}(c_k) = N_k / N$$

$$\hat{\Sigma} = \sum_{k=1}^{K} \sum_{i \in k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$$
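A minimal sketch of these estimates and the resulting discriminant scores δ_k(x) (Python/NumPy; the function names are hypothetical):

```python
import numpy as np

def fit_lda(X, y):
    """Maximum-likelihood estimates: class priors, class means, pooled covariance.

    X: (N, p) data matrix; y: class labels. Returns (priors, means, Sigma_inv).
    """
    classes = np.unique(y)
    N = len(X)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    Sigma = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k]) for k in classes)
    Sigma /= (N - len(classes))                       # pooled covariance estimate
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(priors, means, Sigma_inv, x):
    """argmax_k of delta_k(x) = x' Sigma^{-1} mu_k - mu_k' Sigma^{-1} mu_k / 2 + log p(c_k)."""
    def delta(k):
        m = means[k]
        return x @ Sigma_inv @ m - 0.5 * m @ Sigma_inv @ m + np.log(priors[k])
    return max(priors, key=delta)
```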

Page 57

Page 58

Page 59

[Figure panels: example classification regions, labeled LDA and QDA.]

Page 60

Page 61

[Figure: data plotted against the first and second linear discriminants; points are labeled s, c, and v by class.]

Page 62

LDA (cont.)

• Fisher LDA is optimal if the classes are MVN with a common covariance matrix.
• Computational complexity: O(mp²n).

Page 63

Logistic Regression

Note that LDA is linear in x:

$$\log\frac{p(c_k \mid x)}{p(c_0 \mid x)} = \log\frac{p(c_k)}{p(c_0)} - \tfrac{1}{2}(\mu_k+\mu_0)^T \Sigma^{-1}(\mu_k-\mu_0) + x^T \Sigma^{-1}(\mu_k-\mu_0) = \alpha_{k0} + \alpha_k^T x$$

Linear logistic regression looks the same:

$$\log\frac{p(c_k \mid x)}{p(c_0 \mid x)} = \beta_{k0} + \beta_k^T x$$

But the estimation procedure for the coefficients is different. LDA maximizes the joint likelihood [y, X]; logistic regression maximizes the conditional likelihood [y | X]. They usually give similar predictions.

Page 64

Logistic Regression MLE

For the two-class case, the log-likelihood is:

$$l(\beta) = \sum_{i=1}^{n} \big\{\, y_i \log p(x_i;\beta) + (1-y_i)\log(1-p(x_i;\beta)) \,\big\}$$

Since

$$\log\frac{p(x;\beta)}{1-p(x;\beta)} = \beta^T x, \qquad \log p(x;\beta) = \beta^T x - \log(1+\exp(\beta^T x)),$$

this becomes

$$l(\beta) = \sum_{i=1}^{n} \big\{\, y_i \beta^T x_i - \log(1+\exp(\beta^T x_i)) \,\big\}$$

To maximize, we need to solve the (non-linear) score equations:

$$\frac{dl}{d\beta} = \sum_{i=1}^{n} x_i \big(y_i - p(x_i;\beta)\big) = 0$$
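These score equations are usually solved by Newton-Raphson (iteratively reweighted least squares); a minimal sketch, assuming a design matrix that already includes an intercept column:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson solution of sum_i x_i (y_i - p(x_i; beta)) = 0.

    X: (n, p) design matrix (include a column of ones for the intercept); y in {0, 1}.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities p(x_i; beta)
        grad = X.T @ (y - p)                      # score vector dl/dbeta
        W = p * (1.0 - p)                         # IRLS weights
        hess = X.T @ (X * W[:, None])             # observed information (negative Hessian)
        beta += np.linalg.solve(hess, grad)       # Newton step
    return beta
```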

Page 65

Logistic Regression Modeling

South African Heart Disease example (y = MI); Wald statistics:

             Coef.    S.E.    Z score
Intercept   -4.130   0.964   -4.285
sbp          0.006   0.006    1.023
Tobacco      0.080   0.026    3.034
ldl          0.185   0.057    3.219
Famhist      0.939   0.225    4.178
Obesity     -0.035   0.029   -1.187
Alcohol      0.001   0.004    0.136
Age          0.043   0.010    4.184

Page 66

Tree Models

• Easy to understand.
• Can handle mixed data, missing values, etc.
• The sequential fitting method can be sub-optimal.
• Usually grow a large tree and prune it back, rather than attempt to optimally stop the growing process.

Page 67

Page 68

Training Dataset

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no

This follows an example from Quinlan's ID3.

Page 69

Output: A Decision Tree for “buys_computer”

age?
├── <=30  → student?
│            ├── no  → no
│            └── yes → yes
├── 31…40 → yes
└── >40   → credit_rating?
             ├── excellent → no
             └── fair      → yes

Page 70

Page 71

Page 72

Page 73

Confusion matrix

Page 74

Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down recursive divide-and-conquer manner.
– At the start, all the training examples are at the root.
– Attributes are categorical (if continuous-valued, they are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
• Conditions for stopping partitioning
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
– There are no samples left.

Page 75

Information Gain (ID3/C4.5)

• Select the attribute with the highest information gain.
• Assume there are two classes, P and N.
– Let the set of examples S contain p elements of class P and n elements of class N.
– The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

e.g., I(0.5, 0.5) = 1; I(0.9, 0.1) = 0.47; I(0.99, 0.01) = 0.08.

Page 76

Information Gain in Decision Tree Induction

• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sν}.
– If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

$$E(A) = \sum_{i=1}^{\nu} \frac{p_i + n_i}{p+n}\, I(p_i, n_i)$$

• The encoding information that would be gained by branching on A is

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$
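A small sketch of these two quantities, using a hypothetical list-of-dicts representation of the training examples (one dict per row of the buys_computer table):

```python
import math

def info(p, n):
    """I(p, n): information needed to decide the class of an example from S."""
    total = p + n
    ent = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            ent -= frac * math.log2(frac)
    return ent

def gain(examples, attribute, label="buys_computer", positive="yes"):
    """Gain(A) = I(p, n) - E(A), where E(A) sums I over the partition induced by A."""
    p = sum(1 for e in examples if e[label] == positive)
    n = len(examples) - p
    e_a = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        p_i = sum(1 for e in subset if e[label] == positive)
        e_a += len(subset) / len(examples) * info(p_i, len(subset) - p_i)
    return info(p, n) - e_a
```

Applied to the 14-row training table above, gain(examples, "age") should reproduce the 0.246 worked out on the next slide.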

Page 77

Attribute Selection by Information Gain Computation

• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

$$E(\mathrm{age}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Hence

$$\mathrm{Gain}(\mathrm{age}) = I(p, n) - E(\mathrm{age}) = 0.246$$

Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.

Page 78

Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

$$\mathrm{gini}_{\mathrm{split}}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$

• The attribute that provides the smallest gini_split(T) is chosen to split the node.
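A brief sketch of both quantities (the labels arguments are any sequences of class labels at a node):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in T."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(labels_left, labels_right):
    """Size-weighted Gini index of a binary split of a node into T1 and T2."""
    n1, n2 = len(labels_left), len(labels_right)
    n = n1 + n2
    return (n1 / n) * gini(labels_left) + (n2 / n) * gini(labels_right)
```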

Page 79

Avoid Overfitting in Classification

• The generated tree may overfit the training data.
– Too many branches; some may reflect anomalies due to noise or outliers.
– The result is poor accuracy for unseen samples.
• Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
• It is difficult to choose an appropriate threshold.
– Postpruning: remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees.
• Use a set of data different from the training data to decide which is the “best pruned tree”.

Page 80

Approaches to Determine the Final Tree Size

• Use separate training (2/3) and testing (1/3) sets.
• Use cross-validation, e.g., 10-fold cross-validation.
• Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.

Page 81

Nearest Neighbor Methods

• k-NN assigns an unknown object to the most common class of its k nearest neighbors.
• Choice of k? (The bias-variance tradeoff again.)
• Choice of metric?
• All the training data need to be present to classify a new point (“lazy methods”).
• Surprisingly strong asymptotic results (e.g., no decision rule is more than twice as accurate as 1-NN).
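A minimal sketch of the k-NN rule (Euclidean metric; the function name is hypothetical):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new to the most common class among its k nearest training samples.

    X_train: (n, p) array; y_train: length-n array of labels; Euclidean metric.
    """
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))   # distances to all training points
    nearest = np.argsort(dists)[:k]                         # indices of the k closest
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]
```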

Page 82

Flexible Metric NN Classification

Page 83

Naïve Bayes Classification

Recall: p(c_k | x) ∝ p(x | c_k) p(c_k)

Now suppose the features x_1, x_2, …, x_p are conditionally independent given the class C.

Then:

$$p(c_k \mid x) \propto p(c_k) \prod_{j=1}^{p} p(x_j \mid c_k)$$

Equivalently:

$$\log\frac{p(c_k \mid x)}{p(c_{\neg k} \mid x)} = \log\frac{p(c_k)}{p(c_{\neg k})} + \sum_j \log\frac{p(x_j \mid c_k)}{p(x_j \mid c_{\neg k})}$$

The terms of the sum are the “weights of evidence”.
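A minimal sketch of a naïve Bayes classifier with per-feature normal class-conditional densities (one reasonable choice for continuous expression data; the function names are hypothetical):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X), Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)
    return params

def nb_predict(params, x):
    """argmax_k of log p(c_k) + sum_j log p(x_j | c_k), using the independence assumption."""
    def log_posterior(k):
        prior, mu, var = params[k]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=log_posterior)
```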

Page 84

Evidence Balance Sheet

Page 85

Naïve Bayes (cont.)

• Despite the crude conditional independence assumption, it works well in practice (see Friedman, 1997 for a partial explanation).
• Can be further enhanced with boosting, bagging, model averaging, etc.
• Can relax the conditional independence assumptions in myriad ways (“Bayesian networks”).

Page 86

Dietterich (1999)
Analysis of 33 UCI datasets

Page 87

An Example

Page 88

Gene-Expression Profiles in Hereditary Breast Cancer
(cDNA microarrays, parallel gene expression analysis)

• Breast tumors studied: 7 BRCA1+ tumors, 8 BRCA2+ tumors, 7 sporadic tumors.
• Log-ratio measurements of 3226 genes for each tumor after initial data filtering.

Research question: can we distinguish BRCA1+ from BRCA1– cancers, and BRCA2+ from BRCA2– cancers, based solely on their gene expression profiles?

Page 89

BRCA1

α_g      # of significant genes   # of misclassified samples (m)   % of random permutations with m or fewer misclassifications
10^-2    182                      3                                0.4
10^-3    53                       2                                1.0
10^-4    9                        1                                0.2

Page 90

BRCA2

α_g      # of significant genes   m = # of misclassified samples                   % of random permutations with m or fewer misclassifications
10^-2    212                      4 (s11900, s14486, s14572, s14324)               0.8
10^-3    49                       3 (s11900, s14486, s14324)                       2.2
10^-4    11                       4 (s11900, s14486, s14616, s14324)               6.6

Page 91

Classification of BRCA2 Germline Mutations

Classification Method                       LOOCV Prediction Error
Compound Covariate Predictor                14%
Fisher LDA                                  36%
Diagonal LDA                                14%
1-Nearest Neighbor                          9%
3-Nearest Neighbor                          23%
Support Vector Machine (linear kernel)      18%
Classification Tree                         45%