Multi-Class Cancer Classification, Noam Lerner

Posted on 21-Dec-2015


Multi-Class Cancer Classification

Noam Lerner


Our world – very generally

Genes. Gene samples. Our goal: classifying the samples.

Example: We want to be able to determine if a certain sample belongs to a certain type of cancer.


Our problem

Say that we have p genes, and N samples.

In the classical setting p < N, and standard methods classify the samples easily.

What if N < p, as is the case for gene expression data?


The algorithm scheme

Gene screening. Dimension reduction. Classification.

We’ll present 3 variations of this algorithm scheme.


Before gene screening - Classes

Normally, a class of genes is a set of genes that behave similarly under certain conditions.

Example: One can divide genes into a class of genes that indicate a certain type of cancer, and another class of genes that do not.

Taking it one step further:


Multi-classes

Dividing a group of genes into two or more classes is called a “multi-class”.

What is it good for? Distinguishing between types of cancer. Example: Leukemia:

AML, B-ALL, T-ALL


Gene Screening

Generally, gene screening is a method that is used to disregard unimportant genes.

Example: gene predictors.


The Gene Screening process

Suppose we have G classes that represent G types of cancer. (We know which genes belong in each class).

We compare every two classes pairwise and check whether the absolute mean difference

$$|\bar{x}_r - \bar{x}_{r'}|$$

is greater than a certain critical score ($\bar{x}_r$ is the mean of the r-th set of the multi-set).


What is the critical score?

MSE: the mean squared error. $n_r$: the size of the r-th multi-set. t: arises from Student’s t-distribution.

The critical score is

$$t \sqrt{MSE \left( \frac{1}{n_r} + \frac{1}{n_{r'}} \right)}.$$
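As a concrete illustration, the critical score for one pair of classes can be computed with scipy; this is a sketch of the formula above (function and variable names are mine, and the pooled variance is my reading of "MSE"):

```python
import numpy as np
from scipy import stats

def critical_score(samples_r, samples_rp, alpha=0.05):
    """Critical score t * sqrt(MSE * (1/n_r + 1/n_r')) for one pair of classes.

    samples_r, samples_rp: 1-D arrays of one gene's expression values in
    classes r and r'. The pooled variance serves as the MSE estimate.
    """
    n_r, n_rp = len(samples_r), len(samples_rp)
    df = n_r + n_rp - 2                      # degrees of freedom
    mse = ((n_r - 1) * np.var(samples_r, ddof=1) +
           (n_rp - 1) * np.var(samples_rp, ddof=1)) / df   # pooled variance
    t = stats.t.ppf(1 - alpha / 2, df)       # two-sided critical t value
    return t * np.sqrt(mse * (1 / n_r + 1 / n_rp))

# A gene passes the screen for this pair if
# abs(samples_r.mean() - samples_rp.mean()) > critical_score(...)
```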


Student’s t-distribution

The t-distribution is used to estimate the mean and variance of a normally distributed population when the sample size is small.

Fact: The t-distribution depends on the sample size (through its degrees of freedom), but not on the mean or the variance of the population. This lack of dependence on the unknown parameters is what makes the t-distribution important in both theory and practice.

Anecdote: William S. Gosset published a paper on this subject under the pseudonym “Student”, and that is how the distribution got its name.


The student t test

The t-test assesses whether the means of two groups are statistically different from each other.

This analysis is appropriate whenever you want to compare the means of two groups.

It is assumed that the two groups have the same variance.


The student t test (cont.)

Consider the following three situations, with the same mean difference but high, medium, and low variability (shown as a figure in the original slides):


The student t test (cont.)

The first thing to notice about the three situations is that the difference between the means is the same in all three.

We would want to conclude that the two groups are similar in the high-variability case, and the two groups are distinct in the low-variability case.

Conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The student t-test does just this.


The student t test (cont.)

We say that two classes passed the Student t-test if t is greater than a certain parameter, where

$$t = \frac{\bar{x}_r - \bar{x}_{r'}}{\sqrt{\dfrac{\operatorname{var}_r}{n_r} + \dfrac{\operatorname{var}_{r'}}{n_{r'}}}}.$$

Risk level: usually $\alpha = 0.05$. Degrees of freedom: $n_r + n_{r'} - 2$; look the critical value up in a table.
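A full gene-screening pass over all pairwise class comparisons might be sketched as follows (illustrative only; `X`, `labels`, and the `min_passes` threshold are my assumptions, not the article's):

```python
import numpy as np
from scipy import stats

def screen_genes(X, labels, alpha=0.05, min_passes=1):
    """Keep genes whose equal-variance two-sample t-test is significant
    for at least min_passes of the pairwise class comparisons.

    X: (N, p) expression matrix; labels: length-N array of class ids.
    Returns the indices of the retained genes.
    """
    classes = np.unique(labels)
    passes = np.zeros(X.shape[1], dtype=int)
    for i, r in enumerate(classes):
        for rp in classes[i + 1:]:
            # equal_var=True matches the pooled-variance test above
            _, pvals = stats.ttest_ind(X[labels == r], X[labels == rp],
                                       equal_var=True)
            passes += (pvals < alpha)
    return np.flatnonzero(passes >= min_passes)
```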


Dimension Reduction

It appears that we need more than gene screening.

Reminder: We have p genes and N samples, with N < p. Most classification methods (the next phase of the algorithm) assume that p < N.

The solution: dimension reduction, i.e. reducing the gene space dimension from p to K, where K << N.


Dimension Reduction (cont.)

This is done by constructing K gene components and then classifying the cancers based on the constructed K gene components.

Multivariate Partial Least Squares (MPLS) is a dimension reduction method.

Example:


Example

Reducing the dimension from 35 to 3 (5 classes).


Example (cont.)

This is the NCI60 data set, which contains five different types of cancer.


MPLS

Suppose we have G classes, and y indicates the cancer classes 1, …, G. We define a row of indicators for every sample:

$$y_{ik} = \begin{cases} 1 & y_i = k \\ 0 & \text{else,} \end{cases}$$

and these rows form the indicator matrix Y.

Fix a K (our desired reduced dimension).
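Building the indicator matrix Y is plain one-hot encoding; a minimal sketch (variable names are mine, not the article's):

```python
import numpy as np

def indicator_matrix(y, G):
    """One row per sample; column k is 1 iff the sample's class is k.
    Classes are assumed to be coded 1..G."""
    Y = np.zeros((len(y), G))
    Y[np.arange(len(y)), np.asarray(y) - 1] = 1
    return Y
```

Each row contains exactly one 1, marking the sample's cancer class.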


MPLS (cont.)

Suppose X is the gene expression values matrix, and t1, …, tK are linear combinations of the columns of X.

MPLS finds (easily) two unit vectors w and c such that the following expression is maximized:

$$\operatorname{cov}(Xw, Yc)^2.$$

Then MPLS extracts t1, …, tK, and we are done.


Why maximizing the covariance?

If cov(x, y) > 0 then y increases as x increases; if cov(x, y) < 0 then y decreases as x increases.

By maximizing the covariance, Yc increases as Xw increases. That way, Xw gives a good estimate of Yc, and we have found our MPLS components t1, …, tK.
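The unit vectors w, c maximizing cov(Xw, Yc)² are the leading singular vectors of XᵀY (after centering), so the components can be extracted by repeated rank-one deflation. A bare-bones numpy sketch, not the article's exact algorithm:

```python
import numpy as np

def mpls_components(X, Y, K):
    """Extract K MPLS components t_1..t_K from the N x p matrix X and
    the N x G indicator matrix Y.

    Each step takes w, c as the leading singular vectors of X^T Y
    (the maximizers of cov(Xw, Yc)^2 over unit vectors), sets t = Xw,
    and deflates X by the new component.
    """
    X = X - X.mean(axis=0)            # center so dot products act as covariances
    Y = Y - Y.mean(axis=0)
    T = []
    for _ in range(K):
        U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
        w = U[:, 0]                   # leading left singular vector
        t = X @ w                     # the new gene component
        T.append(t)
        X = X - np.outer(t, t @ X) / (t @ t)   # deflate X
    return np.column_stack(T)         # N x K matrix of components
```

Deflation makes successive components exactly orthogonal, which is why classifiers behave well on the reduced matrix T.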


Classification

After we have reduced the dimension of the gene space, we need to actually classify the sample(s).

It’s important to pick a classification method that will work properly after dimension reduction.

We’ll present two different methods: PD and QDA.


PD (Polychotomous Discrimination)

Recall the indicator y that indicates the cancer classes 1, …, G.

Set a vector $\mathbf{x} = (x_1, \ldots, x_p)$. Then the distribution of y depends on $\mathbf{x}$ (we think of y as a random variable).

We also suppose that $P(y = r \mid \mathbf{x}) > 0$ for $r = 1, \ldots, G$.


PD (cont.)

We define

$$g_r(\mathbf{x}) = \log\frac{P(y = r \mid \mathbf{x})}{P(y = 1 \mid \mathbf{x})}.$$

After a few mathematical transitions we get that

$$P(y = r \mid \mathbf{x}) = \frac{\exp g_r(\mathbf{x})}{1 + \sum_{k=2}^{G} \exp g_k(\mathbf{x})}.$$

This is the probability that a sample with gene expression profile $\mathbf{x}$ is of cancer class r.


PD (cont.)

By viewing the previous formula through a suitable statistical model, we can estimate a parameter vector $\boldsymbol{\beta}$ that holds all the data, by maximizing the likelihood.

The likelihood can be maximized only if there are more samples (N) than parameters (p), and by using dimension reduction we got just that.


PD (cont.)

So, instead of looking at $\mathbf{x} = (x_1, \ldots, x_p) \in \mathbb{R}^p$, we’ll look at the corresponding gene component profile $\hat{\mathbf{x}} \in \mathbb{R}^K$.

Now, let’s look at the new probabilities that rely on the new $\hat{\mathbf{x}}$: $P(r \mid \hat{\mathbf{x}})$.

Finally, we’ll say that $\hat{\mathbf{x}}$ (and therefore $\mathbf{x}$) belongs to the r-th cancer class if

$$P(s \mid \hat{\mathbf{x}}) \le P(r \mid \hat{\mathbf{x}}) \quad \text{for all } s \neq r.$$

A more detailed explanation of PD is given in the presentation’s appendix.
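PD on the reduced components is multinomial (softmax) logistic regression. A minimal gradient-ascent sketch in numpy (illustrative; the article fits the model by Newton-Raphson, this version is over-parameterized with no reference class, and all names are mine):

```python
import numpy as np

def fit_pd(T, y, G, steps=2000, lr=0.1):
    """Fit a polychotomous (multinomial logistic) model on the N x K
    component matrix T, classes y coded 1..G, by gradient ascent on
    the log-likelihood. Returns the (K+1) x G coefficient matrix."""
    X = np.column_stack([np.ones(len(T)), T])              # prepend intercept
    Z = (np.asarray(y)[:, None] == np.arange(1, G + 1)).astype(float)
    B = np.zeros((X.shape[1], G))
    for _ in range(steps):
        S = np.exp(X @ B)
        P = S / S.sum(axis=1, keepdims=True)               # P(y = r | x_i)
        B += lr * X.T @ (Z - P) / len(X)                   # log-likelihood gradient
    return B

def predict_pd(B, T):
    """Assign each sample to the class with the largest P(r | x-hat)."""
    X = np.column_stack([np.ones(len(T)), T])
    return np.argmax(X @ B, axis=1) + 1
```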


QDA (Quadratic Discriminant Analysis)

Recall the indicator y that indicates the cancer classes 1, …, G. Consider the following multivariate normal model (for each cancer class):

$$\mathbf{x} \mid y = r \;\sim\; N(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r).$$


QDA (cont.)

Suppose $C_r$ is the classification region of the r-th cancer class; then

$$C_r = \{\mathbf{x} \in \mathbb{R}^p : p_s f_s(\mathbf{x}) \le p_r f_r(\mathbf{x}) \ \text{for all } s \neq r\},$$

where $p_r = P(y = r)$ and $f_r$ is the pdf of $\mathbf{x} \mid y = r$.


QDA (cont.)

Again, instead of looking at $\mathbf{x} = (x_1, \ldots, x_p) \in \mathbb{R}^p$, we’ll look at the corresponding gene component profile $\hat{\mathbf{x}} \in \mathbb{R}^K$, and get the desired classification.
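A bare-bones QDA rule on the reduced components might look like this: estimate a prior, mean, and covariance per class, then assign each sample to the class maximizing log(p_r f_r(x)). A sketch under the normal model above (names are mine):

```python
import numpy as np

def fit_qda(T, y, G):
    """Per-class prior, mean, and covariance for QDA on the N x K
    component matrix T, classes coded 1..G."""
    params = []
    for r in range(1, G + 1):
        Tr = T[y == r]
        params.append((len(Tr) / len(T),           # prior p_r = P(y = r)
                       Tr.mean(axis=0),             # mu_r
                       np.cov(Tr, rowvar=False)))   # Sigma_r
    return params

def predict_qda(params, T):
    """Classify each row of T by the maximal log(p_r f_r(x)) score."""
    scores = []
    for p_r, mu, Sig in params:
        d = T - mu
        inv = np.linalg.inv(Sig)
        logdet = np.linalg.slogdet(Sig)[1]
        # log p_r - (1/2)(log|Sigma_r| + Mahalanobis distance); constants dropped
        scores.append(np.log(p_r)
                      - 0.5 * (logdet + np.einsum('ij,jk,ik->i', d, inv, d)))
    return np.argmax(scores, axis=0) + 1
```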


Review - the big picture

Gene screening – allows us to get rid of genes that won’t tell us anything.

Dimension reduction – allows us to reduce the gene space – and work on the data.

Classification – allows us to decide if a sample has a cancer of a certain multi-class.


Just before the algorithm

We would want a way to assess if we generated a correct classification.

In order to do that – we use LOOCV.


LOOCV

LOOCV stands for Leave-One-Out Cross-Validation.

In this process, we remove one data point from our data, run our algorithm, and try to estimate the removed data point using our results, as if we didn’t know the original data point. Then, we assess the error.

This step is repeated for every data point, and finally we accumulate the errors in some fashion into a final error estimate.
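The LOOCV loop described above can be sketched generically over any fit/predict pair (illustrative, not the article's code):

```python
import numpy as np

def loocv_error(X, y, fit, predict):
    """Leave-one-out cross-validation error rate.

    fit(X, y) -> model; predict(model, X) -> predicted labels.
    Each sample is held out, the model is refit on the rest, and the
    held-out sample is predicted as if it were unseen."""
    errors = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i            # every sample except i
        model = fit(X[keep], y[keep])
        errors += predict(model, X[i:i + 1])[0] != y[i]
    return errors / len(X)
```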


The 1st algorithm variation

1. Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.

2. Dimension reduction: use MPLS to reduce X to T, where T is of size N x K.

3. Classification: for i = 1 to N:
   1. Leave out sample (row) i of T.
   2. Fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.


The 2nd algorithm variation

1. Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.

2. For i = 1 to N:
   1. Leave out sample (row) i of the expression matrix X, creating X-i.
   2. Dimension reduction: use MPLS to reduce X-i to T-i, where T-i is of size (N-1) x K.
   3. Classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.


Class question

Q: What is the difference between the 1st and 2nd variations?

A1: In the 1st variation, steps 1 and 2 are fixed with respect to LOOCV. Therefore, the effect of gene screening and dimension reduction on the classification cannot be assessed.

A2: In the 2nd variation, we can assess the effect of the dimension reduction.


More on the 1st variation

Results show that the 1st variation does not yield good results: the classification error rates were more optimistic than the expected error rates.

Taking it to the next level:


The 3rd algorithm variation

1. For i = 1 to N:
   1. Leave out sample (row) i of the original expression matrix X0.
   2. Gene screening: select a set S-i of m genes, giving an expression matrix X-i of size (N-1) x m.
   3. Dimension reduction: use MPLS to reduce X-i to T-i, where T-i is of size (N-1) x K.
   4. Classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
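The third variation, with every phase inside the LOOCV loop, might be sketched as follows. `screen`, `reduce`, `fit`, and `predict` stand for the three phases and are my names; `reduce` is assumed to return the component matrix plus a projection function so the left-out sample can be mapped into the same component space:

```python
import numpy as np

def variation3_error(X0, y, screen, reduce, fit, predict):
    """LOOCV error with gene screening AND dimension reduction redone
    inside every fold, so the left-out sample never influences them."""
    errors = 0
    for i in range(len(X0)):
        keep = np.arange(len(X0)) != i
        genes = screen(X0[keep], y[keep])               # S_-i: selected gene indices
        T, project = reduce(X0[keep][:, genes], y[keep])  # T_-i and its projector
        model = fit(T, y[keep])
        t_i = project(X0[i:i + 1][:, genes])            # map sample i to components
        errors += predict(model, t_i)[0] != y[i]
    return errors / len(X0)
```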


Class question

Q: What is the difference between the 2nd and 3rd variations?

A: The gene screening stage is fixed with respect to LOOCV in the 2nd variation, and isn’t in the 3rd variation.

That allows us to assess the error in the gene screening stage in the 3rd variation.


About the 3 variations

The 3rd variation is the only one that allows us to check the correctness of our model.

Why? Because it is the only variation where we use LOOCV to delete a sample from our input matrix before any stage runs, and then try to estimate it.

In the other two variations we estimate a sample after we have already used it in our process.


Results

Acute Leukemia Data. Number of samples: N = 72. Number of genes: p = 3490. The multi-class:

AML: 25 samples. B-ALL: 38 samples. T-ALL: 9 samples.

New reduced dimension: K = 3.


Results (cont.) Notations:

Numbers in brackets: the number of pairwise comparisons for which we required the absolute mean difference to exceed the critical score.

Numbers not in brackets: the number of genes that passed.

In A2 – the three numbers are the min, mean, and max number of genes selected. (The gene screening process selects differently every time.)

Data: the error rate.

Best result: QDA.


Article Criticism

The article does present a model that seems to be appropriate to solve the problem.

However, results show that there is a certain error rate (about 1/20).

The article was not clear on several subjects.

Nonetheless, it was interesting to read.


Questions?

Thank you for listening.


References

The article: “Multi-class cancer classification via partial least squares with gene expression profiles”, Danh V. Nguyen and David M. Rocke.

Student’s t-distribution: http://en.wikipedia.org/wiki/T_distribution

Student’s t-test: http://www.socialresearchmethods.net/kb/stat_t.htm

LOOCV: http://www-2.cs.cmu.edu/~schneide/tut5/node42.html


Appendix – Polychotomous Discrimination: an explicit explanation.

Why do we define

$$g_r(\mathbf{x}) = \log\frac{P(y = r \mid \mathbf{x})}{P(y = 1 \mid \mathbf{x})}\,?$$

To avoid calculating $P(\mathbf{x})$.

Explanation: remember that $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$. So:

$$\frac{P(y = r \mid \mathbf{x})}{P(y = 1 \mid \mathbf{x})} = \frac{P(y = r, \mathbf{x}) / P(\mathbf{x})}{P(y = 1, \mathbf{x}) / P(\mathbf{x})} = \frac{P(y = r, \mathbf{x})}{P(y = 1, \mathbf{x})},$$

so we don’t have to calculate $P(\mathbf{x})$.


PD

We assume we can write

$$g_r(\mathbf{x}) = \log\frac{P(y = r \mid \mathbf{x})}{P(y = 1 \mid \mathbf{x})} = \beta_{r0} + \beta_{r1} x_1 + \beta_{r2} x_2 + \cdots + \beta_{rp} x_p = \boldsymbol{\beta}_r^T \mathbf{x}.$$

Remembering that

$$P(y = r \mid \mathbf{x}) = \frac{\exp g_r(\mathbf{x})}{1 + \sum_{k=2}^{G} \exp g_k(\mathbf{x})},$$

we can get to

$$P(y = r \mid \mathbf{x}) = \frac{\exp(\boldsymbol{\beta}_r^T \mathbf{x})}{1 + \sum_{k=2}^{G} \exp(\boldsymbol{\beta}_k^T \mathbf{x})}.$$

This is our polychotomous regression model. Next, we assign $\boldsymbol{\beta}$ to that formula (replacing $g_r(\mathbf{x})$ with $\boldsymbol{\beta}_r^T \mathbf{x}$).


PD

Next, we define

$$\boldsymbol{\beta} = (\boldsymbol{\beta}_1^T, \ldots, \boldsymbol{\beta}_K^T)^T.$$

This holds our whole model. Now we want to estimate $\boldsymbol{\beta}$ using MLE, Maximum Likelihood Estimation. We’ll describe how to do that.


PD

Defining a notation:

$$c(\mathbf{x}_i, \boldsymbol{\beta}) = \log\Big(1 + \sum_{r=2}^{G} \exp g_r(\mathbf{x}_i)\Big).$$

Now, rewriting the formula from two slides back:

$$P(y = r \mid \mathbf{x}_i) = \exp\{g_r(\mathbf{x}_i) - c(\mathbf{x}_i, \boldsymbol{\beta})\}.$$

So, by taking log, we get:

$$\log P(y = r \mid \mathbf{x}_i) = g_r(\mathbf{x}_i) - c(\mathbf{x}_i, \boldsymbol{\beta}).$$

Next, define a row of indicators for a sample $\mathbf{x}_i$:

$$\mathbf{z}_i = (z_i^1, \ldots, z_i^G)^T, \qquad z_i^r = 0 \text{ or } 1,$$

where $z_i^r$ states whether the sample $\mathbf{x}_i$ belongs to cancer type $r$, $r = 1, \ldots, G$, and $\mathbf{x}_i = (x_i^0, x_i^1, \ldots, x_i^p)$ with $x_i^0 = 1$.


PD

Now define a matrix:

$$\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_N)^T \in M_{N \times G}.$$

Notice that $\sum_{r=1}^{G} z_i^r = 1$, meaning that in every row of $\mathbf{Z}$, the sample was classified to exactly one cancer class.

Using these notations, we conclude that the likelihood for N independent samples is:

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{N} P(y = 1 \mid \mathbf{x}_i)^{z_i^1} P(y = 2 \mid \mathbf{x}_i)^{z_i^2} \cdots P(y = G \mid \mathbf{x}_i)^{z_i^G} = \prod_{i=1}^{N} \prod_{r=1}^{G} P(y = r \mid \mathbf{x}_i)^{z_i^r}.$$


PD

Taking log, we get the log-likelihood (which is easier to compute):

$$l(\boldsymbol{\beta}) = \sum_{i=1}^{N} \big[ z_i^1 \log P(y = 1 \mid \mathbf{x}_i) + z_i^2 \log P(y = 2 \mid \mathbf{x}_i) + \cdots + z_i^G \log P(y = G \mid \mathbf{x}_i) \big] = \sum_{i=1}^{N} \sum_{r=1}^{G} z_i^r \log P(y = r \mid \mathbf{x}_i).$$


PD

Next, remembering that $\sum_{r=1}^{G} z_i^r = 1$, we get that

$$l(\boldsymbol{\beta}) = \sum_{i=1}^{N} \sum_{r=1}^{G} z_i^r\, g_r(\mathbf{x}_i) - \sum_{i=1}^{N} c(\mathbf{x}_i, \boldsymbol{\beta}).$$

Now, this expression can be maximized to obtain the MLE using the Newton-Raphson method.

By Albert and Anderson’s result, the MLE fails to exist when the classes are completely separated, that is, when there exists a vector $\boldsymbol{b} = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_G)$ such that $(\boldsymbol{b}_r - \boldsymbol{b}_j)^T \mathbf{x}_i > 0$ for every $j \neq r$ and every $i \in C_r$, where $C_r$ is the index set identifying all samples in class r.
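The log-likelihood above can be maximized numerically. A sketch with scipy, using a general-purpose optimizer in place of the article's Newton-Raphson iteration (variable names and the toy data are mine; class 1 serves as the reference, so g_1 = 0):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta_flat, X, Z):
    """-l(beta). X is N x (p+1) with a leading column of ones; Z is the
    N x G indicator matrix; beta holds one column per class 2..G."""
    N, G = Z.shape
    B = beta_flat.reshape(X.shape[1], G - 1)
    g = np.column_stack([np.zeros(N), X @ B])    # g_r(x_i), with g_1 = 0
    c = np.log(np.exp(g).sum(axis=1))            # c(x_i, beta)
    return -((Z * g).sum() - c.sum())

# Toy data: two components, three classes
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.integers(1, 4, size=30)
Z = (y[:, None] == np.arange(1, 4)).astype(float)
res = minimize(neg_log_likelihood, np.zeros(X.shape[1] * 2), args=(X, Z))
```

At beta = 0 every class gets probability 1/G, so the starting value of the objective is N log G; the optimizer can only improve on it.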


Appendix References

Article appendices: http://dnguyen.ucdavis.edu/.html/SUP_cla2/SupplementalAppendix.pdf

Newton Raphson method - http://en.wikipedia.org/wiki/Newton-Raphson_method

On the Existence of Maximum Likelihood Estimates in Logistic Regression Models. (A. Albert; J. A. Anderson 1984). http://www.qiji.cn/eprint/abs/2376.html


Abstract

This presentation deals with multi-class cancer classification: the process of classifying samples into multiple types of cancer. The article describes a 3-phase algorithm scheme for this classification; the 3 phases are gene screening, dimension reduction, and classification. We present one gene screening method, one dimension reduction method (MPLS), and two classification methods (PD and QDA), which we then compare. The presentation also presents concepts like class, multi-class, t-test, and LOOCV.