1
Multi-Class Cancer Classification
Noam Lerner
2
Our world – very generally
Genes. Gene samples. Our goal: classifying the samples.
Example: We want to be able to determine if a certain sample belongs to a certain type of cancer.
3
Our problem
Say that we have p genes, and N samples.
Normally, p<N, so it’s easy to classify samples.
What if N<p?
4
The algorithm scheme
Gene screening. Dimension reduction. Classification.
We’ll present 3 variations of this algorithm scheme.
5
Before gene screening - Classes
Normally, a class of genes is a set of genes that behave similarly under certain conditions.
Example: One can divide genes into a class of genes that indicate a certain type of cancer, and another class of genes that do not.
Taking it one step further:
6
Multi-classes
Dividing a group of genes into two or more classes is called a "multi-class".
What is it good for? Distinguishing between types of cancer. Example: Leukemia:
AML B-ALL T-ALL
7
Gene Screening
Generally, gene screening is a method that is used to disregard unimportant genes.
Example: gene predictors.
8
The Gene Screening process
Suppose we have G classes that represent G types of cancer. (We know which genes belong in each class).
We compare every two classes r and r' pair-wise and check whether the absolute mean difference |x̄_r − x̄_r'| is greater than a certain critical score. (x̄_r is the mean of the r-th set of the multi-class.)
9
What is the critical score?
The critical score is
t · sqrt( MSE · (1/n_r + 1/n_r') )
where MSE is the mean squared error, n_r is the size of the r-th multi-set, and t arises from Student's t-distribution.
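The screening rule above can be sketched numerically. The following is a minimal illustration with made-up expression values; the critical t value is passed in as a fixed parameter rather than looked up in a table, and the function name is mine:

```python
import numpy as np

def passes_screening(group_r, group_s, t_crit):
    """Check whether the absolute mean difference of two gene-expression
    groups exceeds the critical score t * sqrt(MSE * (1/n_r + 1/n_s))."""
    group_r, group_s = np.asarray(group_r, float), np.asarray(group_s, float)
    n_r, n_s = len(group_r), len(group_s)
    # Pooled mean squared error (pooled within-group variance).
    mse = (((group_r - group_r.mean()) ** 2).sum()
           + ((group_s - group_s.mean()) ** 2).sum()) / (n_r + n_s - 2)
    critical = t_crit * np.sqrt(mse * (1 / n_r + 1 / n_s))
    return abs(group_r.mean() - group_s.mean()) > critical

# Clearly separated groups pass; identical groups do not.
print(passes_screening([5.1, 5.3, 4.9], [1.0, 1.2, 0.8], t_crit=2.78))  # True
print(passes_screening([1.0, 1.2, 0.8], [1.0, 1.2, 0.8], t_crit=2.78))  # False
```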
10
Student’s t-distribution
The t-distribution is used to estimate the mean and variance of a normally distributed population when the sample size is small.
Fact: The t-distribution depends on the sample size, but not on the mean or the variance of the items in the population. This lack of dependence is what makes the t-distribution important in both theory and practice.
Anecdote: William S. Gosset published a paper on this subject under the pseudonym "Student", and that is how the distribution got its name.
11
The student t test
The t-test assesses whether the means of two groups are statistically different from each other.
This analysis is appropriate whenever you want to compare the means of two groups.
It is assumed that the two groups have the same variance.
12
The student t test (cont.)
Consider the next three situations:
13
The student t test (cont.)
The first thing to notice about the three situations is that the difference between the means is the same in all three.
We would want to conclude that the two groups are similar in the high-variability case, and the two groups are distinct in the low-variability case.
Conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The student t-test does just this.
14
The student t test (cont.)
We say that two classes pass the Student t test if
t = (x̄_r − x̄_r') / sqrt( var_r / n_r + var_r' / n_r' )
is greater than a certain critical value, which we look up in a t-table using:
Risk level: usually α = 0.05.
Degrees of freedom: n_r + n_r' − 2.
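The t statistic itself is easy to compute directly. A small numpy sketch with made-up data (the function name is mine):

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t statistic: (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Two groups whose means differ far more than their spread: large t.
t = t_statistic([10.1, 9.8, 10.3, 10.0], [8.0, 7.9, 8.2, 8.1])
print(round(t, 2))  # 16.33
```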
15
Dimension Reduction
It appears that we need more than gene screening.
Reminder: We have p genes and N samples, with N < p. Most classification methods (the next phase of the algorithm) assume that p < N.
The solution is dimension reduction: reducing the gene-space dimension from p to K, where K << N.
16
Dimension Reduction (cont.)
This is done by constructing K gene components and then classifying the cancers based on the constructed K gene components.
Multivariate Partial Least Squares (MPLS) is a dimension reduction method.
Example:
17
Example
Reducing the dimension from 35 to 3 (5 classes).
18
Example (cont.)
This is the NCI60 data set, which contains 5 different types of cancer.
19
MPLS
Suppose we have G classes, and suppose y indicates the cancer classes 1, ..., G. We define an indicator row for every sample i:
y_ik = 1 if y_i = k, 0 else
and collect these rows into a matrix Y.
Fix a K (our desired reduced dimension).
20
MPLS (cont.)
Suppose X is the gene expression values matrix.
Suppose t1,…,tK are linear combinations of the columns of X. Then the MPLS finds (easily) two unit vectors w and c such that the following expression is maximized:
cov(Xw, Yc)²
Then the MPLS extracts t1,…,tK, and we are done.
21
Why maximizing the covariance?
If cov(x,y)>0 then y increases as x increases. If cov(x,y)<0 then y decreases as x increases.
By maximizing the covariance, we get that Yc increases as Xw increases. That way, we get a good estimation of Yc by Xw, and we have found our MPLS components: t1,…,tK.
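The first component can be sketched in a few lines of numpy: for centered X and Y, the unit vectors w and c maximizing cov(Xw, Yc)² are the leading singular vectors of Xᵀ·Y. This is a minimal sketch with synthetic data (variable names mine); a full MPLS extracts later components by deflating X, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, G = 20, 35, 5

# Toy data: expression matrix X and one-hot class indicator matrix Y.
labels = rng.integers(0, G, size=N)
X = rng.normal(size=(N, p)) + labels[:, None]  # class-dependent shift
Y = np.eye(G)[labels]                          # y_ik = 1 if sample i is in class k

# Center the columns, as PLS usually assumes.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)

# Unit vectors w, c maximizing cov(Xw, Yc)^2 are the leading singular
# vectors of Xc^T Yc.
U, s, Vt = np.linalg.svd(Xc.T @ Yc)
w, c = U[:, 0], Vt[0]
t1 = Xc @ w  # first MPLS gene component, one value per sample

print(t1.shape, np.linalg.norm(w), np.linalg.norm(c))
```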
22
Classification
After we have reduced the dimension of the gene space, we need to actually classify the sample(s).
It’s important to pick a classification method that will work properly after dimension reduction.
We’ll present two different methods: PD and QDA.
23
PD (Polychotomous Discrimination)
Recall the indicator y that indicates the cancer classes 1,…,G.
Set a vector x = (x_1, ..., x_p). Then the distribution of y depends on x (we think of y as a random variable). We also suppose that
P(y = r | x) > 0 for r = 1, ..., G.
24
PD (cont.)
We define
g_r(x) = log[ P(y = r | x) / P(y = 1 | x) ]
After a few mathematical transitions we get that
P(y = r | x) = exp(g_r(x)) / Σ_{k=1}^{G} exp(g_k(x)) = exp(g_r(x)) / (1 + Σ_{k=2}^{G} exp(g_k(x)))
(since g_1(x) = 0). This is the probability that a sample with gene expression profile x is of cancer class r.
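The map from the g_r values to class probabilities is a softmax with a baseline class. A minimal sketch, assuming class 1 is the baseline with g_1 = 0 (the function name is mine):

```python
import numpy as np

def class_probabilities(g):
    """P(y=r|x) = exp(g_r) / (1 + sum_k exp(g_k)), where g holds
    g_2,...,g_G evaluated at x and the baseline class has g_1 = 0
    (its exp contributes the leading 1 in the denominator)."""
    g = np.asarray(g, float)
    denom = 1.0 + np.exp(g).sum()
    p_baseline = 1.0 / denom
    return np.concatenate([[p_baseline], np.exp(g) / denom])

# G = 3 classes: baseline plus two others.
p = class_probabilities([0.5, -1.0])
print(p)  # three probabilities summing to 1
```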
25
PD (cont.)
By looking at the previous formula through a certain mathematical model, we can estimate a parameter vector β that holds all the data, by maximizing the likelihood.
β can be estimated only if there are more samples (N) than parameters, and by using dimension reduction, we got just that.
26
PD (cont.)
So, instead of looking at x = (x_1, ..., x_p) ∈ ℝ^p, we'll look at the corresponding gene component profile, x̂ ∈ ℝ^K.
Now, let's look at the new probabilities that rely on the new x̂: P(r | x̂).
Finally, we'll say that x̂ (and therefore x) belongs to the r-th cancer class if
P(s | x̂) ≤ P(r | x̂) for every s ≠ r.
A more detailed explanation of PD is given in the presentation's appendix.
27
QDA (Quadratic Discriminant Analysis)
Recall the indicator y that indicates the cancer classes 1,…,G. Consider the following multivariate normal model (for each cancer class):
x | y = r ~ N(μ_r, Σ_r).
28
QDA (cont.)
Suppose C_r is the classification region of the r-th cancer class; then
C_r = { x ∈ ℝ^p : π_r f_r(x) ≥ π_s f_s(x) for every s ≠ r }
where π_r = P(y = r) and f_r is the pdf of x | y = r.
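The rule "assign x to the class maximizing π_r f_r(x)" is usually evaluated in log space for numerical stability. A minimal numpy sketch with made-up class parameters (function and variable names are mine):

```python
import numpy as np

def qda_classify(x, mus, covs, priors):
    """Assign x to the class r maximizing pi_r * f_r(x), where f_r is a
    multivariate normal density; computed in log space, constant dropped."""
    x = np.asarray(x, float)
    scores = []
    for mu, cov, prior in zip(mus, covs, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        # log pi_r - (1/2) log|Sigma_r| - (1/2)(x-mu)^T Sigma_r^{-1} (x-mu)
        scores.append(np.log(prior) - 0.5 * logdet
                      - 0.5 * diff @ np.linalg.solve(cov, diff))
    return int(np.argmax(scores))

# Two toy classes with different means and covariances.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2 * np.eye(2)]
priors = [0.5, 0.5]
print(qda_classify([0.2, -0.1], mus, covs, priors))  # 0
print(qda_classify([2.9, 3.1], mus, covs, priors))   # 1
```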
29
QDA (cont.)
Again, instead of looking at x = (x_1, ..., x_p) ∈ ℝ^p, we'll look at the corresponding gene component profile, x̂ ∈ ℝ^K, and get the desired classification.
30
Review - the big picture
Gene screening – allows us to get rid of genes that won’t tell us anything.
Dimension reduction – allows us to reduce the gene space and work on the data.
Classification – allows us to decide if a sample has a cancer of a certain multi-class.
31
Just before the algorithm
We would want a way to assess if we generated a correct classification.
In order to do that – we use LOOCV.
32
LOOCV
LOOCV stands for Leave-One-Out Cross-Validation.
In this process, we remove one data point from our data, run our algorithm, and try to estimate the removed data point using our results, as if we didn't know the original data point. Then, we assess the error.
This step is repeated for every data point, and finally, we accumulate the errors in some form into a final error estimate.
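The process above can be sketched generically. This is a minimal illustration, not tied to the deck's classifiers: a toy nearest-centroid classifier stands in for PD/QDA, and all names are mine:

```python
import numpy as np

def loocv_error(X, y, fit_predict):
    """Leave-one-out cross-validation: hold out each sample in turn,
    fit on the rest, predict the held-out sample, accumulate errors."""
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        pred = fit_predict(X[mask], y[mask], X[i])
        errors += pred != y[i]
    return errors / len(y)

def nearest_centroid(X_train, y_train, x_new):
    """Toy classifier: predict the class whose training centroid is closest."""
    classes = np.unique(y_train)
    dists = [np.linalg.norm(x_new - X_train[y_train == c].mean(0)) for c in classes]
    return classes[int(np.argmin(dists))]

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(loocv_error(X, y, nearest_centroid))  # 0.0 on well-separated classes
```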
33
The 1st algorithm variation
1. Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
2. Dimension reduction: Use MPLS to reduce X to T where T is of size N x K.
3. Classification: For i=1 to N do:
1. Leave out sample (row) i of T.
2. Fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
34
The 2nd algorithm variation
1. Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
2. For i=1 to N do:
1. Leave out sample (row) i of the expression matrix X, creating X-i.
2. Dimension reduction: Use MPLS to reduce X-i to T-i, where T-i is of size N-1 x K.
3. Classification: Fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
35
Class question
Q: What is the difference between the 1st and 2nd variations?
A1: In the 1st variation, steps 1 and 2 are fixed with respect to LOOCV. Therefore, the effect of gene screening and dimension reduction on the classification cannot be assessed.
A2: In the 2nd variation, we can assess the effect of the dimension reduction.
36
More on the 1st variation
Results show that the 1st variation does not yield good results (the classification error rates were more optimistic than the expected error rates).
Taking it to the next level:
37
The 3rd algorithm variation
1. For i=1 to N do:
1. Leave out sample (row) i of the original expression matrix X0.
2. Gene screening: select a set S-i of m genes, giving an expression matrix X-i of size N-1 x m.
3. Dimension reduction: Use MPLS to reduce X-i to T-i, where T-i is of size N-1 x K.
4. Classification: Fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
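The whole 3rd variation fits in one loop. This is a compact, runnable sketch under loudly simplifying assumptions: variance-based screening stands in for the pairwise t screen, a single SVD (no deflation) stands in for full MPLS, and nearest-centroid stands in for PD/QDA; all names are mine:

```python
import numpy as np

def screen_genes(X, m):
    """Stand-in for gene screening: keep the m highest-variance genes."""
    return np.argsort(X.var(0))[::-1][:m]

def pls_components(X, Y, K):
    """First K MPLS-style components via the SVD of Xc^T Yc (no deflation)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return Xc @ U[:, :K], U[:, :K], X.mean(0)

def nearest_centroid(T, y, t_new):
    classes = np.unique(y)
    d = [np.linalg.norm(t_new - T[y == c].mean(0)) for c in classes]
    return classes[int(np.argmin(d))]

def variation3_loocv(X0, y, m, K):
    """3rd variation: screening AND reduction are redone inside every fold."""
    n_classes = len(np.unique(y))
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        genes = screen_genes(X0[mask], m)               # screen without sample i
        Y = np.eye(n_classes)[y[mask]]                  # one-hot class indicators
        T, W, mu = pls_components(X0[mask][:, genes], Y, K)
        t_i = (X0[i, genes] - mu) @ W                   # project held-out sample
        errors += nearest_centroid(T, y[mask], t_i) != y[i]
    return errors / len(y)

# Synthetic, well-separated data: 3 classes, 24 samples, 50 genes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 8)
X0 = rng.normal(size=(24, 50)) + 2.0 * np.eye(3)[y] @ rng.normal(size=(3, 50))
err = variation3_loocv(X0, y, m=20, K=3)
print(err)
```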
38
Class question
Q: What is the difference between the 2nd and 3rd variations?
A: The gene screening stage is fixed with respect to LOOCV in the 2nd variation, and isn’t in the 3rd variation.
That allows us to assess the error in the gene screening stage in the 3rd variation.
39
About the 3 variations
The 3rd variation is the only one that allows us to check the correctness of our model.
Why? Because this is the only variation where we use LOOCV to delete a sample from our input matrix before the whole process, and then try to estimate it.
In the other two variations, we estimate a sample after we have already used it in our process.
40
Results
Acute Leukemia Data
Number of samples: N = 72. Number of genes: p = 3490.
The multi-class:
AML: 25 samples. B-ALL: 38 samples. T-ALL: 9 samples.
New reduced dimension: K = 3.
41
Results (cont.) Notations:
Numbers in brackets: the number of times we demanded that the pairwise absolute mean difference exceed the critical score.
Numbers not in brackets: the number of genes that passed.
In A2 – the three numbers are the min-mean-max number of genes selected. (The Gene screening process selects differently every time)
Data: the error rate.
Best result: QDA.
42
Article Criticism
The article does present a model that seems to be appropriate to solve the problem.
However, results show that there is a certain error rate (about 1/20).
The article was not clear on several subjects.
Nonetheless, it was interesting to read.
43
Questions?
Thank you for listening.
44
References
The article: Multi-Class Cancer Classification via Partial Least Squares with Gene Expression Profiles, by Danh V. Nguyen and David M. Rocke.
Student’s t distribution – http://en.wikipedia.org/wiki/T_distribution
Student t test: http://www.socialresearchmethods.net/kb/stat_t.htm
LOOCV: http://www-2.cs.cmu.edu/~schneide/tut5/node42.html
45
Appendix - Polychotomous Discrimination – explicit explanation.
Why do we define g_r(x) = log[ P(y = r | x) / P(y = 1 | x) ]? To avoid calculating P(x).
Explanation: Remember that P(A | B) = P(A, B) / P(B). So:
P(y = r | x) / P(y = 1 | x) = [ P(x, y = r) / P(x) ] / [ P(x, y = 1) / P(x) ] = P(x, y = r) / P(x, y = 1)
So we don't have to calculate P(x).
46
PD
We assume we can write
g_r(x) = log[ P(y = r | x) / P(y = 1 | x) ] = β_r0 + β_r1 x_1 + β_r2 x_2 + ... + β_rp x_p = β_r^T x
Remembering that, we can get to
P(y = r | x) = exp(g_r(x)) / (1 + Σ_k exp(g_k(x)))
This is our polychotomous regression model. Next, we assign beta to that formula (replacing g_r(x) with β_r^T x):
P(y = r | x) = exp(β_r^T x) / (1 + Σ_k exp(β_k^T x))
47
PD
Next, we define
β = (β_1^T, ..., β_G^T)^T
This holds our whole model. Now we want to estimate β using MLE (Maximum Likelihood Estimation). We'll describe how to do that.
48
PD
Defining a notation:
c(β, x_i) = log( 1 + Σ_r exp g_r(x_i) )
Now, re-writing the formula from two slides back:
P(y = r | x_i) = exp( g_r(x_i) − c(β, x_i) )
So, by taking log, we get:
log P(y = r | x_i) = g_r(x_i) − c(β, x_i)
Next, define a row of indicators for a sample x_i:
z_i = (z_i^1, ..., z_i^G)^T, where z_i^r = 0 or z_i^r = 1, and z_i^r states whether the sample x_i belongs to cancer type r (r = 1, ..., G).
We also write x_i = (x_i^0, x_i^1, ..., x_i^p) with x_i^0 = 1.
49
PD
Now, define a matrix:
Z = (z_1, ..., z_N)^T (an N x G matrix)
Notice that Σ_{r=1}^{G} z_i^r = 1, meaning that in every row of Z, the sample was classified to exactly one cancer class.
Using these notations, we conclude that the likelihood for N independent samples is:
L(β) = Π_{i=1}^{N} P(y = 1 | x_i)^{z_i^1} · P(y = 2 | x_i)^{z_i^2} · ... · P(y = G | x_i)^{z_i^G} = Π_{i=1}^{N} Π_{r=1}^{G} P(y = r | x_i)^{z_i^r}
50
PD
Taking log, we get the log-likelihood (which is easier to compute):
l(β) = Σ_{i=1}^{N} [ z_i^1 log P(y = 1 | x_i) + z_i^2 log P(y = 2 | x_i) + ... + z_i^G log P(y = G | x_i) ] = Σ_{i=1}^{N} Σ_{r=1}^{G} z_i^r log P(y = r | x_i)
51
PD
Next, remembering that log P(y = r | x_i) = g_r(x_i) − c(β, x_i) and that Σ_{r=1}^{G} z_i^r = 1, we get that
l(β) = Σ_{i=1}^{N} [ Σ_{r=1}^{G} z_i^r g_r(x_i) − c(β, x_i) ]
Now, this expression can be maximized to achieve the MLE using the Newton-Raphson method.
The MLE can fail to exist: by Albert and Anderson (1984), it does not exist when there is a vector b = (b_1^T, ..., b_G^T)^T that completely separates the classes, i.e. (b_r − b_j)^T x_i > 0 for every i in C_r and every j ≠ r, where C_r is the index set identifying all samples in class r.
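The log-likelihood in the form above is straightforward to evaluate. A minimal numpy sketch, assuming the non-baseline classes are stacked in β row by row and g_r(x) = β_r^T x (function and variable names are mine); Newton-Raphson itself is omitted:

```python
import numpy as np

def log_likelihood(beta, X, Z):
    """l(beta) = sum_i [ sum_r z_i^r * g_r(x_i) - c(beta, x_i) ],
    with g_r(x) = beta_r^T x and c(beta, x) = log(1 + sum_r exp g_r(x)).
    beta: (G, p+1) rows for the non-baseline classes;
    X:    (N, p+1) samples with leading 1 for the intercept;
    Z:    (N, G) class indicators (all-zero row = baseline class)."""
    G = X @ beta.T                       # G[i, r] = g_r(x_i)
    c = np.log1p(np.exp(G).sum(axis=1))  # c(beta, x_i)
    return float(((Z * G).sum(axis=1) - c).sum())

# Tiny example: 2 non-baseline classes, intercept-only model (x_i = [1]).
X = np.ones((4, 1))
Z = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])
beta = np.zeros((2, 1))  # g_r = 0 for all r, so each sample contributes -log 3
ll = log_likelihood(beta, X, Z)
print(ll)  # -4*log(3) ≈ -4.394
```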
52
Appendix References
Article appendices: http://dnguyen.ucdavis.edu/.html/SUP_cla2/SupplementalAppendix.pdf
Newton Raphson method - http://en.wikipedia.org/wiki/Newton-Raphson_method
On the Existence of Maximum Likelihood Estimates in Logistic Regression Models. (A. Albert; J. A. Anderson 1984). http://www.qiji.cn/eprint/abs/2376.html
53
Abstract
This presentation deals with multi-class cancer classification: the process of classifying samples into multiple types of cancer. The article describes a 3-phase algorithm scheme to demonstrate the classification. The 3 phases are gene screening, dimension reduction, and classification. We present one example of a gene screening method, one example of a dimension reduction method (MPLS), and two classification methods (PD and QDA), which we then compare. The presentation also presents concepts like class, multi-class, t-test, and LOOCV.