Diagnosis of multiple cancer types by shrunken centroids of gene expression
DESCRIPTION
Diagnosis of multiple cancer types by shrunken centroids of gene expression. By Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Course: 550.635 Topics in Bioinformatics. Presenter: Ting Yang. Teacher: Professor Geman. Nearest Centroid Classification.
Diagnosis of multiple cancer types by shrunken centroids of gene expression
Course: 550.635 Topics in Bioinformatics
Presenter: Ting Yang
Teacher: Professor Geman
By Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu
Nearest Centroid Classification
Example: small round blue cell tumors of childhood
• 63 training samples, 25 testing samples
• 4 classes: BL, EWS, NB, RMS
• Figure 1
• Nearest centroid classification
• Disadvantage
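As a rough illustration of nearest centroid classification, here is a minimal Python sketch. It assumes plain squared Euclidean distance (the slides do not state which distance is used), and the function names and toy expression values are invented for the example.

```python
def class_centroids(samples, labels):
    """Average the expression vectors of each class, gene by gene."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def nearest_centroid(x, centroids):
    """Return the label whose centroid has the smallest squared distance to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


# Toy data: 2 genes, 2 of the 4 class labels (values are invented).
train = [[1.0, 0.0], [1.2, 0.2], [-1.0, 0.1], [-0.8, -0.1]]
labels = ["BL", "BL", "EWS", "EWS"]
centroids = class_centroids(train, labels)
print(nearest_centroid([0.9, 0.0], centroids))  # prints BL
```

Every gene contributes to the distance here, whether or not it carries any class information.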
Nearest Shrunken Centroids
• A modification of the nearest centroid method
• Idea: first standardize each class centroid by the within-class standard deviation of each gene, then shrink each class centroid toward the overall centroid.
Details:
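$d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k \, (s_i + s_0)}$
• $\bar{x}_{ik}$: mean expression value in class k for gene i
• $\bar{x}_i$: ith component of the overall centroid
• $s_i$: pooled within-class standard deviation for gene i
• $d_{ik}$: t-statistic
• $m_k = \sqrt{1/n_k + 1/n}$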
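The statistic d_ik can be computed directly from a labelled expression matrix. The sketch below is one possible reading, not the authors' code: it assumes the pooled variance uses n - K degrees of freedom, takes m_k = sqrt(1/n_k + 1/n) (some later presentations of the method use 1/n_k - 1/n instead), and treats s_0 as a user-supplied positive constant. All names and toy values are illustrative.

```python
import math


def shrinkage_statistics(samples, labels, s0=0.0):
    """Return d[k][i], the standardized centroid difference for class k, gene i."""
    n, p = len(samples), len(samples[0])
    classes = sorted(set(labels))
    overall = [sum(x[i] for x in samples) / n for i in range(p)]
    members = {k: [x for x, y in zip(samples, labels) if y == k] for k in classes}
    centroid = {k: [sum(x[i] for x in xs) / len(xs) for i in range(p)]
                for k, xs in members.items()}
    # Pooled within-class standard deviation s_i (assumed n - K divisor).
    s = []
    for i in range(p):
        ss = sum((x[i] - centroid[k][i]) ** 2
                 for k, xs in members.items() for x in xs)
        s.append(math.sqrt(ss / (n - len(classes))))
    d = {}
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) + 1.0 / n)  # assumed sign convention
        d[k] = [(centroid[k][i] - overall[i]) / (m_k * (s[i] + s0))
                for i in range(p)]
    return d


# Toy data (invented): gene 0 separates the classes, gene 1 barely does.
train = [[1.0, 0.0], [1.2, 0.2], [-1.0, 0.1], [-0.8, -0.1]]
labels = ["BL", "BL", "EWS", "EWS"]
d = shrinkage_statistics(train, labels)
print(d["BL"], d["EWS"])
```

In this toy case the statistic for gene 0 is much larger in absolute value than the one for gene 1.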
t-statistic: $d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k \, (s_i + s_0)}$
• It measures the difference between the mean expression of gene i in class k and the mean expression of gene i in all classes combined.
• Idea: a gene that discriminates one class from the rest will have a statistic of large absolute value.
• Shrink it toward zero to eliminate the genes that do not provide sufficient information.
• ‘De-noising’ step
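$d'_{ik} = \mathrm{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_+$, where $\Delta$ is the amount of shrinkage and $(\cdot)_+$ denotes the positive part.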
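The de-noising step is soft thresholding: each statistic is moved toward zero by a fixed amount and set to exactly zero if it does not exceed that amount. A minimal sketch follows. The helper `shrunken_centroid` assumes the centroid is rebuilt as overall centroid plus m_k (s_i + s_0) times the thresholded statistic, which the slides do not spell out; names and values are illustrative.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values with |d| <= delta become 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def shrunken_centroid(overall, scale, d_row, delta):
    """Rebuild one class centroid from thresholded statistics.

    scale[i] is assumed to hold m_k * (s_i + s_0) for the class at hand.
    """
    return [o + sc * soft_threshold(d, delta)
            for o, sc, d in zip(overall, scale, d_row)]


print([soft_threshold(d, 1.0) for d in [3.0, 0.5, -2.5]])  # [2.0, 0.0, -1.5]
```

A gene whose statistic is thresholded to zero in every class has the same shrunken centroid value in all classes, so it no longer influences the classification.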
Choosing the amount of shrinkage
• Shrinkage amount is allowed to vary over a wide range.
• 10-fold cross-validation (choose the shrinkage amount that has the smallest error rate)
• Divide the set of samples at random into 10 equal-size parts
(classes are distributed proportionally among the 10 parts).
• Fit the model on 90% of the samples, then predict the class labels of the remaining 10% (the test samples).
• Repeat 10 times and add the errors together (overall error).
• Figure 2
• Figure 1
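The cross-validation procedure for choosing the shrinkage amount can be sketched generically. The code below is an illustration under assumptions: `fit` and `predict` are placeholder callables supplied by the caller, the round-robin fold assignment is one simple way to keep classes roughly proportional, and the demo uses a trivial majority-class model as a stand-in for the shrunken centroid classifier.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample index to a fold, spreading every class across folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        for idx in indices:
            folds[position % n_folds].append(idx)
            position += 1
    return folds


def cv_errors(samples, labels, fit, predict, n_folds=10, seed=0):
    """Total number of misclassified held-out samples over all folds."""
    errors = 0
    for held_out in stratified_folds(labels, n_folds, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(samples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = fit(train_x, train_y)
        errors += sum(predict(model, samples[i]) != labels[i] for i in held_out)
    return errors


# Demo with a majority-class stand-in model (not the actual classifier).
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)


def predict_majority(model, x):
    return model


labels = ["A"] * 12 + ["B"] * 8
samples = [[float(i)] for i in range(20)]
total_errors = cv_errors(samples, labels, fit_majority, predict_majority)
print(total_errors)  # 8: every "B" sample is misclassified by the stand-in
```

To choose the shrinkage amount, one would call `cv_errors` once per candidate value with a `fit` that applies that value, and keep the value with the smallest total.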
More Figures
• Figure 3
• Figure 4
Classification
• A new sample is classified by comparing its expression profile with each shrunken centroid, over those 43 active genes.
• Distance function: prior information included.
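One way to write a distance that includes prior information is a standardized squared distance to each shrunken centroid minus twice the log of the class prior. The sketch below assumes that form; the centroids, standard deviations, s_0, and priors are invented numbers, and the function names are illustrative.

```python
import math


def discriminant_score(x, centroid, s, s0, prior):
    """Standardized squared distance to a centroid, corrected by the class prior."""
    dist = sum(((xi - ci) / (si + s0)) ** 2
               for xi, ci, si in zip(x, centroid, s))
    return dist - 2.0 * math.log(prior)


def classify(x, centroids, s, s0, priors):
    """Pick the class with the smallest discriminant score."""
    return min(centroids,
               key=lambda k: discriminant_score(x, centroids[k], s, s0, priors[k]))


centroids = {"BL": [1.0, 0.0], "EWS": [-1.0, 0.0]}  # invented shrunken centroids
s, s0 = [0.5, 0.5], 0.5
x_new = [0.1, 0.3]
equal = classify(x_new, centroids, s, s0, {"BL": 0.5, "EWS": 0.5})
skewed = classify(x_new, centroids, s, s0, {"BL": 0.05, "EWS": 0.95})
print(equal, skewed)  # prints BL EWS
```

With equal priors the sample goes to the closer centroid; a strongly skewed prior can overturn a small difference in distance.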
Statistical details:
• t-statistic: $d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k \, (s_i + s_0)}$
• Estimates of the class probabilities (Figure 5)
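Class probability estimates can be derived from per-class discriminant scores in the same way as in Gaussian discriminant analysis, by normalizing exp(-score / 2) over the classes. The sketch below assumes that form and uses made-up scores for the four classes.

```python
import math


def class_probabilities(scores):
    """Turn discriminant scores (smaller = closer) into probabilities summing to 1."""
    best = min(scores.values())  # subtract the minimum for numerical stability
    weights = {k: math.exp(-0.5 * (v - best)) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Invented scores for one test sample.
probs = class_probabilities({"BL": 0.9, "EWS": 1.3, "NB": 6.0, "RMS": 9.0})
for label, p in probs.items():
    print(label, round(p, 3))
```

A sample whose largest probability is only slightly above the others is a less confident call than one with a single dominant class.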