Robust Prediction of Cancer Disease Using Pattern
Classification of Microarray Gene-Expression Data
Presented by-
Md. Mushfiqur Rahman
Researcher
Bioinformatics Lab.
Dept. of Statistics, R.U.
Md. Matiur Rahaman1,2, Md. Mushfiqur Rahman2, Md. Nurul Haque Mollah2 and Ming Chen1
1. Department of Bioinformatics, College of Life Sciences, Zhejiang University, Zijingang Campus, Hangzhou 310058, China.
2. Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh.
International Conference on Applied Statistics (ICAS)
The Institute of Statistical Research and Training (ISRT)
University of Dhaka, Dhaka, 27-29 December 2014
Supported by HEQEP (CP-3603.R3.W2) and Bioinformatics Lab., Dept. of Statistics, RU.
Welcome to the presentation.
Outline
1. Introduction to Gene-Expression Data.
2. Robust Classifier.
3. Performance Investigation of Robust Classifiers using Simulated Data.
4. Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease.
5. Performance Investigation using Real Gene-Expression Profile for Prediction of Cancer Disease.
6. Conclusion.
Introduction to Gene-Expression Data
• The expression levels of an individual's genes, as measured by a microarray, are called gene-expression data. Each data point produced by a DNA microarray hybridization experiment represents the ratio of the expression levels of a particular gene under two different experimental conditions.
Gene Expression
Microarray Technology and Gene Expression Data
Example of Gene-Expression Data
Genes    mRNA samples
         sample1   sample2   sample3   sample4   sample5   …
1          0.46      0.30      0.80      1.51      0.90    ...
2         -0.10      0.49      0.24      0.06      0.46    ...
3          0.15      0.74      0.04      0.10      0.20    ...
4         -0.45     -1.03     -0.79     -0.56     -0.32    ...
5         -0.06      1.06      1.35      1.09     -1.09    ...
Gene expression level of gene i in mRNA sample j = log(Red intensity / Green intensity)
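As a minimal sketch of the log-ratio above, the following Python function turns red and green channel intensities into expression values. The slide does not state the base of the logarithm; base 2 is assumed here because it is common for two-colour arrays, and the function name `log_ratio` and the sample intensities are illustrative only.

```python
import numpy as np

def log_ratio(red, green, base=2.0):
    """Expression level = log(red intensity / green intensity).

    The base is an assumption of this sketch (the slide only says "log").
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return np.log(red / green) / np.log(base)

# Hypothetical intensities for three spots (not taken from the slides).
example = log_ratio([200.0, 100.0, 50.0], [100.0, 100.0, 100.0])
```

With base 2, a value of +1 means the red channel is twice as intense as the green channel, 0 means equal intensity, and -1 means half.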
A Complete Workflow for Gene-Expression Data Analysis
• Workflow for the classification of real microarray gene-expression data:
Different Classification
• Unsupervised classification (Clustering)
  Hierarchical Clustering
    Divisive Methods (Top-Down)
    Agglomerative Methods (Bottom-Up)
      1. Single Linkage Clustering / Nearest Neighbor Technique
      2. Complete Linkage Clustering
      3. Average Linkage Clustering
      4. Ward's Hierarchical Clustering
      5. Centroid Method
      6. Median Method
      7. And so on
  Partition-Based Clustering
      1. K-Means Clustering
      2. Fuzzy Clustering
      3. Model-Based Clustering
      4. And so on
• Supervised classification
      1. Bayes classifier
      2. Maximum likelihood classifier
      3. FLDA (Fisher Linear Discriminant Analysis)
      4. SVM (Support Vector Machines)
      5. Decision Trees
      6. K-NN (K-Nearest Neighbors)
      7. AdaBoost
      8. Robust Classifier (Proposed)
      9. And so on
Bayes Classifier
Bayes classifier: assigns each object to the class with the highest posterior probability.
Foundation: Bayes' theorem.
A short note on the Bayes classifier under normal populations
Let $\pi_1, \ldots, \pi_m$ be $m$ normal populations.
Let $\{x_i^{(k)} \sim N_p(\mu^{(k)}, V^{(k)});\ i = 1, 2, \ldots, N_k;\ k = 1, 2, \ldots, m\}$ be the training data set.
The objective is to classify a new data vector (or test data vector) $x$ into one of the $m$ populations $\pi_1, \ldots, \pi_m$.
Let the prior probability of $\pi_k$ be $q_k$, which is known.
Then the posterior probability of $\pi_k$ given $x$ is defined by
$$P(\pi_k \mid x) = \frac{q_k f_k(x)}{\sum_{j=1}^{m} q_j f_j(x)},$$
where $f_k(x)$ is the pdf of $\pi_k$, that is, the $N_p(\mu^{(k)}, V^{(k)})$ density.
Bayes Classifier (Cont…)
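Then the classification region $R_k$ for classifying $x$ to the population $\pi_k$ is defined as follows:
$$R_k = \{x : q_k f_k(x) > q_j f_j(x) \text{ for all } j \neq k\} = \{x : u_k(x) > u_j(x) \text{ for all } j \neq k\},$$
where the discriminant function is
$$u_k(x) = \ln q_k - \tfrac{1}{2} \ln\left|V^{(k)}\right| - \tfrac{1}{2} \left(x - \mu^{(k)}\right)^T \left(V^{(k)}\right)^{-1} \left(x - \mu^{(k)}\right).$$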
This is known as the Bayes classifier for assigning an object x to the population πk.
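The classical Bayes rule for normal populations can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, class labels are assumed to be integers, the unknown means and covariances are replaced by their classical MLEs, and the priors are estimated from class proportions (the slides treat them as known).

```python
import numpy as np

def fit_classes(X, y):
    """Return {label: (mean, covariance, prior)} using classical MLEs.

    Assumptions of this sketch: integer labels, priors taken as class proportions.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for label in np.unique(y):
        Xk = X[y == label]
        mean = Xk.mean(axis=0)
        cov = np.cov(Xk, rowvar=False, bias=True)  # MLE: divides by N_k
        params[int(label)] = (mean, cov, len(Xk) / len(X))
    return params

def discriminant(x, mu, V, q):
    """Log of q * normal density at x, up to a constant shared by all classes."""
    diff = np.asarray(x, dtype=float) - mu
    _, logdet = np.linalg.slogdet(V)
    return np.log(q) - 0.5 * logdet - 0.5 * diff @ np.linalg.solve(V, diff)

def bayes_classify(x, params):
    """Assign x to the class with the largest discriminant score."""
    return max(params, key=lambda k: discriminant(x, *params[k]))
```

Because the mean and covariance here are ordinary MLEs, a few gross outliers in the training data can shift them substantially, which is the weakness the robust version is meant to address.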
Bayes Classifier (Cont…)
• The traditional Bayes procedure may produce misleading results in the presence of outliers in the training dataset, the test dataset, or both.
• To improve the results, one can replace the MLEs with robust estimators such as the MVE (Rousseeuw et al., 1985), MCD (Rousseeuw et al., 1985) and OGK (Maronna and Zamar, 2002) estimators.
• However, these robust procedures do not perform well on high-dimensional datasets. They also may not control the influence of a contaminated test vector (x).
• To overcome these problems, we robustify the Bayes procedure using the minimum β-divergence method (Mollah et al., 2007, 2010). This is our proposed method.
Robust Bayes classifier
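• The minimum β-divergence estimators $\hat{\mu}_\beta^{(k)}$ and $\hat{V}_\beta^{(k)}$ of the mean vector $\mu^{(k)}$ and the covariance matrix $V^{(k)}$, respectively, are obtained iteratively as follows:
$$\mu_{r+1}^{(k)} = \frac{\sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right) x_i^{(k)}}{\sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right)}$$
and
$$V_{r+1}^{(k)} = \frac{\sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right) \psi\left(x_i^{(k)}; \mu_r^{(k)}\right)}{(\beta + 1)^{-1} \sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right)},$$
where
• $\phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right) = \exp\left\{-\frac{\beta}{2} \left(x_i^{(k)} - \mu_r^{(k)}\right)^T \left(V_r^{(k)}\right)^{-1} \left(x_i^{(k)} - \mu_r^{(k)}\right)\right\}$ is the β-weight function and $\psi\left(x_i^{(k)}; \mu_r^{(k)}\right) = \left(x_i^{(k)} - \mu_r^{(k)}\right)\left(x_i^{(k)} - \mu_r^{(k)}\right)^T$.
• For β = 0, these estimators reduce to the classical non-iterative estimates.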
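The iteration above might be implemented along the following lines. This is a sketch under stated assumptions: the starting values (coordinate-wise median and sample covariance), the stopping rule, and the function names are choices made here for illustration and are not specified on the slides.

```python
import numpy as np

def beta_weights(X, mu, V, beta):
    """Beta-weight exp(-beta/2 * squared Mahalanobis distance) for each row of X."""
    diff = X - mu
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(V, diff.T).T)
    return np.exp(-0.5 * beta * d2)

def min_beta_divergence(X, beta=0.2, max_iter=200, tol=1e-8):
    """Iterate the weighted mean and covariance updates for one class."""
    X = np.asarray(X, dtype=float)
    # Starting values are an assumption of this sketch.
    mu = np.median(X, axis=0)
    V = np.cov(X, rowvar=False, bias=True)
    for _ in range(max_iter):
        w = beta_weights(X, mu, V, beta)
        diff = X - mu  # psi is evaluated at the current mean mu_r
        mu_new = w @ X / w.sum()
        V_new = (1.0 + beta) * (diff.T * w) @ diff / w.sum()
        change = max(np.abs(mu_new - mu).max(), np.abs(V_new - V).max())
        mu, V = mu_new, V_new
        if change < tol:
            break
    return mu, V
```

With `beta=0` every weight equals 1, so the updates return the sample mean and the MLE covariance. With `beta > 0`, observations far from the current centre receive weights close to 0 and contribute little to either estimate.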
Robust Bayes Classifier (Cont…)
• The β-weight function plays a significant role in the robustification of the Bayes classifier, as discussed below.
• Step 1: First, we calculate the β-weight of the test vector (x) using the β-weight function, and then we construct a criterion to test whether the data vector is contaminated.
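To illustrate Step 1, the sketch below computes the β-weight of a test vector against each class. The exact contamination criterion is not reproduced in this text, so the rule used here (flag x when its β-weight falls below a threshold `delta` for every class) is purely a hypothetical stand-in, as are the function names and the threshold value.

```python
import numpy as np

def beta_weight(x, mu, V, beta):
    """Beta-weight of a single vector x for a class with mean mu and covariance V."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    d2 = diff @ np.linalg.solve(np.asarray(V, dtype=float), diff)
    return float(np.exp(-0.5 * beta * d2))

def looks_contaminated(x, class_params, beta, delta):
    """Hypothetical rule: x is flagged if its beta-weight is below delta for all classes.

    class_params is a list of (mean, covariance) pairs, one per class.
    """
    weights = [beta_weight(x, mu, V, beta) for mu, V in class_params]
    return max(weights) < delta
```

The weight equals 1 at the class centre and decays towards 0 as the Mahalanobis distance grows, so a test vector that is far from every class centre receives a small weight for every class.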
Robust Bayes Classifier (Cont…)
Step 2: If the unclassified data vector x is contaminated by outliers, we calculate the absolute difference between each component of the contaminated vector and the corresponding component of each mean vector:
$$d_{ki} = \left|x_i - \hat{\mu}_{i,\beta}^{(k)}\right|, \quad i = 1, 2, \ldots, p.$$
Compute the sum of the smallest r components of $d_k$ as
$$S_k = d_{k(1)} + d_{k(2)} + \cdots + d_{k(r)},$$
where $d_{k(1)} \le d_{k(2)} \le \cdots \le d_{k(p)}$ are the ordered components of $d_k$ and r = round(p/2). Then find the tentative class or population for the unclassified data vector x as
$$\hat{k} = \arg\min_k S_k.$$
Then some or all of the components of the unclassified contaminated data vector x corresponding to $d_{k(r+1)}, d_{k(r+2)}, \ldots, d_{k(p)}$ for the tentative class $k = \hat{k}$ are assumed to be corrupted by outliers.
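Step 2 can be sketched in Python as follows. The function name and return values are illustrative, classes are indexed from 0, and Python's `round` (which rounds halves to the nearest even integer) is assumed for r because the rounding convention for odd p is not stated.

```python
import numpy as np

def tentative_class(x, robust_means):
    """Trimmed-distance rule: pick the class whose r smallest |x_i - mu_i| sum is least.

    robust_means has shape (m, p): one robust mean vector per class.
    Returns (k_hat, S, suspect), where suspect holds the indices of the
    components of x treated as possibly corrupted.
    """
    x = np.asarray(x, dtype=float)
    means = np.asarray(robust_means, dtype=float)
    p = x.size
    r = round(p / 2)  # assumption: Python rounding for odd p
    d = np.abs(x - means)  # d[k, i] = |x_i - mu_i^(k)|
    order = np.argsort(d, axis=1, kind="stable")
    S = np.take_along_axis(d, order[:, :r], axis=1).sum(axis=1)
    k_hat = int(np.argmin(S))
    suspect = np.sort(order[k_hat, r:])  # components with the p - r largest differences
    return k_hat, S, suspect

# Hypothetical example: the third component of x is grossly corrupted.
k_hat, S, suspect = tentative_class([0.1, -0.2, 100.0, 0.0],
                                    [[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
```

Summing only the r smallest differences keeps a few wildly corrupted components from dominating the choice of tentative class.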
Performance Investigation of Robust Classifiers using Simulated Data
[Figure panels: no contamination; both training and test data contaminated]
Application of the Proposed Method for Gene Expression Data Analysis
Gene Expression Data Generating Model
Nowak and Tibshirani (2008), Biostatistics, 9(3), 467-483.
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
Two Class Gene Classification (Absence of Outliers)
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
Two Class Gene Classification (Presence of Outliers)
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
Three Class Gene Classification (Absence of Outliers)
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
Three Class Gene Classification (Absence of Outliers)
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
(No Contamination)
Box Plot For Cancer Individuals Classification
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
(Train Data Contamination)
Box Plot For Cancer Individuals Classification
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
(Test Data Contamination)
Box Plot For Cancer Individuals Classification
Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease
(Both Data Contamination)
Box Plot For Cancer Individuals Classification
Real Gene Expression Data Analysis: Head and Neck Cancer Data (Kuriakose et al., 2004)
12,625 genes, 22 normal patients, 22 cancer patients
594 differentially expressed (DE) genes out of 12,625 genes, identified by the EBarrays method
Training gene set: ½ of the DE genes
Performance Investigation using Real Gene-Expression Profile for Prediction of Cancer Disease
(In absence of outlier)
Performance Investigation using Real Gene-Expression Profile for Prediction of Cancer Disease
(In Presence of outlier)
Conclusion
The Bayes procedure is a popular tool for classification. However, the traditional Bayes procedure is very sensitive to outliers, so we discussed a robustification of the Bayes procedure by the minimum β-divergence method (Mollah et al., 2007, 2010).
We compared the proposed method with some popular classification methods used for microarray gene-expression data analysis (SVM, KNN, AdaBoost) on simulated datasets, and we observed that the proposed method performs better than all of these methods.
We checked the performance of the proposed method on both simulated and real gene-expression data. The simulation and real-data results show that the proposed method significantly improves on the traditional Bayes method in the presence of outliers; otherwise, the two perform equally.
References
Anderson, T.W. (2003): An Introduction to Multivariate Statistical Analysis, Wiley-Interscience.
Johnson, R.A. and Wichern, D.W. (2007): Applied Multivariate Statistical Analysis, Sixth edition, Prentice-Hall.
Mollah, M.N.H., Minami, M. and Eguchi, S. (2007): Robust prewhitening for ICA by minimizing beta-divergence and its application to FastICA. Neural Processing Letters, 25(2), pp. 91-110.
Mollah, M.N.H., Sultana, N., Minami, M. and Eguchi, S. (2010): Robust extraction of local structures by the minimum β-divergence method. Neural Networks, 23, pp. 226-238.
Wang, S., Gui, J. and Li, X. (2008): Factor analysis for cross-platform tumor classification based on gene expression profiles. Journal of Circuits, Systems, and Computers, 19, pp. 243-258.
Wuju, L. and Momiao, X. (2002): Tumor classification system based on gene expression profile. Bioinformatics, 18(2), pp. 325-326.
Wright, G., Tan, B., Rosenwald, A., Hurt, E., Wiestner, A. and Staudt, L. (2003): A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci USA, 100, pp. 9991-9996.
Nowak, G. and Tibshirani, R. (2008): Complementary hierarchical clustering. Biostatistics, 9(3), pp. 467-483.
van 't Veer, L.J. et al. (2002): Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, pp. 530-536.
Thank you for listening.