Principal component analysis based unsupervised feature extraction applied to bioinformatics analysis
Yh. Taguchi
Department of Physics
Chuo University
Tokyo
Japan
1. What is PCA based unsupervised FE?
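Input: an N × M matrix X of numerical values (N features, M samples). The samples may belong to categorical multiple classes.
In contrast to the usual usage of PCA, not samples but features are embedded into a Q dimensional space.
[Figure: PCA of X. PC loadings are attributed to samples and PC scores to features. In the PC1 vs PC2 scatter plot of PC scores, each "+" is a feature. Label: no distinction between classes.]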
Synthetic example
Two classes of 10 samples each; 90 features and 10 features. Values follow normal distributions (μ: mean, ½: SD); the figure labels the blocks N(0), N(…) and [N(…)+N(0)]/2.
[Figure: PC1 vs PC2 scatter plot of PC scores; +: top 10 outliers.]
Thus, extracting outliers selects features distinct between the two classes in an unsupervised way.
Accuracy (100 trials): 89.5% (…), 52.6% (…)
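The synthetic example can be sketched in Python as follows. This is a minimal illustration, not the code behind the slide: the mean shift `mu = 2.0`, the SD of 0.5, the random seed and the function name `select_outlier_features` are assumptions, because the exact parameters are not legible in the transcript. Features (rows) are embedded by PCA, and the 10 features with the largest absolute PC1 score are taken as outliers.

```python
import numpy as np

def select_outlier_features(X, n_select=10, pc=0):
    """Embed features (rows of X) by PCA and return the indices of the
    n_select features with the largest absolute score on one PC."""
    # Center each sample (column) so that the rows, i.e. features, are the embedded points.
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, pc] * s[pc]   # PC scores, one per feature
    loadings = Vt[pc]           # PC loadings, one per sample
    selected = np.argsort(-np.abs(scores))[:n_select]
    return selected, scores, loadings

# Hypothetical parameters, assumed for illustration only.
rng = np.random.default_rng(0)
n_features, n_per_class, n_distinct, mu, sd = 100, 10, 10, 2.0, 0.5
X = rng.normal(0.0, sd, size=(n_features, 2 * n_per_class))
X[:n_distinct, :n_per_class] += mu  # the first 10 features differ between the two classes

selected, scores, loadings = select_outlier_features(X)
recovered = len(set(selected.tolist()) & set(range(n_distinct))) / n_distinct
print(f"fraction of class-distinct features recovered: {recovered:.2f}")
```

No class labels are passed to `select_outlier_features`; the labels are used only afterwards, to check which features were recovered.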
2. Example 1: Identification of cell cycle regulated genes from gene expression
(Yh. Taguchi, BioData Mining, 2016)
AIM:
Identification of genes that contribute to the periodic cell division cycle. The period is known, but which periodic function? Usually a sinusoidal function is assumed and sinusoidal regression is used to select genes. But is this correct?
[Figure: cell division cycle.]
Synthetic example
[Figure legend: blue: original sinusoidal function; red: later; black: sinusoidal function + periodic function.]
Each function is periodic in t with period T = 25.
100 time points (4 periods) × 10^4 genes
100 genes: random linear combinations of the two black curves
9,900 genes: noise
Task: identify the 100 genes with no information
Parameter ranges: C ∈ [−A, A], δ_i ∈ [0, 2π], ε_ij ∈ [−1, 1]
Generation procedure (i: genes, j: time):
Base: sinusoidal wave → addition of periodic functions → orthogonalization
Gene expression: vector generation of 100 random periodic and 9,900 pure noise → normalization
[Figure: PC scores (genes) for the 100 genes and the 9,900 genes, and PC loadings (time) (cf. the red lines two slides before). Order: black > red > green > blue.]
Assumption: PC scores obey a Gaussian distribution
P-values: by χ squared distribution → adjusted by Benjamini-Hochberg → Q-value
→ Q<0.01
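The selection rule on this slide can be sketched with the standard library only. This is an illustration under assumptions, not the author's implementation: it uses a single PC, standardizes the scores by their SD, and the function names (`chi2_pvalues_1df`, `benjamini_hochberg`, `select_outliers`) are hypothetical. With one PC the squared standardized score is compared with a χ squared distribution with 1 degree of freedom, whose upper tail can be written with `math.erfc`.

```python
import math

def chi2_pvalues_1df(scores):
    """P-values for one PC: (standardized score)^2 against chi-squared, 1 d.o.f."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    # For 1 d.o.f., P(chi2 > z^2) = erfc(|z| / sqrt(2)).
    return [math.erfc(abs((s - mean) / sd) / math.sqrt(2.0)) for s in scores]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (Q-values), in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so that Q-values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        q[i] = running_min
    return q

def select_outliers(scores, q_threshold=0.01):
    """Indices of features whose Q-value is below the threshold."""
    q = benjamini_hochberg(chi2_pvalues_1df(scores))
    return [i for i, qi in enumerate(q) if qi < q_threshold]

# Toy scores: 999 small values and one large outlier at index 999.
toy_scores = [0.1 * ((i % 7) - 3) for i in range(999)] + [50.0]
print(select_outliers(toy_scores))
```

When several PCs are used together, the squared standardized scores would be summed and compared with a χ squared distribution with as many degrees of freedom as PCs; that case needs a general χ squared tail function and is left out here.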
Comparison between sinusoidal fitting & PCA based unsupervised FE
A: ratio of periodic function to sinusoidal function
[Figure: results for PCA and for sinusoidal fitting; axis label: Q.]
Real data: identification of cell cycle regulated genes of budding yeast
Synchronization is required
Strategy 1: food restriction (metabolic cycle)
Scatter plot of PC1 to PC4 loadings (time)
Numbers are winding numbers around the center.
Consider PC2/PC3
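A winding number around a center can be computed from a pair of PC loadings by accumulating the change of angle along the trajectory. The sketch below is an assumption about how such a number could be obtained, not the procedure used for the slide; the function name `winding_number`, the choice of the origin as center, and the toy circular trajectory with period 25 are all hypothetical.

```python
import math

def winding_number(xs, ys, center=(0.0, 0.0)):
    """Net number of turns of the trajectory (xs[t], ys[t]) around center."""
    cx, cy = center
    angles = [math.atan2(y - cy, x - cx) for x, y in zip(xs, ys)]
    total = 0.0
    for a0, a1 in zip(angles, angles[1:]):
        d = a1 - a0
        # Wrap each step into (-pi, pi] so that crossing the branch cut is harmless.
        while d > math.pi:
            d -= 2.0 * math.pi
        while d <= -math.pi:
            d += 2.0 * math.pi
        total += d
    return total / (2.0 * math.pi)

# Toy trajectory: a circle traversed with period 25 over time points 0..100.
period = 25
times = range(101)
loading_x = [math.cos(2.0 * math.pi * t / period) for t in times]
loading_y = [math.sin(2.0 * math.pi * t / period) for t in times]
turns = winding_number(loading_x, loading_y)
print(f"winding number: {turns:.2f}")
```

A pair of loadings that traces a limit cycle gives a winding number close to the number of periods covered, without assuming any functional form for the oscillation.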
Are genes selected with PC2/PC3 biologically feasible?
[Figure: PC scores (genes) and PC loadings (time).]
Black, red and green are selected genes (P<0.01).
Ribosome, mitochondria, cell division → match with the original paper
Differ from sinusoidal wave!
REACTOME (selected by PC1 to PC4)
[Figure: biological feasibility for each fitting.]
PCA based unsupervised FE is better than all fittings using sinusoidal, rectangular and triangular waves in selecting biologically feasible genes. Since PC2 differs from PC3, no periodic fitting can work.
Take home messages:
Gene expression profiles are periodic, but not sinusoidal. Thus, sinusoidal regression might cause artifacts (but there is no way to assume the true functional form a priori!).
The limit cycle can be identified without functional forms or the period.
Three biologically important gene clusters can be identified in an unsupervised way.
→Superiority of PCA based unsupervised FE
3. Example 2: Identification of miRNA-mRNA interaction from gene expression
(Yh. Taguchi, IJMS, 2016)
Difficulty of inference of miRNA-mRNA interaction
* Too many pairs: mRNA 〜 10^4, miRNA 〜 10^3 → pairs 〜 10^7
* Computational prediction is sequence based
How to solve this problem?
Prescreening mRNA/miRNA based on differential expression (DE)
Ex.: functional miRNA-mRNA pairs in disease → mRNA/miRNA with significant DE: normal vs patients
[Figure: mRNA and miRNA expression in normal vs patients; matching by negative correlation.]
Problem: "significant DE" is arbitrary
Screening criteria: P-value + fold change (FC)
P-value: the number of mRNAs/miRNAs, N, is fixed, but the number of samples, M, is variable. M: large → P: small
FC: typical threshold: 2 or ½, but is there any basis?
In real studies... P-value and FC are controlled → good results (feasibility) → no discussions
If biologically feasible, no problem?
(No discussion about P-value and FC)
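The dependence of the P-value on the number of samples can be illustrated with a small calculation. The sketch below assumes a fixed mean difference of 0.5, an SD of 1.0, equal group sizes and a normal approximation to the two-sample test statistic; these values and the function name `two_sample_z_pvalue` are hypothetical and chosen only to show the trend.

```python
import math

def two_sample_z_pvalue(mean_diff, sd, m_per_group):
    """Two-sided P-value for a fixed mean difference between two groups,
    using a normal approximation to the test statistic (an assumption)."""
    standard_error = sd * math.sqrt(2.0 / m_per_group)
    z = mean_diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))

group_sizes = (5, 20, 80, 320)
pvalues_by_size = [two_sample_z_pvalue(0.5, 1.0, m) for m in group_sizes]
for m, p in zip(group_sizes, pvalues_by_size):
    print(f"M per group = {m:4d}  P = {p:.3g}")
```

The same difference between groups moves from not significant to highly significant as M grows, so a fixed P-value threshold selects different sets of mRNAs/miRNAs depending only on the sample size.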
→ "Which ones are DE mRNA/miRNA?"
→ A true answer exists (but is unknown)
→ A data driven strategy can help us
Idea: PCA based unsupervised FE
The number of mRNAs/miRNAs, N, is fixed and M is variable. What is convergent as M → ∞?
⇓
Distributions of PC scores (genes) should converge as M → ∞.
[Figure: gene expression profile, N (mRNA/miRNA) × M samples with M ≪ N, normal and patients.
PC loadings (converge as M → ∞): significance of normal vs patients by t test, P<0.05.
PC scores (mRNAs/miRNAs, PC1 vs PC2): Gaussian (assumed), cf. probabilistic PCA; outliers* are selected.
*: multiple normal + χ2 distribution, BH corrected P-value<0.01]
[Workflow: expression matrix (controls vs patients) → feature embedding → selected mRNAs and miRNAs; sequence based miRNA-mRNA interaction prediction (TargetScan) → miRNA-mRNA pairs; reciprocal pairs are kept.]
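The pairing step can be sketched as a filter over predicted pairs. Everything in this example is hypothetical: the toy names (`miR-a`, `GENE1`, ...), the set `predicted_targets` standing in for the output of a sequence based predictor, the shared sample order for miRNA and mRNA, and the reading of "reciprocal" as "changed in opposite directions between controls and patients".

```python
def mean(values):
    return sum(values) / len(values)

def direction(expr, is_patient):
    """+1 if the mean expression is higher in patients than in controls, else -1."""
    patients = [v for v, p in zip(expr, is_patient) if p]
    controls = [v for v, p in zip(expr, is_patient) if not p]
    return 1 if mean(patients) > mean(controls) else -1

def reciprocal_pairs(mirna_expr, mrna_expr, is_patient, predicted_targets):
    """Keep predicted (miRNA, mRNA) pairs where both were selected and the
    two change in opposite directions between controls and patients."""
    pairs = []
    for mirna, mrna in sorted(predicted_targets):
        if mirna in mirna_expr and mrna in mrna_expr:
            d_mirna = direction(mirna_expr[mirna], is_patient)
            d_mrna = direction(mrna_expr[mrna], is_patient)
            if d_mirna != d_mrna:
                pairs.append((mirna, mrna))
    return pairs

# Toy data: three controls followed by three patients.
is_patient = [False, False, False, True, True, True]
mirna_expr = {"miR-a": [1.0, 1.1, 0.9, 3.0, 3.2, 2.9]}
mrna_expr = {
    "GENE1": [5.0, 5.2, 4.9, 2.0, 2.1, 1.8],  # down in patients
    "GENE2": [1.0, 1.1, 0.9, 2.0, 2.2, 2.1],  # up in patients
}
predicted_targets = {("miR-a", "GENE1"), ("miR-a", "GENE2"), ("miR-b", "GENE1")}
pairs = reciprocal_pairs(mirna_expr, mrna_expr, is_patient, predicted_targets)
print(pairs)
```

Only miRNAs and mRNAs that passed the outlier selection enter `mirna_expr` and `mrna_expr`, so the number of candidate pairs is far below the 10^7 possible pairs.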
Data sets and selections
[Table: six data sets (1 to 6), with numbers of samples (patients, normal) and of probes (selected, not selected); miRNA/mRNA are selected by PCA based unsupervised FE.]
Linear discriminant analysis with PC loadings (LOOCV)
[Tables: LOOCV discrimination results between normal and patient samples.]
Successful discrimination suggests successful selection of DE miRNA/mRNA
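Leave-one-out cross validation of a two-class linear discriminant can be sketched with NumPy as follows. This is a minimal version under assumptions: equal class priors, a pooled covariance with a small ridge term for numerical stability, and toy two-dimensional "loadings" drawn at random. The function names `lda_fit` and `loocv_accuracy` are hypothetical.

```python
import numpy as np

def lda_fit(X, y):
    """Two-class LDA with pooled covariance; returns weights w and offset b."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    scatter = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Pooled covariance plus a small ridge (assumed) to keep it invertible.
    cov = scatter / (len(X) - 2) + 1e-8 * np.eye(X.shape[1])
    w = np.linalg.solve(cov, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)  # equal priors assumed
    return w, b

def loocv_accuracy(X, y):
    """Fraction of samples classified correctly when each is left out in turn."""
    correct = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = lda_fit(X[keep], y[keep])
        predicted = int(X[i] @ w + b > 0)
        correct += int(predicted == y[i])
    return correct / len(y)

# Toy PC loadings: rows are samples, columns are the PCs used for selection.
rng = np.random.default_rng(1)
n = 10
loadings_normal = rng.normal([-3.0, 0.0], 0.5, size=(n, 2))
loadings_patient = rng.normal([3.0, 0.0], 0.5, size=(n, 2))
X_lda = np.vstack([loadings_normal, loadings_patient])
y_lda = np.array([0] * n + [1] * n)
accuracy = loocv_accuracy(X_lda, y_lda)
print(f"LOOCV accuracy: {accuracy:.2f}")
```

In the setting of the slide, the rows would be the real samples and the columns the loadings of the PCs used to select miRNA/mRNA, with 0 for normal and 1 for patients.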
Inferred miRNA/mRNA pairs
[Table columns: cancer, pair, *, †, †]
(numbers): …
*: starBase reports miRNA/mRNA negative correlation (in any cancer)
†: previous studies related to the corresponding cancers
Conclusions in this section:
Successful identification of miRNA/mRNA interactions based on integrated analysis of gene expression and sequence information, for multiple cancers, multiple platforms and multiple research groups, with a single common criterion. The identified miRNAs, mRNAs and miRNA/mRNA pairs are related to cancers, in agreement with previous studies.
Benefits of a single common criterion over multiple cancers:
・Decreased sample bias is expected.
・It can result in "no detection".
・Heterogeneous data sets (mixtures of multiple platforms and research groups) can be analyzed.
Increased research possibilities.
Previous studies:None were interested in...
4. Example 3: Inference of non-small cell lung cancer epigenetic therapy target genes
(Yh. Taguchi et al., BMC Med. Genomics, 2016)
Non-small cell lung cancer (NSCLC) is a lethal cancer whose five-year survival rate is less than 50% at most.
Recently, epigenetic therapy, which targets the epigenetic regulation of genes, has emerged as a new therapeutic strategy for NSCLC.
In contrast to the many in vivo studies, there exists a relatively small number of in vitro studies.
In vitro studies are few because in vitro treatment of NSCLC often fails to be reproduced.
The lack of in vitro studies keeps us from identifying the target genes of epigenetic therapy.
Instead of directly investigating the in vitro effect of epigenetic therapy, we considered reprogrammed NSCLC cell lines, whose epigenetic profiles are expected to be altered.
Samples (GSE35913)
H1 (ES cell), H358 and H460 (NSCLC), IMR90 (human Caucasian fetal lung fibroblast), iPCH358, iPCH460 and iPSIMR90 (reprogrammed cell lines), piPCH358 (redifferentiated iPCH358)
Three biological replicates each: 3 × 8 = 24 samples, for gene expression and for promoter methylation
Aim: identify genes associated with both DE and aberrant promoter methylation (for biological feasibility).
Since this is a categorical multiclass data set, the usual control vs treated strategy is not useful (we do not know which pairs among the eight classes should be compared).
However, PCA based unsupervised FE has the potential to treat it, since it does not require any predefined criterion to select genes.
To identify the PC loadings used for FE, we applied hierarchical clustering to the PC loadings.
[Figure: hierarchical clustering of PC loadings from gene expression and promoter methylation; highlighted pairs: PC3 vs PC3 and PC4 vs PC4.]
We aimed to find PC loadings associated with high correlation between gene expression and promoter methylation.
Distance = |correlation coef.|
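The search for PC loadings shared by the two data types can be sketched as a ranking by absolute correlation. Note the substitution: this example does not perform the hierarchical clustering mentioned in the slides; it only computes the |correlation coefficient| between every pair of loadings and reports the best match. The function name `match_loadings` and the random toy loadings are hypothetical.

```python
import numpy as np

def match_loadings(load_a, load_b):
    """Absolute Pearson correlations between rows of load_a and rows of load_b,
    and for each row of load_a its best match in load_b with the signed r."""
    q_a = load_a.shape[0]
    # corrcoef stacks both sets of rows; keep the cross block only.
    corr = np.corrcoef(load_a, load_b)[:q_a, q_a:]
    abs_corr = np.abs(corr)
    best = abs_corr.argmax(axis=1)
    pairs = [(i, int(j), float(corr[i, j])) for i, j in enumerate(best)]
    return abs_corr, pairs

# Toy loadings: 4 PCs over 24 samples for each data type.
rng = np.random.default_rng(3)
load_expression = rng.normal(size=(4, 24))
# The methylation loadings are permuted, sign-flipped, noisy copies (assumed).
load_methylation = -load_expression[[1, 0, 3, 2]] + rng.normal(scale=0.05, size=(4, 24))

abs_corr, pairs = match_loadings(load_expression, load_methylation)
for i, j, r in pairs:
    print(f"expression PC{i + 1} <-> methylation PC{j + 1}: r = {r:+.3f}")
```

Using the absolute value treats strongly negative correlations as matches too, which is the relevant case when higher promoter methylation accompanies lower expression.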
Although we simply required that PC loadings be highly correlated between gene expression and promoter methylation, this allowed us to identify PC loadings that are DE between NSCLC cell lines and reprogrammed ones.
In addition, both parallel and antiparallel behavior between the two NSCLC cell lines was identified.
This is possible only by employing unsupervised methods.
Method to be compared with: categorical regression based FE (ANOVA)
[Figure: gene expression and promoter methylation plotted against category.]
Features more coincident with multiple categories are selected.
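The comparison method can be sketched as a one-way ANOVA F statistic computed for every feature. The example assumes 8 categories with 3 replicates each, as in the sample description, but the data are random toy values, and the function name `anova_f` is hypothetical. Ranking by F is used here in place of P-values, which for a common design gives the same order.

```python
import numpy as np

def anova_f(X, labels):
    """One-way ANOVA F statistic for every feature (row of X) across categories."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = X.mean(axis=1)
    ss_between = np.zeros(X.shape[0])
    ss_within = np.zeros(X.shape[0])
    for c in classes:
        Xc = X[:, labels == c]
        class_mean = Xc.mean(axis=1)
        ss_between += Xc.shape[1] * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean[:, None]) ** 2).sum(axis=1)
    df_between = len(classes) - 1
    df_within = X.shape[1] - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy data: 50 features, 8 categories x 3 replicates = 24 samples.
rng = np.random.default_rng(2)
labels = np.repeat(np.arange(8), 3)
X_anova = rng.normal(0.0, 0.1, size=(50, 24))
X_anova[0] += labels * 1.0  # feature 0 depends on the category

f_values = anova_f(X_anova, labels)
ranking = np.argsort(-f_values)
print("top feature by F:", int(ranking[0]))
```

Unlike the PCA based approach, this method requires the category labels as input.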
Our strategy
Gene expression → FE → top 300 significant genes
Promoter methylation → FE → top 300 significant genes
→ Genes selected commonly in both FE
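The last step of the strategy is a set intersection. In the sketch below the gene names and P-values are toy values and the helper names `top_k` and `common_genes` are hypothetical; the default `k=300` follows the slide.

```python
def top_k(pvalues, k=300):
    """Names of the k features with the smallest P-values."""
    return set(sorted(pvalues, key=pvalues.get)[:k])

def common_genes(p_expression, p_methylation, k=300):
    """Genes in the top k of both the expression and the methylation ranking."""
    return top_k(p_expression, k) & top_k(p_methylation, k)

# Toy P-values for five genes (k is reduced to 2 for illustration).
p_expression = {"g1": 0.001, "g2": 0.2, "g3": 0.004, "g4": 0.5, "g5": 0.03}
p_methylation = {"g1": 0.002, "g2": 0.001, "g3": 0.3, "g4": 0.6, "g5": 0.04}
shared = common_genes(p_expression, p_methylation, k=2)
print(sorted(shared))
```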
Evaluation of the selected genes (PC3):
(A) Associations with cancer related genes reported by the Gendoo server.
(B) Significant negative correlations (P<0.05) between gene expression and promoter methylation.
(C) At least one study reported a direct/indirect relationship with NSCLC.
(D) At least one study reported a direct/indirect relationship with Wnt/β-catenin signalling pathways.
Conclusions in this talk
PCA based unsupervised FE could
identify periodic genes without knowledge about periodicity,
prescreen DE mRNA/miRNA prior to inference of miRNA-mRNA interactions,
identify genes associated with significant DE and aberrant promoter methylation within a categorical multiclass data set.
Thus, it is very useful....