principal component analysis based unsupervised feature extraction applied to bioinformatics...

Principal component analysis based unsupervised feature extraction applied to bioinformatics analysis

Yh. Taguchi

Department of Physics

Chuo University

Tokyo

Japan

What is PCA based unsupervised FE?

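An N × M matrix X (numerical values) holds N features measured on M samples; the samples may belong to categorical multiple classes.

In contrast to the usual usage of PCA, not samples but features are embedded into a Q-dimensional space.

PC loadings are attributed to the M samples, and PC scores are attributed to the N features.

[Figure: PCA of X. Scatter plot of PC scores (PC1 vs PC2) with points marked "+"; label: "No distinction between classes".]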
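As a minimal sketch of this embedding (not the author's code), the Python snippet below uses NumPy's SVD. It assumes that X is stored as an N × M array with features in rows and that each sample column is centered over the features; the slide does not state the preprocessing, so both choices and the function name `embed_features` are assumptions.

```python
import numpy as np

def embed_features(X, n_components=2):
    """Embed the N features (rows of X) into an n_components-dimensional space.

    Returns (scores, loadings):
      scores   : N x Q array, one point per feature (PC scores)
      loadings : M x Q array, one weight per sample (PC loadings)
    """
    X = np.asarray(X, dtype=float)
    # Assumption: center each sample (column) over the features.
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))      # N = 100 features, M = 20 samples
scores_demo, loadings_demo = embed_features(X_demo, n_components=2)
```

With this convention the centered matrix times the loadings gives back the scores, so each feature becomes one point in the PC1/PC2 plane.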

Synthetic example

Data: two classes of 10 samples each; 90 features plus 10 features.

Entries are drawn from N(0), N(μ), and [N(μ) + N(0)]/2, where N(μ) is a normal distribution with mean μ and SD ½.

[Figure: scatter plot of PC scores (PC1 vs PC2). +: top 10 outliers.]

Thus, extracting outliers selects features distinct between two classes in an unsupervised way.

Accuracy (100 trials): 89.5% (…) and 52.6% (…)
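The synthetic experiment can be imitated with a short simulation. This is a sketch under stated assumptions, not a reproduction of the slide: the 90 features are pure N(0, ½) noise, the 10 features are shifted by μ in the second class only, outliers are ranked by |PC1 score| alone, and all function names are hypothetical. The accuracies it gives are therefore not expected to match the numbers quoted above.

```python
import numpy as np

def simulate_two_class(mu, rng, n_null=90, n_signal=10, n_per_class=10, sd=0.5):
    """Rows: features. Columns: 10 samples of class 1, then 10 of class 2."""
    X = rng.normal(0.0, sd, size=(n_null + n_signal, 2 * n_per_class))
    # Assumed layout: the last n_signal features are shifted by mu in class 2.
    X[n_null:, n_per_class:] += mu
    return X

def top_outliers(X, n_top=10):
    """Indices of the n_top features with the largest |PC1 score|."""
    Xc = X - X.mean(axis=0, keepdims=True)   # center each sample column
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    pc1 = U[:, 0] * s[0]
    return np.argsort(-np.abs(pc1))[:n_top]

def accuracy(mu, n_trials=100, seed=0):
    """Fraction of the 10 class-dependent features recovered, averaged over trials."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        picked = top_outliers(simulate_two_class(mu, rng))
        hits += int(np.sum(picked >= 90))
    return hits / (10 * n_trials)
```

Calling `accuracy(mu)` for a few values of μ shows how recovery degrades as the class difference shrinks toward the noise level.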

2. Example 1: Identification of cell cycle regulated genes from gene expression

(Yh. Taguchi, BioData Mining, 2016)

AIM:

Identification of genes that contribute to the periodic cell division cycle. The period is known, but which periodic function should be used? Usually a sinusoidal form is assumed and sinusoidal regression is used to select genes. But is this correct?

cell division cycle

Synthetic example

[Figure: base functions. Blue: original sinusoidal function. Red: explained later. Black: sinusoidal function + periodic function.]

Periodic functions satisfy f(t) = f(t + T) with T = 25.

100 time points (4 periods) × 10^4 genes

100 genes: random linear combinations of the two black curves

9,900 genes: noise

Task: identify the 100 genes with no prior information

Symbols: C ∈ [−A, A], δ_i ∈ [0, 2π], ϵ_ij ∈ [−1, 1] (i: genes, j: time); the slide also shows ϵ jS and ϵ j.

Generation procedure: sinusoidal wave → addition of periodic functions → orthogonalization → generation of 100 random periodic vectors and 9,900 pure-noise vectors → normalization → gene expression
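The generation procedure (sinusoidal wave, added periodic function, orthogonalization, random combination, normalization) can be sketched in Python as follows. The exact functional forms are not recoverable from the slide, so the second harmonic used as the "periodic function", the uniform noise in [−1, 1], the per-gene normalization, and the name `make_periodic_dataset` are all assumptions made for illustration.

```python
import numpy as np

def make_periodic_dataset(n_periodic=100, n_noise=9900, n_time=100,
                          period=25, seed=0):
    """Return a (n_periodic + n_noise) x n_time matrix; periodic genes come first."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_time)
    # Sinusoidal base function with period T = 25 (4 periods in 100 points).
    base1 = np.sin(2 * np.pi * t / period)
    # Assumed non-sinusoidal base: sinusoid plus a second harmonic.
    base2 = base1 + np.sin(4 * np.pi * t / period + 0.7)
    # Orthogonalization (Gram-Schmidt) of the two base functions.
    b1 = base1 / np.linalg.norm(base1)
    b2 = base2 - (base2 @ b1) * b1
    b2 = b2 / np.linalg.norm(b2)
    # 100 genes: random linear combinations of the two base functions.
    coef = rng.uniform(-1.0, 1.0, size=(n_periodic, 2))
    periodic = coef @ np.vstack([b1, b2])
    # 9,900 genes: pure noise (assumed uniform in [-1, 1]).
    noise = rng.uniform(-1.0, 1.0, size=(n_noise, n_time))
    X = np.vstack([periodic, noise])
    # Normalization: zero mean and unit SD for each gene (assumed).
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return X

X_syn = make_periodic_dataset()
```

The first 100 rows of `X_syn` repeat every 25 time points, while the remaining 9,900 rows carry no periodic structure.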

[Figure: PC scores (genes) for the 100 periodic genes and the 9,900 noise genes, and PC loadings (time). The loadings correspond to the red lines two slides before; colors: black > red > green > blue.]

Assumption: PC scores obey a Gaussian distribution.

P-values: computed from the χ² distribution → adjusted by Benjamini-Hochberg → Q-values → select genes with Q < 0.01
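The selection rule (Gaussian PC scores, χ² P-values, Benjamini-Hochberg adjustment, Q < 0.01) can be sketched as below. To stay NumPy-only, the sketch assumes that exactly two PCs are used, because the χ² survival function with 2 degrees of freedom is simply exp(−x/2); scaling each PC by its own standard deviation is also an assumption, and the function names are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (Q-values)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P-value downwards.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

def select_outliers(scores, q_threshold=0.01):
    """Select features whose two PC scores are unlikely under a Gaussian null."""
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 2 or scores.shape[1] != 2:
        raise ValueError("this sketch expects exactly two PC-score columns")
    z = scores / scores.std(axis=0)          # scale each PC to unit SD
    stat = (z ** 2).sum(axis=1)              # chi-squared statistic, 2 d.o.f.
    p = np.exp(-stat / 2.0)                  # survival function for 2 d.o.f.
    q = benjamini_hochberg(p)
    return np.flatnonzero(q < q_threshold), q
```

For a different number of PCs the survival function of the matching χ² distribution would be needed instead of the closed form used here.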

Comparison between sinusoidal fitting and PCA based unsupervised FE

A: ratio of the periodic function to the sinusoidal function

[Figure: panels of Q for sinusoidal fitting and for PCA at several values of A.]

PCA: 100% distinction between synthetic periodic and aperiodic profiles.

How about real data?

Real data: Identification of cell cycle regulated genes of budding yeast

Synchronization is required.

Strategy 1: food restriction (metabolic cycle)

Scatter plot of PC1 to PC4 loadings (time)

Numbers are winding numbers around the center.

Consider PC2/PC3.
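A winding number around the center can be computed from a pair of loading vectors by accumulating the change of the polar angle along time. The sketch below is one straightforward way to do this, assuming the center is the origin unless given and that successive time points turn by less than half a revolution; it is not taken from the original analysis, and `winding_number` is a hypothetical name.

```python
import numpy as np

def winding_number(x, y, center=(0.0, 0.0)):
    """Number of counter-clockwise turns of the curve (x[j], y[j]) around center."""
    x = np.asarray(x, dtype=float) - center[0]
    y = np.asarray(y, dtype=float) - center[1]
    angles = np.arctan2(y, x)
    steps = np.diff(angles)
    # Map each step into [-pi, pi) so that jumps across the branch cut are undone.
    steps = (steps + np.pi) % (2.0 * np.pi) - np.pi
    return steps.sum() / (2.0 * np.pi)

# Example: a circle traversed three times.
t_demo = np.linspace(0.0, 3 * 2 * np.pi, 301)
w_demo = winding_number(np.cos(t_demo), np.sin(t_demo))
```

Applied to, for example, PC2 and PC3 loadings as x and y, a value close to a nonzero integer indicates a closed orbit (limit cycle) without assuming any functional form.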

Are genes selected with PC2/PC3 biologically feasible?

[Figure: PC scores (genes) and PC loadings (time).]

Black, red, and green points are the selected genes (P < 0.01).

Clusters: ribosome, mitochondria, cell division → match the original paper

Differ from sinusoidal wave!

REACTOME (selected by PC1 to PC4)

PCA based unsupervised FE is better than all fittings using sinusoidal, rectangular, and triangular waves in selecting biologically feasible genes. Since PC2 differs from PC3, no periodic fitting can work.

[Figure: biological feasibility for each fitting.]

Take home messages:

Gene expression profiles are periodic, but not sinusoidal. Thus, sinusoidal regression might cause artifacts (but there is no way to know the true functional form a priori).

Limit cycles can be identified without assuming functional forms or the period.

Three biologically important gene clusters can be identified in an unsupervised way.

→Superiority of PCA based unsupervised FE

3. Example 2: Identification of miRNA-mRNA interaction from gene expression

(Yh. Taguchi, IJMS, 2016)

What is microRNA (miRNA)？

[Figure: diagram with DNA, mRNA, protein, and miRNA.]

Difficulty of inference of miRNA-mRNA interaction

＊Too many pairs: mRNA 〜 10^4, miRNA 〜 10^3 → pairs 〜 10^7

＊Computational prediction is sequence based

How to solve this problem?

Prescreening mRNA/miRNA based on differential expression (DE)

Ex.: functional miRNA-mRNA pairs in disease → mRNA/miRNA with significant DE: normal vs patients

[Figure: mRNA and miRNA expression in normal samples and patients; matching of mRNA and miRNA with negative correlation.]

Problem: "significant DE" is arbitrary

Screening criteria: P-value + fold change (FC)

P-value: the number of mRNAs/miRNAs, N, is fixed, but the number of samples, M, is variable. M: large → P: small

FC: typical threshold: 2 or ½, but is there any basis?
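The dependence of the P-value on the number of samples can be made concrete with the two-sample t statistic for equal group sizes, t = d / (s·sqrt(2/n)), which grows like sqrt(n) when the observed difference d and the standard deviation s stay fixed. The sketch below uses a large-sample normal approximation for the P-value (an assumption made to avoid extra dependencies); the function names are hypothetical.

```python
import math

def t_statistic(mean_diff, sd, n_per_group):
    """Two-sample t statistic for equal group sizes and a common SD."""
    return mean_diff / (sd * math.sqrt(2.0 / n_per_group))

def approx_p_value(t):
    """Two-sided P-value under a normal approximation (large-sample assumption)."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# Same effect (difference 0.5, SD 1.0), growing sample size per group.
p_by_n = {n: approx_p_value(t_statistic(0.5, 1.0, n)) for n in (8, 32, 128)}
```

The same expression difference therefore crosses any fixed P-value threshold once M is large enough, whereas the fold change does not depend on M at all, which is why a fixed pair of thresholds is hard to justify across data sets.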

Examples of "significant DE" in previous research

[Table: cancers and the DE criteria used in previous studies; entries include "None" and "No mention".]

In real studies: control P-value and FC → good results. Feasibility → no discussion.

If the results are biologically feasible, is there no problem? (No discussion about P-value and FC.)

→ "Which ones are DE mRNA/miRNA?"

→ A true answer exists (but is unknown).

→ A data-driven strategy can help us.

Idea: PCA based unsupervised FE

The number of mRNAs/miRNAs, N, is fixed and the number of samples, M, is variable. What converges as M → ∞?

⇓

The distributions of PC scores (genes) should converge as M → ∞.

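[Figure: gene expression profile of N mRNAs/miRNAs × M (≪ N) samples (normal and patients). PC loadings (converge as M → ∞) are attributed to samples; PC scores (PC1, PC2) are attributed to mRNAs/miRNAs and are assumed to be Gaussian (cf. probabilistic PCA).]

Significance of PC loadings (normal vs patients): t test, P < 0.05

Outliers＊ of PC scores = selected mRNAs/miRNAs

＊: multivariate normal + χ² distribution, BH-corrected P-value < 0.01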
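The choice of which PC to use can be sketched as follows: compute PC loadings for the samples, test each PC's loadings for a difference between normal and patient samples, and take the feature scores of the PC that separates them. The sketch only computes Welch's t statistic and picks the PC with the largest |t|; converting t to the P < 0.05 criterion needs the t distribution and is left out. The centering, the number of PCs inspected, and all names are assumptions.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for two 1-D samples."""
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / a.size
                                           + b.var(ddof=1) / b.size)

def most_discriminative_pc(X, is_patient, n_pcs=5):
    """X: N features x M samples. is_patient: boolean array of length M.

    Returns (index of the chosen PC, its t statistic, its PC scores for features).
    """
    X = np.asarray(X, dtype=float)
    is_patient = np.asarray(is_patient, dtype=bool)
    Xc = X - X.mean(axis=0, keepdims=True)     # assumed centering per sample
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    t_values = np.array([welch_t(Vt[k, is_patient], Vt[k, ~is_patient])
                         for k in range(n_pcs)])
    k_best = int(np.argmax(np.abs(t_values)))
    return k_best, float(t_values[k_best]), U[:, k_best] * s[k_best]
```

The returned scores can then be passed to an outlier test such as the χ² plus Benjamini-Hochberg rule stated on the slide.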

[Figure: workflow. Expression matrices (controls vs patients) for mRNA and miRNA → feature embedding → outlier mRNAs and miRNAs → combined with sequence based miRNA-mRNA interaction prediction (TargetScan) → miRNA-mRNA pairs → reciprocal pairs.]
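The final matching step of the workflow can be sketched in plain Python. The sketch assumes that "reciprocal" means opposite directions of differential expression (miRNA up and mRNA down in patients, or the reverse) and that the sequence-based predictions are available as a list of (miRNA, mRNA) name pairs; the identifiers in the example are placeholders, not real predictions.

```python
def reciprocal_pairs(mirna_direction, mrna_direction, predicted_pairs):
    """Keep predicted pairs whose two members were both selected
    and change in opposite directions.

    mirna_direction, mrna_direction: dict mapping a selected name to
        +1 (higher in patients) or -1 (lower in patients).
    predicted_pairs: iterable of (mirna_name, mrna_name) tuples from a
        sequence-based predictor.
    """
    kept = []
    for mirna, mrna in predicted_pairs:
        if mirna in mirna_direction and mrna in mrna_direction:
            if mirna_direction[mirna] * mrna_direction[mrna] < 0:
                kept.append((mirna, mrna))
    return kept

# Placeholder identifiers, for illustration only.
demo_pairs = reciprocal_pairs(
    {"miR-A": +1, "miR-B": -1},
    {"GENE1": -1, "GENE2": -1},
    [("miR-A", "GENE1"), ("miR-B", "GENE2"), ("miR-C", "GENE1")],
)
```

Only pairs supported by both the expression-based selection and the sequence-based prediction survive this filter.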

Data sets and selections

[Table: six data sets (1 to 6) with numbers of samples (patients, normal) and probes (selected, not selected) for the miRNA/mRNA selected by PCA based unsupervised FE.]

Linear discriminant analysis with PC loadings (LOOCV)

[Tables: LOOCV discrimination results for each data set.]

Successful discrimination suggests successful selection of DE miRNA/mRNA
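A leave-one-out check of this kind can be sketched with a small two-class linear discriminant written in NumPy. It assumes equal class priors and a pooled covariance with a tiny ridge term for numerical stability; the input is an array with one row per sample holding that sample's PC loadings, and the function names are hypothetical.

```python
import numpy as np

def lda_fit(X, y):
    """Two-class LDA with pooled covariance; returns (w, b) for the rule w.x + b > 0."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    S = S + 1e-8 * np.eye(X.shape[1])        # small ridge (assumption)
    w = np.linalg.solve(S, m1 - m0)
    b = -0.5 * w @ (m0 + m1)                 # equal priors (assumption)
    return w, b

def loocv_accuracy(X, y):
    """Leave-one-out cross-validated accuracy. X: samples x loadings, y: 0/1 labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = lda_fit(X[keep], y[keep])
        predicted = int(X[i] @ w + b > 0)
        hits += int(predicted == y[i])
    return hits / len(y)
```

High LOOCV accuracy obtained from the loadings of the selected features is then read as indirect evidence that the selection captured the normal vs patient difference.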

Inferred miRNA/mRNA pairs

[Table: inferred pairs for each cancer (numbers). ＊: starBase reports negative miRNA/mRNA correlation (in any cancer). †: previous studies related to the corresponding cancers.]

Conclusions in this section: Successful identification of miRNA/mRNA interactions based on integrated analysis of gene expression and sequence information, for multiple cancers, multiple platforms, and multiple research groups, by a single criterion. The identified miRNAs, mRNAs, and miRNA/mRNA pairs are related to cancers, consistent with previous studies.

Benefits of a single criterion across multiple cancers:

・Reduced sample bias is expected.

・It can result in "no detection".

・Heterogeneous data sets (mixtures of multiple platforms and research groups) can be analyzed, which widens research possibilities.

Previous studies: None were interested in...

4. Example 3: Inference of non-small cell lung cancer epigenetic therapy target genes

(Yh. Taguchi et al, BMC Med. Geno., 2016)

Non-small cell lung cancer (NSCLC) is a lethal cancer whose five-year survival rate is less than 50% at most.

Recently, epigenetic therapy, which targets the epigenetic regulation of genes, has emerged as a new therapeutic strategy for NSCLC.

In contrast to the many in vivo studies, there exists a relatively small number of in vitro studies.

The number of in vitro studies is small because in vitro treatment of NSCLC often fails to be reproduced.

The lack of in vitro studies keeps us from identifying the target genes of epigenetic therapy.

Instead of directly investigating the in vitro effect of epigenetic therapy, we considered reprogrammed NSCLC cell lines, whose epigenetic profiles are expected to be altered.

Samples (GSE35913)

H1 (ES cell), H358 and H460 (NSCLC), IMR90 (human Caucasian fetal lung fibroblast), iPCH358, iPCH460, iPSIMR90 (reprogrammed cell lines), and piPCH358 (re-differentiated iPCH358)

Three biological replicates each: 3 × 8 = 24 samples for gene expression and for promoter methylation

Aim: Identify genes associated with both DE and aberrant promoter methylation (for biological feasibility)

Since this is a categorical multiclass data set, the usual control vs treated strategy is not useful (we do not know which pairs among the eight classes should be compared).

However, PCA based unsupervised FE has the potential to handle it, since it does not require any predefined criterion to select genes.

To identify the PC loadings used for FE, we applied hierarchical clustering to the PC loadings.

We aimed to find PC loadings that are highly correlated between gene expression and promoter methylation.

Distance = |correlation coef.|

[Figure: hierarchical clustering of PC loadings from gene expression and promoter methylation; highlighted pairs: PC3 vs PC3 and PC4 vs PC4.]

[Figure: PC3 loadings and PC4 loadings (H358 // H460) for gene expression and promoter methylation.]
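The search for matching PC loadings can be sketched without the clustering step: compute the correlation of every gene expression loading with every promoter methylation loading over the shared samples and report, for each expression PC, the methylation PC with the largest absolute correlation. Replacing hierarchical clustering with this direct lookup is a simplification made for illustration, and `match_loadings` is a hypothetical name.

```python
import numpy as np

def match_loadings(load_expr, load_meth):
    """load_expr, load_meth: samples x PCs arrays of PC loadings (same samples).

    Returns a list of (expression PC, best-matching methylation PC, correlation).
    """
    A = np.asarray(load_expr, dtype=float)
    B = np.asarray(load_meth, dtype=float)
    k = A.shape[1]
    # Cross-correlation block between the columns of A and the columns of B.
    C = np.corrcoef(A.T, B.T)[:k, k:]
    best = np.abs(C).argmax(axis=1)
    return [(i, int(j), float(C[i, j])) for i, j in enumerate(best)]
```

The sign of the returned correlation distinguishes loadings that vary in parallel from those that vary in opposite directions across the samples.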

Although we simply required that PC loadings be highly correlated between gene expression and promoter methylation, this allowed us to identify PC loadings that are DE between the NSCLC cell lines and the reprogrammed ones.

In addition, both parallel and antiparallel behaviors between the two NSCLC cell lines were identified.

This is possible only with unsupervised methods.

Method to be compared with: categorical regression based FE (ANOVA)

[Figure: gene expression and promoter methylation plotted against category.]

Features more coincident with the multiple categories are selected.
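For reference, the ANOVA-style comparison method can be sketched as a per-feature one-way F statistic over the sample categories, with features ranked by F. This is the textbook statistic written in NumPy and not necessarily the exact categorical regression used in the study; `anova_f` is a hypothetical name.

```python
import numpy as np

def anova_f(X, labels):
    """One-way ANOVA F statistic for each row (feature) of X.

    X: N features x M samples. labels: length-M array of category labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = X.mean(axis=1)
    ss_between = np.zeros(X.shape[0])
    ss_within = np.zeros(X.shape[0])
    for c in classes:
        Xc = X[:, labels == c]
        class_mean = Xc.mean(axis=1)
        ss_between += Xc.shape[1] * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean[:, None]) ** 2).sum(axis=1)
    df_between = classes.size - 1
    df_within = X.shape[1] - classes.size
    return (ss_between / df_between) / (ss_within / df_within)
```

Features with large F vary strongly between categories relative to their within-category scatter, so taking the top-ranked features implements "more coincident with the categories".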

Our strategy

[Figure: workflow. Gene expression → FE → top 300 significant genes; promoter methylation → FE → top 300 significant genes; genes selected commonly in both FEs are retained.]
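The last step of the strategy is a simple intersection, sketched below in plain Python. It assumes that each FE run yields one P-value per gene and that "top 300" means the 300 smallest P-values; the gene names in the example are placeholders.

```python
def top_genes(p_values, k=300):
    """Names of the k genes with the smallest P-values. p_values: dict name -> P."""
    return set(sorted(p_values, key=p_values.get)[:k])

def common_genes(p_expr, p_meth, k=300):
    """Genes ranked in the top k for both gene expression and promoter methylation."""
    return top_genes(p_expr, k) & top_genes(p_meth, k)

# Placeholder example with k = 2.
demo_common = common_genes(
    {"g1": 0.001, "g2": 0.5, "g3": 0.01, "g4": 0.2},
    {"g1": 0.02, "g2": 0.001, "g3": 0.3, "g4": 0.9},
    k=2,
)
```

Requiring a gene to appear in both lists is what ties the differential expression to the aberrant promoter methylation.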

[Table (PC3): selected genes evaluated by criteria (A) to (D).]

(A) Associations with cancer related genes reported by the Gendoo server.

(B) Significant negative correlations (P < 0.05) between gene expression and promoter methylation.

(C) At least one study reported a direct/indirect relationship with NSCLC.

(D) At least one study reported a direct/indirect relationship with Wnt/β-catenin signalling pathways.

Conclusions in this talk

PCA based unsupervised FE could

identify periodic genes without knowledge about periodicity,

prescreen DE mRNA/miRNA prior to miRNA-mRNA interaction inference,

identify genes associated with significant DE and aberrant promoter methylation within categorical multi class data set.

Thus, it is very useful....

Our strategy successfully identifies genes associated with both DE and aberrant promoter methylation.

They are biologically feasible.

PCA based unsupervised FE identified genes distinct from those identified by ANOVA, the standard method applicable to categorical multiclass data sets.
