TRANSCRIPT
Protein Subcellular Localization Prediction Based on Internal Micro-similarities of Markov Chains
Markov Chains Latent Representation by Micro-similarities
25 July 2019
Asem Alaa¹, Ayman Eldeib¹, Ahmed A. Metwally¹,²
¹ Systems and Biomedical Engineering Department, Cairo University
² Department of Genetics, Stanford University
Background: Protein Localization
Protein Localization: the biological process of accumulation of a protein at a given site. Other names: Protein Targeting or Protein Sorting.
(Source: www.uniprot.org, June 2019)
Uncovering protein localization is essential for target identification in drug discovery.
Background: The current prediction methods
+158 million protein entries.
Strong demand for computational prediction methods.
+90 computational methods for predicting localization.
Methods by category
Rule-based systems (if-else).
Machine learning.
Common drawbacks
Poor classification performance.
Loss of generality.
Lack of interpretability.
What is interpretability? “Interpretability is the degree to which a human can understand the cause of a decision.” (Miller, 2018)
Why does interpretability matter? The goal is not only to predict which location a protein is transported to, but also why it is transported there.
Our objectives
Study Aims:
Enhance the classification performance of current methods.
Keep it interpretable!
Findings:
Generative models: learn representations.
Discriminative models: classify representations.
Markov models are interpretable generative models.
Idea: combine a generative model with a discriminative model.
Now we have the advantages of both interpretability and discrimination power.
Table of contents
1 Introduction
2 Methodology
3 Experiments & Results
4 Conclusion
Introduction
Background: The classification problem of protein localization
In ML notation:
Let 𝑥 be a protein sequence. Let 𝑦 be a subcellular site.
A classification tool estimates 𝑓(𝑥) = 𝑝(𝑦|𝑥) (ideally a probabilistic function). For example, for three candidate sites (shown as organelle icons on the original slide):
𝑓(𝑥) = 0.6 for 𝑦 = site 1, 0.3 for 𝑦 = site 2, 0.1 for 𝑦 = site 3.
Markov chains
𝑥 represents a given sequence of amino acids of length 𝐿: 𝑥 = 𝑎1𝑎2 … 𝑎𝑖 … 𝑎𝐿
𝑛 amino acids (typically 𝑛 = 20): 𝑎𝑖 ∈ 𝑆 and 𝑆 = {𝑠1, 𝑠2, …, 𝑠𝑖, …, 𝑠𝑛}
𝑚-order, time-homogeneous Markov chains (by default first-order, 𝑚 = 1):
𝑃(𝑎𝑖|𝑎1, 𝑎2, … , 𝑎𝑖−1) = 𝑃(𝑎𝑖|𝑎𝑖−1)
Markov chains are represented by a Transition Matrix. A transition matrix of first order (𝑚 = 1) with 20 states (𝑛 = 20), rows indexed by the previous state and columns by the next state:

𝑇 =
         𝑠1           𝑠2           …    𝑠20
𝑠1       𝑝(𝑠1|𝑠1)    𝑝(𝑠2|𝑠1)    …    𝑝(𝑠20|𝑠1)
𝑠2       𝑝(𝑠1|𝑠2)    𝑝(𝑠2|𝑠2)    …    𝑝(𝑠20|𝑠2)
⋮        ⋮            ⋮            ⋱    ⋮
𝑠20      𝑝(𝑠1|𝑠20)   𝑝(𝑠2|𝑠20)   …    𝑝(𝑠20|𝑠20)
Fitting 1𝑠𝑡-order Markov chains (𝑚 = 1)
Constructing the Transition Matrix
Let 𝒟 = {(𝑥𝑖, 𝑦𝑖) | 𝑖 = 1 ∶ 𝑁} be all training examples.
Let 𝒟𝑦 be the subset of training sequences labeled with location 𝑦 ∈ 𝒴, the set of subcellular locations.
The 20 × 20 parameters 𝑝(𝑠𝑖|𝑠𝑗, 𝑦) are computed by:
𝑝(𝑠𝑖|𝑠𝑗, 𝑦) = ∑𝑥∈𝒟𝑦 count(𝑥, 𝑠𝑗𝑠𝑖) / 𝑍
where 𝑍 is a normalization factor (so that each row sums to one). A counting sketch follows.
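A minimal Python/NumPy sketch of this counting step (illustrative only, not the repository implementation; the pseudocount smoothing is an added assumption, since the slides only mention the normalization factor 𝑍):

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                 # the 20 standard residues (n = 20)
    INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

    def fit_transition_matrix(sequences, pseudocount=1.0):
        """Return a 20x20 matrix T with T[j, i] estimating p(s_i | s_j, y)."""
        counts = np.full((20, 20), pseudocount)          # assumed smoothing: avoids empty rows
        for seq in sequences:
            for prev, curr in zip(seq, seq[1:]):         # count every bigram s_j s_i in x
                if prev in INDEX and curr in INDEX:
                    counts[INDEX[prev], INDEX[curr]] += 1
        return counts / counts.sum(axis=1, keepdims=True)   # Z: normalize each row to sum to 1

Fitting this separately on each 𝒟𝑦 yields one 20 × 20 profile per location.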
Traditional method of Markov chain inference
Given a Markov chain (20 × 20 parameters) trained on location 𝑦.
Given an unknown sequence 𝑥 = 𝑎1𝑎2 … 𝑎𝐿, the probability 𝑝(𝑥|𝑦) is:
𝑝(𝑥|𝑦) = 𝑝(𝑎2|𝑎1, 𝑦) 𝑝(𝑎3|𝑎2, 𝑦) … 𝑝(𝑎𝐿|𝑎𝐿−1, 𝑦)
Propensity Ω(𝑥|𝑦): to avoid underflow problems, we rely on log 𝑝 instead:
Ω(𝑥|𝑦) = ∑𝑖=2..𝐿 log 𝑝(𝑎𝑖|𝑎𝑖−1, 𝑦)
The 𝑦 that maximizes Ω(𝑥|𝑦) is reported as the prediction. (Yuan, 1999)
Nothing new here … yet.
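A sketch of this baseline decision rule, reusing INDEX and the per-location matrices from the previous sketch (the profiles dictionary is a hypothetical name):

    import numpy as np

    def propensity(seq, T):
        """Omega(x|y): sum of log transition probabilities of x under profile T."""
        idx = [INDEX[a] for a in seq if a in INDEX]
        return float(sum(np.log(T[j, i]) for j, i in zip(idx, idx[1:])))

    def predict_traditional(seq, profiles):
        """profiles: dict {location: 20x20 transition matrix}. Return the argmax location."""
        return max(profiles, key=lambda y: propensity(seq, profiles[y]))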
Methodology
Group Transition Matrix parameters into a set of probability distributions
Group the transition matrix rows as probability distributions:

𝑇 =
𝑝(𝑠1|𝑠1)     𝑝(𝑠2|𝑠1)     …   𝑝(𝑠20|𝑠1)      ← 𝑄1
𝑝(𝑠1|𝑠2)     𝑝(𝑠2|𝑠2)     …   𝑝(𝑠20|𝑠2)      ← 𝑄2
⋮            ⋮             ⋱   ⋮
𝑝(𝑠1|𝑠20)    𝑝(𝑠2|𝑠20)    …   𝑝(𝑠20|𝑠20)     ← 𝑄20

Alternative representation as 𝑛 probability distributions (𝑛 = 20):
𝐻 = {𝑄1, 𝑄2, … , 𝑄𝑛}
After fitting all training examples in the dataset across all classes, we get the Backbone Profiles ℋ, one profile per subcellular location:
ℋ = {𝐻1, … , 𝐻𝐶}
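For completeness, a one-function sketch of this restructuring; the rows of the fitted matrix already are the distributions 𝑄𝑗:

    def as_profile(T):
        """H = {Q_1, ..., Q_n}: each row of the transition matrix is one distribution."""
        return [T[j, :] for j in range(T.shape[0])]      # Q_j = p( . | s_j)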
Projecting a sequence 𝑥 against a backbone profile 𝐻𝑦
From 𝒟𝑦 ⊆ 𝒟, we already have a Transition Matrix 𝑇𝑦, with alternative structure 𝐻𝑦 = {𝑄𝑦1, … , 𝑄𝑦𝑛}.
From a given sequence 𝑥, we construct a Transition Matrix 𝑇𝑥, with alternative structure 𝐻𝑥 = {𝑄𝑥1, … , 𝑄𝑥𝑛}.
Let 𝜙 ∶ ℝ𝑛 × ℝ𝑛 ↦ ℝ measure the similarity between two probability distributions.
Compute the projection between 𝐻𝑥 and 𝐻𝑦 (here with 𝜙 = cos):
project(𝐻𝑥, 𝐻𝑦) = [cos(𝑄𝑥1, 𝑄𝑦1), … , cos(𝑄𝑥𝑛, 𝑄𝑦𝑛)]
Note: 𝑛 parameters for project(𝐻𝑥, 𝐻𝑦); a sketch is given below.
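A sketch of the projection with 𝜙 = cosine (illustrative; fit_transition_matrix is the earlier sketch, and the small epsilon is an added guard against an all-zero row):

    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two probability distributions (matrix rows)."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def project(H_x, H_y, phi=cosine):
        """Micro-similarities: compare corresponding rows of two 20x20 profiles."""
        return np.array([phi(q_x, q_y) for q_x, q_y in zip(H_x, H_y)])

    # Hypothetical usage for one query sequence x and the backbone profile of location y:
    # H_x = fit_transition_matrix([x])
    # f_xy = project(H_x, H_y)                           # vector of n = 20 values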
Similarity/dissimilarity function 𝜙(𝐴, 𝐵) selection
There is a myriad of similarity metrics in the literature.
Our experiments show stable performance for the cosine function across different datasets.
Cosine (similarity): (∑𝑖=1..𝑛 𝐴𝑖𝐵𝑖) / (√(∑𝑖=1..𝑛 𝐴𝑖²) · √(∑𝑖=1..𝑛 𝐵𝑖²))
Chi-squared (dissimilarity): ∑𝑖=1..𝑛 (𝐴𝑖 − 𝐵𝑖)² / 𝐴𝑖
Hellinger (dissimilarity): (1/√2) · √(∑𝑖=1..𝑛 (√𝐴𝑖 − √𝐵𝑖)²)
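Sketches of the two alternative dissimilarities named above; either can be passed as phi to the project function from the previous sketch (the epsilon in chi-squared is an added guard against division by zero):

    import numpy as np

    def chi_squared(a, b, eps=1e-12):
        """Chi-squared dissimilarity: sum_i (A_i - B_i)^2 / A_i."""
        return float(np.sum((a - b) ** 2 / (a + eps)))

    def hellinger(a, b):
        """Hellinger distance: (1/sqrt(2)) * sqrt(sum_i (sqrt(A_i) - sqrt(B_i))^2)."""
        return float(np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2)) / np.sqrt(2))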
Projecting a sequence 𝑥 against backbone profiles ℋ
We apply project(𝐻𝑥, 𝐻𝑦) for every 𝑦 ∈ 𝒴.
Let 𝐶 = |𝒴| be the number of subcellular locations.
∴ a sequence 𝑥 has a latent representation:
𝔽𝑥 = [project(𝐻𝑥, 𝐻1) ∶ … ∶ project(𝐻𝑥, 𝐻𝐶)]
Note: 𝐶 × 𝑛 parameters for 𝔽𝑥.
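A sketch assembling 𝔽𝑥 from the earlier helpers; backbone is a hypothetical dict mapping each location to its profile 𝐻𝑦:

    import numpy as np

    def latent_representation(seq, backbone, phi=cosine):
        """Return F_x: projections against all C backbone profiles, concatenated."""
        H_x = fit_transition_matrix([seq])               # profile of the single query sequence
        return np.concatenate([project(H_x, H_y, phi) for H_y in backbone.values()])

With 𝐶 = 10 locations and 𝑛 = 20 states this yields a 200-dimensional vector.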
Second stage: Combine a discriminative classifier
What we are up to: transform 𝒟 into 𝒟̂.

𝒟:
𝑥 (sequence)     𝑦 (location)
𝑥1               𝑦1
𝑥2               𝑦2
⋮                ⋮
𝑥𝑁               𝑦𝑁

𝒟̂:
𝔽𝑥 (latent)      𝑦 (location)
𝔽𝑥1              𝑦1
𝔽𝑥2              𝑦2
⋮                ⋮
𝔽𝑥𝑁              𝑦𝑁

Finally, introduce 𝒟̂ to a secondary discriminative classification framework, e.g. KNN, Random Forest, LDA, SVM (sketched below).
To keep interpretability, the secondary classifier needs to be interpretable as well.
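A sketch of the second stage using scikit-learn; a plain multiclass random forest is an assumption here, whereas the experiments later use one-vs-all RF and SVM classifiers:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def second_stage(train_pairs, backbone, phi=cosine):
        """train_pairs: list of (sequence, location). Fit a secondary classifier on F_x."""
        X = np.vstack([latent_representation(x, backbone, phi) for x, _ in train_pairs])
        y = [loc for _, loc in train_pairs]
        clf = RandomForestClassifier(n_estimators=1000)  # 1000 trees, matching the slides
        return clf.fit(X, y)

    # Prediction for an unseen sequence x:
    # second_stage(train_pairs, backbone).predict([latent_representation(x, backbone)])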
Experiments & Results
Datasets: DeepLoc (Almagro Armenteros et al., 2017)
Eukaryotic proteins annotated with experimentally verified localization sites.
The DeepLoc dataset contains 13,858 proteins:

Location              Count     Location                   Count
Nucleus                4043     Cytoplasm                   2542
Extracellular          1973     Mitochondrion               1510
Cell membrane          1340     Endoplasmic reticulum        862
Plastid                 757     Golgi apparatus              356
Lysosome/Vacuole        321     Peroxisome                   154
Results: selection of the secondary classifier
10-class problem.
One-vs-all Random Forests (RF), with 1000 decision trees.
One-vs-all SVM (RBF kernel; the gamma parameter tuned by MaxLIPO+TR from the dlib toolkit).
Stratified 10-fold cross-validation.
Cosine distance used.
Evaluation metrics (from the multiclass confusion matrix 𝐶):
OA ≜ (𝑇𝑃 + 𝑇𝑁) / (𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁)
SN ≜ 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑁)
SP ≜ 𝑇𝑁 / (𝑇𝑁 + 𝐹𝑃)
MCC ≜ (∑𝑘,𝑙,𝑚=1..𝑁 𝐶𝑘𝑘𝐶𝑚𝑙 − 𝐶𝑙𝑘𝐶𝑘𝑚) / ( √(∑𝑘 (∑𝑙 𝐶𝑙𝑘)(∑𝑓,𝑔∶𝑓≠𝑘 𝐶𝑔𝑓)) · √(∑𝑘 (∑𝑙 𝐶𝑘𝑙)(∑𝑓,𝑔∶𝑓≠𝑘 𝐶𝑓𝑔)) )
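A sketch computing these metrics from a multiclass confusion matrix; scikit-learn's matthews_corrcoef implements the multiclass MCC shown above:

    import numpy as np
    from sklearn.metrics import confusion_matrix, matthews_corrcoef

    def per_class_metrics(y_true, y_pred, labels):
        """Return {label: {OA, SN, SP}} computed one-vs-rest, plus the multiclass MCC."""
        C = confusion_matrix(y_true, y_pred, labels=labels)
        total = C.sum()
        scores = {}
        for k, label in enumerate(labels):
            tp = C[k, k]
            fn = C[k, :].sum() - tp
            fp = C[:, k].sum() - tp
            tn = total - tp - fn - fp
            scores[label] = {"OA": (tp + tn) / total,
                             "SN": tp / (tp + fn) if tp + fn else 0.0,
                             "SP": tn / (tn + fp) if tn + fp else 0.0}
        return scores, matthews_corrcoef(y_true, y_pred)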
Conclusion
Conclusion, future work & code availability
A generative model combined with a discriminative one can improve accuracy.
This is a candidate supervised representation learning approach when interpretability is demanded.
Future endeavors
Optimize the integration of the generative and discriminative models.
Possible generative candidates: Markov Random Fields (MRF), Variational Autoencoders (VAE).
You can clone the code at: github.com/aametwally/MC_MicroSimilarities
A tutorial and datasets are available to reproduce the results in the paper.
Questions & references
Thank you for listening!
Questions?

Almagro Armenteros, José Juan et al. (2017). “DeepLoc: prediction of protein subcellular localization using deep learning”. Bioinformatics 33(21), pp. 3387–3395. DOI: 10.1093/bioinformatics/btx431. URL: https://academic.oup.com/bioinformatics/article/33/21/3387/3931857
Miller, Tim (2018). “Explanation in artificial intelligence: Insights from the social sciences”. Artificial Intelligence.
Yuan, Zheng (1999). “Prediction of protein subcellular locations using Markov chain models”. FEBS Letters 451(1), pp. 23–26. DOI: 10.1016/S0014-5793(99)00506-2. URL: https://www.sciencedirect.com/science/article/pii/S0014579399005062