"probabilistic latent semantic analysis for prediction of gene ontology annotations" -...

DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE

Probabilistic Latent Semantic Analysis

for prediction of

Gene Ontology annotations

Davide Chicco, Pietro Pinoli, Marco Masseroli

[email protected]

2012

Davide Chicco @ PhDay2012

Summary

1. The problem

• Biomolecular annotations

• Prediction of biomolecular annotations

2. The methods

• SVD – Singular Value Decomposition

• pLSA – Probabilistic Latent Semantic Analysis

3. Evaluation

• Evaluation data set

• Evaluation results

4. Conclusions


Biomolecular annotations

• The concept of annotation: association of nucleotide or amino

acid sequences with useful information describing their features

• This information is expressed through controlled vocabularies,

sometimes structured as ontologies, where every controlled

term of the vocabulary is associated with a unique

alphanumeric code

• The association of such a code with a gene or protein ID

constitutes an annotation

[Diagram: a Gene / Protein is linked to a Biological function feature through an Annotation (gene2bff)]


Biomolecular annotations (2)

• The association of a piece of information (a feature) with a gene or protein ID constitutes an annotation

• Annotation example:

• gene: GD4

• feature: “is present in the mitochondrial membrane”



Prediction of biomolecular annotations

• Many available annotations in different databanks

• However, available annotations are incomplete

• Only a few of them represent highly reliable, human–curated

information

• To support and quicken the time–consuming curation process,

prioritized lists of computationally predicted annotations

are extremely useful

• These lists could be generated by software that implements Machine Learning algorithms


Annotation prediction through

Singular Value Decomposition – SVD

• Annotation matrix $A \in \{0, 1\}^{m \times n}$

− m rows: genes / proteins

− n columns: annotation terms

A(i,j) = 1 if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule)

A(i,j) = 0 otherwise (the annotation is unknown)

        term01  term02  term03  term04  …  termN
gene01  0       0       0       0       …  0
gene02  0       1       1       0       …  1
…       …       …       …       …       …  …
geneM   0       0       0       0       …  0
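The true path rule above can be illustrated with a minimal sketch. The toy ontology, the gene and term names, and the helper `build_annotation_matrix` are all hypothetical and only meant to show how a direct annotation propagates to the ancestors of a term.

```python
def build_annotation_matrix(genes, terms, direct, parents):
    """Build the binary gene-by-term matrix, applying the true path rule."""
    col = {term: j for j, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in genes]
    for i, gene in enumerate(genes):
        stack = list(direct.get(gene, ()))
        seen = set()
        while stack:
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            matrix[i][col[term]] = 1
            # A gene annotated to a term is also annotated to all its ancestors
            stack.extend(parents.get(term, ()))
    return matrix


# Hypothetical toy ontology: term03 is a child of term02, itself a child of term01
terms = ["term01", "term02", "term03", "term04"]
parents = {"term03": ["term02"], "term02": ["term01"]}
genes = ["gene01", "gene02"]
direct = {"gene02": ["term03"]}  # direct annotations only

A = build_annotation_matrix(genes, terms, direct, parents)
```

Here gene02 has a single direct annotation (term03) but ends up with three entries set to 1, which is consistent with total annotations being far more numerous than direct ones.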



Compute SVD:

$A = U \Sigma V^T$

Compute reduced rank approximation:

$A_k = U_k \Sigma_k V_k^T$

• An annotation prediction is performed by computing a reduced rank approximation $A_k$ of the annotation matrix A (where 0 < k < r, with r the number of nonzero singular values of A, i.e. the rank of A)
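As a minimal sketch of the reduced rank approximation, the following uses NumPy's `numpy.linalg.svd`. The function name `svd_predict` and the toy matrix are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def svd_predict(A, k):
    """Return the rank-k approximation A_k = U_k Sigma_k V_k^T of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Hypothetical 4 genes x 4 terms annotation matrix (full rank, r = 4)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 1]], dtype=float)

A_k = svd_predict(A, k=2)  # real-valued scores, one per gene/term pair
```

The entries of `A_k` are no longer binary: a relatively high score where A holds a 0 is what suggests a candidate annotation.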



Probabilistic Latent Semantic Analysis - pLSA

pLSA:

• An alternative to the SVD method

• Based on Latent Semantic Indexing (LSI)

Latent Semantic Indexing – LSI:

• Identifies latent relationships between different elements

in a certain class

− e.g. between documents and words within them

− between genes and their biomolecular features

described by controlled annotation terms

• Maps class elements to a vector space of reduced

dimensionality, and then analyzes it


Probabilistic Latent Semantic Analysis - pLSA (2)

Suppose you have:

• A set of genes $G = \{g_1, \ldots, g_m\}$ related to a set of feature terms $F = \{f_1, \ldots, f_n\}$ which, together, form a set of controlled biomolecular annotations

• A set of class variables $T = \{t_1, \ldots, t_K\}$, called topics, with every feature term $f \in F$ that can be associated with a topic $t \in T$

The pLSA statistical model associates

every unobserved class variable

(topic) with each observation

(feature term and gene)



• P(f | t): probability of a feature term f being associated with a topic t

• P(t | g): probability of getting a topic t by selecting a gene g

• The following conditions hold:

• $\forall t \in T: \sum_{f \in F} P(f \mid t) = 1$

• $\forall g \in G: \sum_{t \in T} P(t \mid g) = 1$

• The joint probability between g and f is given by:

$P(g, f) = \sum_{t \in T} P(t) \, P(g \mid t) \, P(f \mid t)$
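The joint probability can be computed with a single matrix product. The sketch below uses made-up parameter values for 2 topics, 3 genes and 4 feature terms; the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy parameters: 2 topics, 3 genes, 4 feature terms
p_t = np.array([0.6, 0.4])                 # P(t)
p_g_given_t = np.array([[0.5, 0.1],
                        [0.3, 0.2],
                        [0.2, 0.7]])       # P(g|t), each column sums to 1
p_f_given_t = np.array([[0.4, 0.1],
                        [0.3, 0.1],
                        [0.2, 0.3],
                        [0.1, 0.5]])       # P(f|t), each column sums to 1

# P(g, f) = sum over t of P(t) P(g|t) P(f|t), written as a matrix product
joint = p_g_given_t @ np.diag(p_t) @ p_f_given_t.T
```

Each entry `joint[i, j]` is the model's probability for the pair (gene i, feature term j), and all entries together sum to 1.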



Model training

• Aim: maximum likelihood estimation of P(f|t) using the Expectation Maximization (EM) algorithm on a training set

Model validation

• Validation set of genes and feature terms, with the same feature terms as the training set but completely different genes

• Aim: maximize the formula in [1], using the P(f|t) calculated in the training phase and varying the parameters P(t|g) related to the new genes in the validation set

$L = \sum_{g \in G} \sum_{f \in F} a(g, f) \log P(g, f) \qquad [1]$
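A minimal sketch of the log-likelihood in [1] follows. It assumes that a(g, f) is the entry of the annotation matrix for gene g and feature term f, which the slides do not state explicitly; the function name and the toy values are hypothetical.

```python
import math


def log_likelihood(a, joint):
    """L = sum over g, f of a(g, f) * log P(g, f).

    Pairs with a(g, f) = 0 contribute nothing and are skipped.
    """
    return sum(a_gf * math.log(p_gf)
               for a_row, p_row in zip(a, joint)
               for a_gf, p_gf in zip(a_row, p_row)
               if a_gf)


a = [[1, 0], [0, 1]]                   # toy annotation matrix (assumed a(g, f))
joint = [[0.4, 0.1], [0.1, 0.4]]       # toy model probabilities P(g, f)
L = log_likelihood(a, joint)

uniform = [[0.25, 0.25], [0.25, 0.25]]
L_uniform = log_likelihood(a, uniform)
```

A model that puts more probability on the annotated pairs scores a higher L than the uniform one, which is the quantity maximized in both training and validation.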



EM Algorithm:

It seeks a maximum likelihood estimate by iteratively applying:

• Expectation step: the a posteriori probabilities for the latent variables t are computed, as

$P(t \mid g, f) \propto P(t) \, P(g, f \mid t)$

• Maximization step: the parameter values are updated in order to maximize the log-likelihood.
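The two steps can be sketched as follows. This is a generic EM loop for the symmetric parameterization P(t), P(g|t), P(f|t), assuming P(g, f | t) = P(g|t) P(f|t); it is not the authors' code, and `plsa_em`, its defaults and the toy matrix are assumptions. It also assumes that no topic loses all of its probability mass during the iterations.

```python
import numpy as np


def plsa_em(a, n_topics, n_iter=50, seed=0):
    """Fit P(t), P(g|t), P(f|t) to the annotation matrix a (genes x terms)."""
    rng = np.random.default_rng(seed)
    n_genes, n_terms = a.shape
    p_t = np.full(n_topics, 1.0 / n_topics)
    p_g_t = rng.random((n_genes, n_topics))
    p_g_t /= p_g_t.sum(axis=0)
    p_f_t = rng.random((n_terms, n_topics))
    p_f_t /= p_f_t.sum(axis=0)
    history = []
    for _ in range(n_iter):
        # E-step: P(t|g,f) proportional to P(t) P(g|t) P(f|t)
        post = p_t * p_g_t[:, None, :] * p_f_t[None, :, :]
        joint = post.sum(axis=2)
        history.append(float(np.sum(a[a > 0] * np.log(joint[a > 0]))))
        post /= np.maximum(joint, 1e-300)[:, :, None]
        # M-step: re-estimate parameters from expected counts a(g,f) P(t|g,f)
        counts = a[:, :, None] * post
        topic_mass = counts.sum(axis=(0, 1))
        p_g_t = counts.sum(axis=1) / topic_mass
        p_f_t = counts.sum(axis=0) / topic_mass
        p_t = topic_mass / a.sum()
    return p_t, p_g_t, p_f_t, history


# Hypothetical toy matrix with two obvious groups of genes and terms
a = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
p_t, p_g_t, p_f_t, history = plsa_em(a, n_topics=2)
```

`history` records the log-likelihood before each update, so it can be used to check that the iterations do not decrease it and to decide when to stop.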



In comparison to SVD:

$U_k = [P(g_i \mid t_k)]_{ik}$

$\Sigma_k = \mathrm{diag}[P(t_k)]_k$

$V_k = [P(f_j \mid t_k)]_{jk}$

$A_k = [P(g_i, f_j)]_{ij} = U_k \Sigma_k V_k^T$



The pLSA model imposes the constraint:

$\forall g \in G: \sum_{f \in F} P(f \mid g) = 1$

• This can bias the prediction, because the more annotations a gene has, the lower its average conditional probability is

• To avoid this bias, we propose a normalized extension of pLSA. For each $g \in G$:

i. Compute: $M = \max_{f \in F} P(f \mid g)$

ii. Compute the normalized P(f | g) vector as: $P_{norm}(f \mid g) = \frac{1}{M} P(f \mid g)$

• Thus, the feature terms with the highest conditional probability for a gene are always predicted to be annotated to that gene
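The normalization step is small enough to sketch directly. The function name and the two toy genes are hypothetical; the sketch assumes every row is a valid P(f | g) vector, so its maximum is greater than zero.

```python
def normalize_rows(p_f_given_g):
    """Divide each gene's P(f|g) vector by its maximum, so the top term scores 1."""
    normalized = []
    for row in p_f_given_g:
        m = max(row)  # M = max over f of P(f|g); positive because the row sums to 1
        normalized.append([p / m for p in row])
    return normalized


p = [[0.5, 0.5, 0.0, 0.0],       # gene with two likely terms
     [0.25, 0.25, 0.25, 0.25]]   # gene with four equally likely terms
p_norm = normalize_rows(p)
```

With a threshold of 0.3, for example, the second gene would get no predicted term before normalization and all four after it, which is the bias the normalization is meant to remove.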


Evaluation of the prediction

To evaluate the prediction, we compare each A(i,j) element to its

corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0

• if A(i,j) = 1 & Ak(i,j) > τ: AC: Annotation Confirmed (AC ← AC + 1)

• if A(i,j) = 1 & Ak(i,j) ≤ τ: AR: Annotation to be Reviewed (AR ← AR + 1)

• if A(i,j) = 0 & Ak(i,j) ≤ τ: NAC: No Annotation Confirmed (NAC ← NAC + 1)

• if A(i,j) = 0 & Ak(i,j) > τ: AP: Annotation Predicted (AP ← AP + 1)
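The four counters can be computed in one pass over the two matrices. This is a minimal sketch with made-up values; `count_outcomes` is a hypothetical helper name.

```python
def count_outcomes(A, A_k, tau):
    """Count AC, AR, NAC and AP by comparing A(i,j) with Ak(i,j) at threshold tau."""
    counts = {"AC": 0, "AR": 0, "NAC": 0, "AP": 0}
    for row, pred_row in zip(A, A_k):
        for a, score in zip(row, pred_row):
            if a == 1:
                counts["AC" if score > tau else "AR"] += 1
            else:
                counts["AP" if score > tau else "NAC"] += 1
    return counts


A = [[1, 0, 1],
     [0, 0, 1]]                  # toy input annotations
A_k = [[0.9, 0.6, 0.2],
       [0.1, 0.3, 0.8]]          # toy prediction scores
counts = count_outcomes(A, A_k, tau=0.5)
```

In this toy case the score 0.6 on an unannotated pair becomes a predicted annotation (AP), while the score 0.2 on an annotated pair is flagged for review (AR).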


New concept: Receiver Operating Characteristic

(ROC) curve

Starting from the annotation prediction evaluation categories we just introduced:

Category                        Input  Output
AC: Annotation Confirmed        Yes    Yes
AR: Annotation to be Reviewed   Yes    No
NAC: No Annotation Confirmed    No     No
AP: Annotation Predicted        No     Yes

We can draw the Receiver Operating Characteristic curve for every prediction:

On the x axis, the annotation to be reviewed rate: AR / (AC + AR)

On the y axis, the annotation predicted rate: AP / (AP + NAC)
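A curve of this kind can be traced by sweeping the threshold τ and computing the two rates at each value. The sketch below uses the rate definitions AR / (AC + AR) and AP / (AP + NAC) given with the evaluation results; the function name, the toy data and the chosen thresholds are assumptions.

```python
def rates(A, A_k, tau):
    """Return (AR rate, AP rate) at threshold tau."""
    ac = ar = nac = ap = 0
    for row, pred_row in zip(A, A_k):
        for a, score in zip(row, pred_row):
            if a == 1 and score > tau:
                ac += 1
            elif a == 1:
                ar += 1
            elif score > tau:
                ap += 1
            else:
                nac += 1
    return ar / (ac + ar), ap / (ap + nac)


A = [[1, 0, 1],
     [0, 0, 1]]                  # toy input annotations
A_k = [[0.9, 0.6, 0.2],
       [0.1, 0.3, 0.8]]          # toy prediction scores
curve = [rates(A, A_k, tau) for tau in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Raising τ moves along the curve from "everything predicted" (AP rate 1, AR rate 0) to "nothing predicted" (AP rate 0, AR rate 1).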


Evaluation data set

• We considered the Gene Ontology annotations of two organisms: Gallus gallus (Chicken) and Bos taurus (Cattle)

− Excluding the less reliable Inferred Electronic Annotations

• After this filtering, the resulting data sets were as follows, with total (true-path-rule) annotations about 10 times more numerous than the direct annotations:

Organism       Ontology  Genes  Terms  Annotations (direct)
Gallus gallus  BP        275    527    738
Gallus gallus  CC        260    148    478
Gallus gallus  MF        309    225    509
Bos taurus     BP        512    930    1,557
Bos taurus     CC        497    234    921
Bos taurus     MF        543    422    934


Evaluation results

• ROC curves of the annotation to be reviewed rate AR / (AC + AR) and the annotation predicted rate AP / (AP + NAC) for Bos taurus (Cattle) Cellular Component (top left), Molecular Function (top) and Biological Process (left), for SVD with the best truncation value (in red) and for pLSAnorm with the best number of topics (in green)


Evaluation results (3)

• As an aggregated indicator of prediction performance, we computed the Area Under the Curve (AUC) in the [0, 0.01] range of AP rate values

− We are interested in the low range of AP rate, since it

corresponds to top-ranked predictions of newly inferred

annotations (AP) with the highest score

Area under ROC curves (AUC, %) and execution time (sec)

Organism       Ontology  AUC SVD  AUC pLSAnorm  Time SVD  Time pLSAnorm
Bos taurus     BP        44.30    34.75         33        28,188
Bos taurus     CC        53.03    27.31         36        4,674
Bos taurus     MF        80.96    30.69         11        1,890
Gallus gallus  BP        47.33    44.83         98        3,990
Gallus gallus  CC        75.39    37.22         10        796
Gallus gallus  MF        65.76    29.87         5         422
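An area restricted to a low range of AP rate values can be sketched with the trapezoidal rule. The sketch assumes the curve is given as (AP rate, AR rate) points joined by straight segments and integrates over the AP rate; the function name and the toy curve are hypothetical, and how the resulting area is scaled to a percentage is not stated in the slides, so no scaling is applied here.

```python
def partial_auc(points, x_max):
    """Trapezoidal area under a piecewise-linear curve for 0 <= x <= x_max."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= x_max:
            break
        if x1 > x_max:
            # Interpolate the curve value at the cut-off and stop there
            y1 = y0 + (y1 - y0) * (x_max - x0) / (x1 - x0)
            x1 = x_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical two-point curve: (AP rate, AR rate)
points = [(0.0, 1.0), (1.0, 0.0)]
area_half = partial_auc(points, x_max=0.5)
area_full = partial_auc(points, x_max=1.0)
```

With real curves the cut-off would be 0.01, so that only the top-ranked predictions contribute to the indicator.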


Conclusions

• We proposed the pLSAnorm method as a novel contribution in

the context of prediction of genomic ontological annotations

− Our pLSAnorm method gives better predictions than the Singular Value Decomposition (SVD) method

− The higher execution time of pLSAnorm compared with SVD calls for further optimization, and currently limits its use to off-line analysis or small data sets


Conclusions (2)

• Our approach is not limited to the Gene Ontology considered here and can be applied to any controlled annotations

• Increasingly available multiple annotations from different

controlled vocabularies and ontologies could be jointly

considered to further improve prediction reliability (both in

SVD and pLSAnorm)


Thank you for your attention
