"probabilistic latent semantic analysis for prediction of gene ontology annotations" -...
DESCRIPTION
Talk delivered by Davide Chicco at PhDay 2012 at Dipartimento di Elettronica e Informazione of Politecnico di Milano, Milan, September 2012.
TRANSCRIPT
DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE
Probabilistic Latent Semantic Analysis
for prediction of
Gene Ontology annotations
Davide Chicco, Pietro Pinoli, Marco Masseroli
2012
Davide Chicco @ PhDay2012 2
Summary
1. The problem
• Biomolecular annotations
• Prediction of biomolecular annotations
2. The methods
• SVD – Singular Value Decomposition
• pLSA – Probabilistic Latent Semantic Analysis
3. Evaluation
• Evaluation data set
• Evaluation results
4. Conclusions
Biomolecular annotations
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• This information is expressed through controlled vocabularies,
sometimes structured as ontologies, where every controlled
term of the vocabulary is associated with a unique
alphanumeric code
• The association of such a code with a gene or protein ID
constitutes an annotation
[Diagram: Gene / Protein and Biological function feature, linked by an Annotation (gene2bff)]
Biomolecular annotations (2)
• The association of a piece of information (a feature) with a gene or
protein ID constitutes an annotation
• Annotation example:
• gene: GD4
• feature: “is present in the mitochondrial membrane”
Prediction of biomolecular annotations
• Many available annotations in different databanks
• However, available annotations are incomplete
• Only a few of them represent highly reliable, human–curated
information
• To support and quicken the time–consuming curation process,
prioritized lists of computationally predicted annotations
are extremely useful
• These lists can be generated by software that implements
Machine Learning algorithms
Annotation prediction through
Singular Value Decomposition – SVD
• Annotation matrix $A \in \{0, 1\}^{m \times n}$
− m rows: genes / proteins
− n columns: annotation terms
A(i,j) = 1 if gene / protein i is annotated to term j or to any
descendant of j in the considered ontology structure (true
path rule)
A(i,j) = 0 otherwise (it is unknown)
term01 term02 term03 term04 … termN
gene01 0 0 0 0 … 0
gene02 0 1 1 0 … 1
… … … … … … …
geneM 0 0 0 0 … 0
Singular Value Decomposition – SVD
Compute SVD: $A = U \Sigma V^T$
Compute reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$
• An annotation prediction is performed by computing a reduced
rank approximation $A_k$ of the annotation matrix A
(where 0 < k < r, with r the number of nonzero singular values
of A, i.e. the rank of A)
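A minimal NumPy sketch of the reduced rank approximation described on this slide, applied to a toy 4 x 4 annotation matrix. The function name `truncated_svd_scores`, the toy matrix and the rank tolerance `1e-10` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def truncated_svd_scores(A, k):
    """Return the rank-k approximation A_k = U_k Sigma_k V_k^T of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > 1e-10))  # numerical rank (tolerance is an assumption)
    if not 0 < k < r:
        raise ValueError(f"k must satisfy 0 < k < r = {r}")
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy annotation matrix: rows are genes, columns are annotation terms.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

A_k = truncated_svd_scores(A, k=2)
print(np.round(A_k, 2))
```

Entries that are 0 in A but receive a comparatively high value in A_k are the candidates for predicted annotations.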
Probabilistic Latent Semantic Analysis - pLSA
pLSA:
• An alternative to the SVD method
• Based on Latent Semantic Indexing (LSI)
Latent Semantic Indexing – LSI:
• Identifies latent relationships between different elements
in a certain class
− e.g. between documents and words within them
− between genes and their biomolecular features
described by controlled annotation terms
• Maps class elements to a vector space of reduced
dimensionality, and then analyzes it
Probabilistic Latent Semantic Analysis - pLSA (2)
Suppose you have:
• A set of genes G = {g1, …, gn} related to a set of feature
terms F = {f1, …, fn} which, together, form a set of controlled
biomolecular annotations
• A set of class variables T = {t1, …, tn}, called topics, with every
feature term f ∈ F that can be associated with a topic t ∈ T
The pLSA statistical model associates an unobserved class variable
(topic) with each observation (a feature term and gene pair)
Probabilistic Latent Semantic Analysis - pLSA (3)
• P(f | t): probability that a feature term f is associated with a
topic t
• P(t | g): probability of getting a topic t by selecting a gene g
• The following conditions hold:
• $\forall t \in T: \sum_{f \in F} P(f \mid t) = 1$
• $\forall g \in G: \sum_{t \in T} P(t \mid g) = 1$
• The joint probability of g and f is given by:
$P(g, f) = \sum_{t \in T} P(t)\, P(g \mid t)\, P(f \mid t)$
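A small numerical sketch of the joint probability on this slide, using randomly generated parameters. The array names, the toy sizes and the helper `random_distribution` are assumptions made for illustration; P(t | g) is derived from the other factors with Bayes' rule so that the second normalization condition can be checked as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_terms, n_topics = 4, 5, 2  # toy sizes (assumption)

def random_distribution(shape, axis):
    """Random non-negative array normalized to sum to 1 along `axis`."""
    x = rng.random(shape)
    return x / x.sum(axis=axis, keepdims=True)

p_t = random_distribution((n_topics,), axis=0)                  # P(t)
p_g_given_t = random_distribution((n_genes, n_topics), axis=0)  # P(g | t)
p_f_given_t = random_distribution((n_terms, n_topics), axis=0)  # P(f | t)

# Joint probability: P(g, f) = sum_t P(t) P(g | t) P(f | t)
joint = np.einsum("t,gt,ft->gf", p_t, p_g_given_t, p_f_given_t)

# Bayes' rule: P(t | g) = P(g | t) P(t) / P(g), with P(g) = sum_f P(g, f)
p_g = joint.sum(axis=1)
p_t_given_g = p_g_given_t * p_t / p_g[:, None]

print(joint.sum())                # total probability mass of the joint
print(p_t_given_g.sum(axis=1))    # one value per gene
```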
Probabilistic Latent Semantic Analysis - pLSA (4)
Model training
• Aim: maximum likelihood estimation of P(f|t) by using the
Expectation Maximization (EM) algorithm on a training set
$L = \sum_{g \in G} \sum_{f \in F} a(g, f) \log P(g, f)$   [1]
Model validation
• A gene and feature term validation set with the same feature
terms as the training set, but completely different genes
• Aim: maximize the formula in [1], but by using the P(f|t)
calculated in the training phase and varying the parameters
P(t|g) related to the new genes in the validation set
Probabilistic Latent Semantic Analysis - pLSA (5)
EM Algorithm:
It seeks a maximum likelihood estimate by iteratively applying:
• Expectation step: the a posteriori probabilities of the
latent variables t are computed, as
$P(t \mid g, f) \propto P(t)\, P(g, f \mid t)$
• Maximization step: the parameter values are updated
in order to maximize the log-likelihood.
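The two EM steps can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: it assumes the symmetric parameterization P(t), P(g | t), P(f | t), assumes that a(g, f) is the entry of the binary annotation matrix, and assumes that every gene and every term has at least one annotation. The function name `plsa_em` and the toy matrix are hypothetical.

```python
import numpy as np

def plsa_em(A, n_topics, n_iter=50, seed=0):
    """Fit P(t), P(g|t), P(f|t) to annotation matrix A with EM (sketch)."""
    rng = np.random.default_rng(seed)
    n_genes, n_terms = A.shape
    p_t = np.full(n_topics, 1.0 / n_topics)
    p_g_t = rng.random((n_genes, n_topics))
    p_g_t /= p_g_t.sum(axis=0)
    p_f_t = rng.random((n_terms, n_topics))
    p_f_t /= p_f_t.sum(axis=0)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: P(t | g, f) is proportional to P(t) P(g | t) P(f | t).
        num = p_t[None, None, :] * p_g_t[:, None, :] * p_f_t[None, :, :]
        joint = num.sum(axis=2)  # P(g, f) under the current parameters
        # Log-likelihood: sum over g, f of a(g, f) log P(g, f).
        log_likelihoods.append(
            float(np.sum(A * np.log(np.where(A > 0, joint, 1.0)))))
        post = num / np.maximum(joint, 1e-300)[:, :, None]
        # M-step: re-estimate the parameters from posterior-weighted counts.
        weighted = A[:, :, None] * post
        topic_mass = weighted.sum(axis=(0, 1))
        p_g_t = weighted.sum(axis=1) / topic_mass
        p_f_t = weighted.sum(axis=0) / topic_mass
        p_t = topic_mass / topic_mass.sum()
    return p_t, p_g_t, p_f_t, log_likelihoods

# Toy annotation matrix: 6 genes x 5 feature terms.
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

p_t, p_g_t, p_f_t, lls = plsa_em(A, n_topics=2)
print(lls[0], lls[-1])
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a convenient sanity check for an EM implementation.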
Probabilistic Latent Semantic Analysis - pLSA (5)
In comparison to SVD:
$U_k = [P(g_i \mid t_k)]_{ik}$
$\Sigma_k = \mathrm{diag}[P(t_k)]_k$
$V_k = [P(f_j \mid t_k)]_{jk}$
$A_k = [P(g_i, f_j)]_{ij} = U_k \Sigma_k V_k^T$
Probabilistic Latent Semantic Analysis - pLSA (6)
The pLSA model imposes the constraint:
$\forall g \in G: \sum_{f \in F} P(f \mid g) = 1$
• This can bias the prediction because the more annotations a
gene has, the lower its average conditional probability is
• To avoid such bias we propose a normalized extension of pLSA:
• $\forall g \in G$:
i. Compute: $M = \max_{f \in F} P(f \mid g)$
ii. Compute the normalized P(f | g) vector as: $P_{norm}(f \mid g) = \frac{1}{M} P(f \mid g)$
• Thus, the feature terms with the highest conditional probability
for a gene are always predicted to be annotated to that gene
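The normalization step can be sketched in a few lines of NumPy. The function name `normalize_per_gene` and the toy probabilities are assumptions for illustration; each row holds P(f | g) for one gene.

```python
import numpy as np

def normalize_per_gene(p_f_given_g):
    """Divide each gene's P(f | g) row by its maximum M."""
    p = np.asarray(p_f_given_g, dtype=float)
    M = p.max(axis=1, keepdims=True)  # one maximum per gene
    return p / M

# Toy P(f | g): each row sums to 1, so genes whose probability mass is
# spread over more terms have lower individual values.
p_f_given_g = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.20, 0.30, 0.40],
])

p_norm = normalize_per_gene(p_f_given_g)
print(p_norm)
```

Before normalization, a fixed threshold such as 0.3 would accept the top terms of the first gene but none of the second; after normalization the top-scoring terms of every gene reach 1.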
Evaluation of the prediction
To evaluate the prediction, we compare each A(i,j) element to its
corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0
• if A(i,j) = 1 & Ak(i,j) > τ: AC: Annotation Confirmed
(AC ← AC+1)
• if A(i,j) = 1 & Ak(i,j) ≤ τ: AR: Annotation to be Reviewed
(AR ← AR+1)
• if A(i,j) = 0 & Ak(i,j) ≤ τ: NAC: No Annotation Confirmed
(NAC ← NAC+1)
• if A(i,j) = 0 & Ak(i,j) > τ: AP: Annotation Predicted
(AP ← AP+1)
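The four counters can be computed for a given threshold as in the sketch below. The function name `count_categories` and the toy matrices are illustrative assumptions.

```python
import numpy as np

def count_categories(A, A_k, tau):
    """Count AC, AR, NAC and AP for threshold tau."""
    known = np.asarray(A) == 1                  # input annotation present
    above = np.asarray(A_k, dtype=float) > tau  # predicted score above tau
    return {
        "AC": int(np.sum(known & above)),     # A = 1, A_k > tau
        "AR": int(np.sum(known & ~above)),    # A = 1, A_k <= tau
        "NAC": int(np.sum(~known & ~above)),  # A = 0, A_k <= tau
        "AP": int(np.sum(~known & above)),    # A = 0, A_k > tau
    }

A = np.array([[1, 0],
              [1, 0]])
A_k = np.array([[0.9, 0.2],
                [0.3, 0.7]])

counts = count_categories(A, A_k, tau=0.5)
print(counts)
```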
New concept: Receiver Operating Characteristic
(ROC) curve
Starting from the annotation prediction evaluation categories we just
introduced:
Category                         Input  Output
AC: Annotation Confirmed         Yes    Yes
AR: Annotation to be Reviewed    Yes    No
NAC: No Annotation Confirmed     No     No
AP: Annotation Predicted         No     Yes
We can draw the Receiver Operating Characteristic curve for
every prediction:
On the x axis, the annotation to be reviewed rate: AR / (AC + AR)
On the y axis, the annotation predicted rate: AP / (AP + NAC)
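A sketch of how the curve points could be obtained by sweeping the threshold τ, using the rates AR / (AC + AR) and AP / (AP + NAC). The function name `roc_points`, the toy matrices and the convention of returning 0.0 for an empty denominator are assumptions.

```python
import numpy as np

def roc_points(A, A_k, thresholds):
    """Return (tau, AR rate, AP rate) for each threshold tau."""
    known = np.asarray(A) == 1
    scores = np.asarray(A_k, dtype=float)
    points = []
    for tau in thresholds:
        above = scores > tau
        ac = int(np.sum(known & above))
        ar = int(np.sum(known & ~above))
        ap = int(np.sum(~known & above))
        nac = int(np.sum(~known & ~above))
        # Empty denominators are mapped to 0.0 (assumption).
        ar_rate = ar / (ac + ar) if (ac + ar) else 0.0
        ap_rate = ap / (ap + nac) if (ap + nac) else 0.0
        points.append((tau, ar_rate, ap_rate))
    return points

A = np.array([[1, 0],
              [1, 0]])
A_k = np.array([[0.9, 0.2],
                [0.3, 0.7]])

points = roc_points(A, A_k, thresholds=[0.0, 0.5, 1.0])
for tau, ar_rate, ap_rate in points:
    print(tau, ar_rate, ap_rate)
```

Raising τ moves input annotations from AC to AR and candidate predictions from AP to NAC, so the AR rate grows while the AP rate shrinks.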
Evaluation data set
• We considered the Gene Ontology annotations of the organisms
Gallus gallus (Chicken) and Bos taurus (Cattle)
− Excluding the less reliable Inferred from Electronic Annotation (IEA) annotations
• After this, the resulting data sets were:
Organism       Ontology  Genes  Terms  Annotations (direct)
Gallus gallus  BP        275    527    738
Gallus gallus  CC        260    148    478
Gallus gallus  MF        309    225    509
Bos taurus     BP        512    930    1,557
Bos taurus     CC        497    234    921
Bos taurus     MF        543    422    934
with total (true-path-rule) annotations about 10 times as many as
the direct annotations
Evaluation results
• ROC curves of the annotation to be reviewed rate AR / (AC + AR) and
the annotation predicted rate AP / (AP + NAC) for Bos taurus (Cattle)
Cellular Component (top left), Molecular Function (top) and
Biological Process (left), for SVD with the best truncation value
(in red) and for pLSAnorm with the best number of topics (in green)
Evaluation results (3)
• As an aggregated indicator of prediction performance, we
computed the Area Under the Curve (AUC) in the [0; 0.01] range
of AP rate values
− We are interested in the low range of AP rate, since it
corresponds to top-ranked predictions of newly inferred
annotations (AP) with the highest score
Area under ROC curves (AUC) % and Execution Time (sec)
Organism       Ontology  SVD    pLSAnorm  Time(SVD)  Time(pLSAnorm)
Bos taurus     BP        44.30  34.75     33         28,188
Bos taurus     CC        53.03  27.31     36         4,674
Bos taurus     MF        80.96  30.69     11         1,890
Gallus gallus  BP        47.33  44.83     98         3,990
Gallus gallus  CC        75.39  37.22     10         796
Gallus gallus  MF        65.76  29.87     5          422
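A partial area under a curve restricted to the [0; 0.01] range can be sketched with the trapezoidal rule. This is only an illustration under stated assumptions: the AP rate is taken as the integration variable, the curve is linearly interpolated at the upper limit, and no rescaling to a percentage is applied, since the talk does not specify how the reported values were normalized. The function name `partial_area` and the sample points are hypothetical.

```python
import numpy as np

def partial_area(x, y, x_max):
    """Trapezoidal area under the curve (x, y) for x up to x_max."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")
    x, y = x[order], y[order]
    # Linearly interpolate the curve at the upper limit of the range.
    y_end = np.interp(x_max, x, y)
    keep = x < x_max
    xs = np.append(x[keep], x_max)
    ys = np.append(y[keep], y_end)
    return float(np.sum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2.0))

# Hypothetical curve points: x is the AP rate, y is the other rate.
ap_rate = [0.0, 0.005, 0.02, 1.0]
other_rate = [1.0, 0.5, 0.2, 0.0]

area = partial_area(ap_rate, other_rate, x_max=0.01)
print(area)
```

If the smallest x value is greater than 0, the area is only computed from that value onward.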
Conclusions
• We proposed the pLSAnorm method as a novel contribution in
the context of prediction of genomic ontological annotations
- Our pLSAnorm method gives better predictions than the
Singular Value Decomposition (SVD) method on the evaluated data sets
- The higher execution time of pLSAnorm compared with SVD calls for
further optimization; it currently limits its use to off-line analysis
or small data sets
Conclusions (2)
• Our approach is not limited to the Gene Ontology considered
here and can be applied to any controlled annotations
• Increasingly available multiple annotations from different
controlled vocabularies and ontologies could be jointly
considered to further improve prediction reliability (both in
SVD and pLSAnorm)
Thank you for your attention
Probabilistic Latent Semantic Analysis for
prediction of Gene Ontology annotations