integrating protein-protein interactions: bayesian networks
Post on 12-Jan-2016
25. Lecture WS 2003/04
Bioinformatics III 1
Integrating Protein-Protein Interactions: Bayesian Networks
- A large amount of direct experimental data on protein-protein interactions is
becoming available (yeast two-hybrid, Y2H; mass spectrometry, MS)
Jansen et al. Science 302, 449 (2003)
- Genomic information also provides indirect evidence:
  - interacting proteins are often significantly coexpressed (microarrays)
  - interacting proteins are often colocalized to the same subcellular
    compartment
Problems
Jansen et al. Science 302, 449 (2003)
Unfortunately, interaction data sets are often incomplete and contradictory (von
Mering et al. 2002)
In the context of genome-wide analyses, these inaccuracies are greatly magnified
because the protein pairs that do not interact (negatives) by far outnumber those
that do interact (positives).
E.g. in yeast, the N ≈ 6000 proteins allow for N (N-1) / 2 ≈ 18 million potential
interactions, but the estimated number of actual interactions is < 100,000.
Therefore, even reliable techniques can generate many false positives when
applied genome-wide.
Think of a diagnostic with a 1% false-positive rate for a rare disease occurring in
0.1% of the population. This would roughly produce 1 true positive for every 10
false ones.
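The arithmetic behind this diagnostic example can be checked with a small sketch. The function name and the default `sensitivity=1.0` (every true case is detected) are assumptions made for illustration; they are not part of the lecture.

```python
def true_to_false_positive_ratio(prevalence, false_positive_rate, sensitivity=1.0):
    """Expected number of true positives per false positive.

    sensitivity=1.0 is an illustrative assumption: every true case is detected.
    """
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / false_pos


# Rare disease (0.1% prevalence) tested with a 1% false-positive rate.
ratio = true_to_false_positive_ratio(prevalence=0.001, false_positive_rate=0.01)
print(f"true positives per false positive: {ratio:.3f}")
```

The ratio comes out close to 0.1, i.e. about 1 true positive for every 10 false ones, even though the test itself is fairly accurate.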
Integrative Approach
Jansen et al. Science 302, 449 (2003)
One would like to integrate evidence from many different sources to better
discriminate true from false protein-protein interaction predictions.
Here, a Bayesian approach is used to integrate interaction information. It allows for
the probabilistic combination of multiple data sets and is applied to yeast.
Input: the approach can be used for combining noisy genomic interaction data sets.
Normalization: each source of evidence for interactions is compared against
samples of known positives and negatives ("gold standard").
Output: for every possible protein pair, predict the likelihood of interaction.
Verification: test on experimental interaction data not included in the gold
standard, plus new TAP (tandem affinity purification) experiments.
Integration of various information sources
Jansen et al. Science 302, 449 (2003)
The three different types of data used:
(i) Interaction data from high-throughput experiments. These comprise large-scale
two-hybrid screens (Y2H) (Uetz et al., Ito et al.) and in vivo pull-down
experiments (Gavin et al., Ho et al.).
(ii) Other genomic features. We considered expression data, biological function of
proteins (from Gene Ontology biological process and the MIPS functional catalog),
and data about whether proteins are essential.
(iii) Gold standards of known interactions and noninteracting protein pairs.
Combination of data sets into probabilistic interactomes
The 4 interaction data sets from high-throughput (HT) experiments were combined
into 1 PIE (probabilistic interactome, experimental).
The PIE represents a transformation of the individual binary-valued interaction
sets into a data set where every protein pair is weighted according to the
likelihood that it exists in a complex.
Because the 4 experimental interaction data sets contain correlated evidence, a
fully connected Bayesian network is used.
A "naïve" Bayesian network is used to model the PIP (probabilistic interactome,
predicted) data. These information sets hardly overlap.
Jansen et al. Science 302, 449 (2003)
Bayesian Networks
Bayesian networks are probabilistic models that graphically encode probabilistic
dependencies between random variables.
[Figure: node Y with directed arcs to nodes E1, E2 and E3]
A directed arc between variables Y and E1 denotes conditional dependency of E1 on Y,
as determined by the direction of the arc.
Bayesian networks also include a quantitative measure of dependency. For each
variable and its parents this measure is defined using a conditional probability
function or a table.
Here, one such measure is the probability Pr(E1|Y).
Bayesian Networks
Together, the graphical structure and the conditional probability functions/tables
completely specify a Bayesian network probabilistic model.
[Figure: node Y with directed arcs to nodes E1, E2 and E3]
This model, in turn, specifies a particular factorization of the joint
probability distribution function over the variables in the network.
Here, Pr(Y,E1,E2,E3) = Pr(E1|Y) Pr(E2|Y) Pr(E3|Y) Pr(Y)
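The factorization Pr(Y,E1,E2,E3) = Pr(E1|Y) Pr(E2|Y) Pr(E3|Y) Pr(Y) can be illustrated with a minimal sketch for binary variables. All probability values below are hypothetical and chosen only for illustration; none come from the lecture.

```python
from itertools import product

# Hypothetical parameters, for illustration only.
P_Y = {True: 0.1, False: 0.9}          # Pr(Y)
P_E_TRUE_GIVEN_Y = [                   # Pr(Ei = True | Y) for E1, E2, E3
    {True: 0.8, False: 0.2},
    {True: 0.6, False: 0.1},
    {True: 0.7, False: 0.3},
]


def joint(y, evidence):
    """Pr(Y=y, E1, E2, E3) = Pr(Y) * Pr(E1|Y) * Pr(E2|Y) * Pr(E3|Y)."""
    p = P_Y[y]
    for table, e in zip(P_E_TRUE_GIVEN_Y, evidence):
        p_true = table[y]
        p *= p_true if e else 1.0 - p_true
    return p


# Sanity check: the joint distribution sums to 1 over all 2^4 assignments.
total = sum(
    joint(y, ev)
    for y in (True, False)
    for ev in product((True, False), repeat=3)
)
print(f"sum over all assignments: {total:.6f}")
```

The network structure means that only Pr(Y) and one small table per child node are needed, instead of a full table over all 16 joint assignments.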
Gold-Standard
Jansen et al. Science 302, 449 (2003)
The gold standard should be:
(i) independent from the data sources serving as evidence
(ii) sufficiently large for reliable statistics
(iii) free of systematic bias (e.g. towards certain types of interactions).
Positives: use MIPS (Munich Information Center for Protein Sequences, HW
Mewes) complexes catalog: hand-curated list of complexes (8250 protein pairs that
are within the same complex) from biomedical literature.
Negatives:
- harder to define
- essential for successful training
Assume that proteins in different compartments do not interact.
Synthesize “negatives” from lists of proteins in separate subcellular compartments.
Measure of reliability: likelihood ratio
Jansen et al. Science 302, 449 (2003)
Consider a genomic feature f expressed in binary terms (i.e. "absent" or "present").
The likelihood ratio L(f) is defined as:
$$L(f) = \frac{\text{fraction of gold-standard positives having feature } f}{\text{fraction of gold-standard negatives having feature } f}$$
L(f) = 1 means that the feature has no predictive power: the same fraction of
positives and negatives have feature f.
The larger L(f), the better its predictive power.
Combination of features
Jansen et al. Science 302, 449 (2003)
For two features f1 and f2 with uncorrelated evidence,
the likelihood ratio of the combined evidence is simply the product:
L(f1,f2) = L(f1) L(f2)
For correlated evidence L(f1,f2) cannot be factorized in this way.
Bayesian networks are a formal representation of such relationships between
features.
The combined likelihood ratio is proportional to the estimated odds that two
proteins are in the same complex, given multiple sources of information.
Prior and posterior odds
"Positive": a pair of proteins that are in the same complex. Given the number of
positives among the total number of protein pairs, the "prior" odds of finding a
positive are:
$$O_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)}$$
"Posterior" odds: the odds of finding a positive after considering N datasets with
values f1 ... fN:
$$O_{post} = \frac{P(pos \mid f_1 \ldots f_N)}{P(neg \mid f_1 \ldots f_N)}$$
The terms "prior" and "posterior" refer to the situation before and after knowing the
information in the N datasets.
Jansen et al. Science 302, 449 (2003)
Static naive Bayesian Networks
In the case of protein-protein interaction data, the posterior odds describe the
odds of having a protein-protein interaction given the information from
the N experiments,
whereas the prior odds are related to the chance of randomly finding a
protein-protein interaction when no experimental data is known.
If Opost > 1, the chances of having an interaction are higher than those of having
no interaction.
Jansen et al. Science 302, 449 (2003)
Static naive Bayesian Networks
The likelihood ratio L, defined as
$$L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)}$$
relates prior and posterior odds according to Bayes' rule:
$$O_{post} = L(f_1 \ldots f_N) \, O_{prior}$$
In the special case that the N features are conditionally independent
(i.e. they provide uncorrelated evidence), the Bayesian network is a so-called
"naïve" network, and L can be simplified to:
$$L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid pos)}{P(f_i \mid neg)}$$
Jansen et al. Science 302, 449 (2003)
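As a minimal sketch of the naïve case, the combined likelihood ratio is the product of the per-feature ratios, and the posterior odds follow by multiplying with the prior odds. The function names and the three example ratios are hypothetical; the prior odds of 1/600 is the value used in this lecture.

```python
from math import prod


def combined_likelihood_ratio(ratios):
    """Naive Bayes: L(f1 ... fN) is the product of the individual L(fi)."""
    return prod(ratios)


def posterior_odds(ratios, prior_odds):
    """Bayes' rule in odds form: O_post = L(f1 ... fN) * O_prior."""
    return combined_likelihood_ratio(ratios) * prior_odds


# Three hypothetical, individually weak features.
example_ratios = [3.6, 20.0, 10.0]
o_post = posterior_odds(example_ratios, prior_odds=1 / 600)
print(f"combined L = {combined_likelihood_ratio(example_ratios):.1f}, O_post = {o_post:.2f}")
```

None of the three example ratios reaches 600 on its own, yet their product does, which is the point of combining conditionally independent evidence.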
Computation of prior and posterior odds
L can be computed from contingency tables relating positive and negative
examples with the N features (by binning the feature values f1 ... fN into discrete
intervals); examples follow.
Determining the prior odds Oprior is somewhat arbitrary in that it requires an
assumption about the number of positives.
Jansen et al. believe that 30,000 is a conservative lower bound for the number of
positives (i.e. pairs of proteins that are in the same complex).
Considering that there are ca. 18 million = 0.5 * N (N - 1) possible protein pairs in
total (with N = 6000 for yeast),
$$O_{prior} \approx \frac{3 \times 10^4}{18 \times 10^6} = \frac{1}{600}$$
Opost > 1 can be achieved with L > 600.
Jansen et al. Science 302, 449 (2003)
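The prior odds and the resulting threshold on L can be reproduced with a short sketch. The helper names are hypothetical; the inputs (6000 proteins, 30,000 assumed positives) are the values quoted in the lecture.

```python
def n_pairs(n_proteins):
    """Number of unordered protein pairs, N (N - 1) / 2."""
    return n_proteins * (n_proteins - 1) // 2


def prior_odds(n_positives, n_total_pairs):
    """O_prior = P(pos) / P(neg) = positives / negatives."""
    return n_positives / (n_total_pairs - n_positives)


def odds_to_probability(odds):
    """Convert odds O into a probability O / (1 + O)."""
    return odds / (1.0 + odds)


total_pairs = n_pairs(6000)
o_prior = prior_odds(30_000, total_pairs)
print(f"pairs: {total_pairs}, prior odds: about 1/{1 / o_prior:.0f}")

# Posterior odds of 1 correspond to a 50% chance of being a true positive.
print(f"probability at O_post = 1: {odds_to_probability(1.0):.2f}")
```

The exact value is slightly above 1/600 because the lecture rounds the number of pairs to 18 million; the threshold L > 600 is therefore a rounded rule of thumb.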
Essentiality (PIP)
Consider whether proteins are essential or non-essential, i.e. does a deletion mutant
in which this protein is knocked out of the genome have the same phenotype?
Two proteins in the same complex are more likely to be either both essential or both
non-essential than to be a mixture of these two attributes.
A deletion mutant of either protein should impair the function of the same
complex.
Jansen et al. Science 302, 449 (2003)
Parameters of the naïve Bayesian Networks (PIP)
Column 1 describes the genomic feature. In the "essentiality data", protein pairs can take on 3 discrete
values (EE: both essential; NN: both non-essential; NE: one essential and one not).
Column 2 gives the number of protein pairs with a particular feature value (e.g. "EE") drawn from the whole yeast
interactome (~18M pairs).
Columns "pos" and "neg" give the overlap of these pairs with the 8,250 gold-standard positives and the
2,708,746 gold-standard negatives.
Columns "sum(pos)" and "sum(neg)" show how many gold-standard positives (negatives) are among the
protein pairs with likelihood ratio ≥ L, computed by summing up the values in the "pos" (or "neg") column.
P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values, and
L is the ratio of these two conditional probabilities.
Example (EE):
$$P(EE \mid pos) = \frac{1114}{2150} = 0.518, \qquad P(EE \mid neg) = \frac{81924}{573724} = 0.143$$
Jansen et al. Science 302, 449 (2003)
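The numbers on this slide (1114 of 2150 positives, 81924 of 573724 negatives) can be turned into a likelihood ratio with a short sketch. The function name is hypothetical, and reading these counts as one feature value of the essentiality table is an interpretation of the slide.

```python
def likelihood_ratio(pos_with_value, pos_total, neg_with_value, neg_total):
    """L = P(feature value | pos) / P(feature value | neg), from raw counts."""
    p_pos = pos_with_value / pos_total
    p_neg = neg_with_value / neg_total
    return p_pos, p_neg, p_pos / p_neg


# Counts quoted on the slide.
p_pos, p_neg, ratio = likelihood_ratio(1114, 2150, 81924, 573724)
print(f"P(value|pos) = {p_pos:.3f}, P(value|neg) = {p_neg:.3f}, L = {ratio:.2f}")
```

A ratio of this size is far below the threshold of 600, so this feature value alone cannot yield a confident prediction.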
mRNA expression data
Proteins in the same complex tend to have correlated expression profiles.
Although large differences can exist between mRNA and protein abundance, protein abundance can
be measured indirectly and quite crudely by the presence or absence of the corresponding mRNA
transcript.
Jansen et al. Science 302, 449 (2003)
Experimental data sources:
- time course of expression fluctuations during the yeast cell cycle
- Rosetta compendium: expression profiles of 300 deletion mutants and cells under
chemical treatments.
Problem: both data sets are strongly correlated.
Compute the first principal component of the vector of the 2 correlations.
Use this as an independent source of evidence for the P-P interaction prediction.
The first principal component is a stronger predictor of P-P interactions than either
of the 2 expression correlation datasets by itself.
mRNA expression data
The values for mRNA expression correlation (first principal component) range on a
continuous scale from -1.0 to +1.0 (fully anticorrelated to fully correlated).
This range was binned into 19 intervals.
Jansen et al. Science 302, 449 (2003)
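Binning a continuous correlation value into 19 intervals could look like the sketch below. Equal-width bins are an assumption made here for illustration; the slide does not state the bin boundaries, and the function name is hypothetical.

```python
N_BINS = 19  # number of intervals stated in the lecture


def correlation_bin(r, n_bins=N_BINS):
    """Map a correlation r in [-1, 1] to a bin index 0 .. n_bins - 1.

    Equal-width bins are assumed; the actual boundaries may differ.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    width = 2.0 / n_bins
    # r = +1.0 would land just past the last bin, so clamp it.
    return min(int((r + 1.0) / width), n_bins - 1)


for r in (-1.0, 0.0, 0.95, 1.0):
    print(r, "-> bin", correlation_bin(r))
```

Once every protein pair has a bin index, the likelihood ratio of each bin follows from counting gold-standard positives and negatives per bin.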
PIP – Functional similarity
Quantify functional similarity between two proteins:
Jansen et al. Science 302, 449 (2003)
- Consider which set of functional classes two proteins share, given either the MIPS or Gene
Ontology (GO) classification system.
- Then count how many of the ~18 million protein pairs in yeast share the exact same
functional classes as well (yielding integer counts between 1 and ~18 million). These counts were
binned into 5 intervals.
- In general, the smaller this count, the more similar and specific is the functional description
of the two proteins.
PIP – Functional similarity
Observation: low counts correlate with a higher chance of two proteins being in
the same complex. However, the signal (L) is quite weak.
Jansen et al. Science 302, 449 (2003)
Calculation of the fully connected Bayesian network (PIE)
The 4 binary experimental interaction datasets can be combined in at most 2^4 = 16
different ways (subsets). For each of these 16 subsets, one can compute a
likelihood ratio from the overlap with the gold-standard positives ("pos") and
negatives ("neg").
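Example: a subset that overlaps with 26 of the 8250 gold-standard positives and 2 of the
2708746 gold-standard negatives has
$$L = \frac{26 / 8250}{2 / 2708746} = \frac{26 \times 2708746}{2 \times 8250} \approx 4.3 \times 10^{3}$$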
Jansen et al. Science 302, 449 (2003)
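In a fully connected network, each of the 16 presence/absence patterns gets its own likelihood ratio, estimated directly from gold-standard counts rather than as a product of per-dataset ratios. The sketch below enumerates the patterns and evaluates the ratio for the counts shown on this slide (26 positives, 2 negatives). The function name is hypothetical, and which pattern these counts belong to is not stated here.

```python
from itertools import product

DATASETS = ("G", "H", "U", "I")      # Gavin, Ho, Uetz, Ito
N_POS, N_NEG = 8250, 2708746         # gold-standard sizes quoted in the lecture

# All 2^4 = 16 presence/absence patterns of a protein pair across the 4 datasets.
patterns = list(product((0, 1), repeat=len(DATASETS)))


def pattern_likelihood_ratio(pos_count, neg_count, n_pos=N_POS, n_neg=N_NEG):
    """L for one pattern: fraction of positives over fraction of negatives."""
    if neg_count == 0:
        return float("inf")  # no negatives observed for this pattern
    return (pos_count / n_pos) / (neg_count / n_neg)


print(len(patterns), "patterns")
print(f"L for 26 positives and 2 negatives: {pattern_likelihood_ratio(26, 2):.0f}")
```

Because every pattern is counted separately, correlations between the datasets are absorbed into the table, at the price of needing enough gold-standard pairs per pattern.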
Distribution of likelihood ratios
Number of protein pairs in the individual datasets and the probabilistic interactomes
as a function of the likelihood ratio.
There are many more protein pairs with high
likelihood ratios in the probabilistic interactome
(PIE) than in the individual datasets G, H, U, I (Gavin, Ho, Uetz, Ito).
Protein pairs with high likelihood ratios provide
leads for further experimental investigation of
proteins that potentially form complexes.
Jansen et al. Science 302, 449 (2003)
Jansen et al. Science 302, 449 (2003)
Overview
PIP and PIE are separately tested against the
gold-standard.
PIP vs. the information sources
The ratio of true to false positives (TP/FP) increases
monotonically with Lcut, confirming L as an
appropriate measure of the odds of a real
interaction.
The ratio is computed as:
$$\frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L \ge L_{cut}} pos(L)}{\sum_{L \ge L_{cut}} neg(L)}$$
Protein pairs with L > 600 have a > 50%
chance of being in the same complex.
Jansen et al. Science 302, 449 (2003)
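A minimal sketch of the TP/FP computation: sum the gold-standard positives and negatives over all bins whose likelihood ratio reaches the cutoff. The function name and the three-row table are hypothetical and only illustrate the mechanics.

```python
def tp_fp_ratio(bins, l_cut):
    """TP/FP = sum of pos(L) over sum of neg(L), for all bins with L >= l_cut.

    `bins` is a list of (L, pos, neg) tuples.
    """
    tp = sum(pos for l, pos, _ in bins if l >= l_cut)
    fp = sum(neg for l, _, neg in bins if l >= l_cut)
    return tp / fp if fp else float("inf")


# Hypothetical (L, pos, neg) table, for illustration only.
example_bins = [(1000.0, 90, 10), (100.0, 200, 100), (1.0, 500, 5000)]

for cut in (1.0, 100.0, 600.0):
    print(f"Lcut = {cut:6.1f}  TP/FP = {tp_fp_ratio(example_bins, cut):.2f}")
```

In this toy table the ratio rises as the cutoff increases, mirroring the monotonic behaviour described above.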
PIE vs. the information sources
9897 interactions are predicted from PIP and
163 from PIE.
In contrast, likelihood ratios derived from single
genomic factors (e.g. mRNA coexpression) or
from individual interaction experiments (e.g. the
Ho data set) did not exceed the cutoff when used
alone.
This demonstrates that information sources that,
taken alone, are only weak predictors of
interactions can yield reliable predictions when
combined.
Jansen et al. Science 302, 449 (2003)
parts of PIP graph
To test whether the thresholded PIP was biased toward certain
complexes, compare the distribution of predictions among
gold-standard positives.
(A) The complete set of gold-standard positives and their overlap
with the PIP. The PIP (green) covers 27% of the gold-standard positives
(yellow).
The predicted complexes are roughly equally apportioned among the
different complexes, i.e. there is no bias.
Jansen et al. Science 302, 449 (2003)
parts of PIP graph
Jansen et al. Science 302, 449 (2003)
Graph of the largest complexes in PIP, i.e. only
those proteins having ≥ 20 links.
(Left) overlapping gold-standard positives are
shown in green, PIE links in blue, and overlaps with
both PIE and gold-standard positives in black.
(Right) Overlapping gold-standard negatives are
shown in red. Regions with many red links indicate
potential false-positive predictions.
experimental verification
Jansen et al. Science 302, 449 (2003)
TAP-tagging experiments (Cellzome) were conducted for 98 proteins.
These produced 424 experimental interactions overlapping with the PIP
thresholded at Lcut = 300.
Of these, 185 overlapped with gold-standard positives and 16 with negatives.
25. Lecture WS 2003/04
Bioinformatics III 29
Concentrate on large complexes
Jansen et al. Science 302, 449 (2003)
So far, all interactions were treated as independent.
However, the joint distribution of interactions in the PIs can help identify large
complexes: an ideal complex should be a fully connected "clique" in an
interaction graph.
In practice, this rarely happens because of incorrect or missing links.
Yet large complexes tend to have many interconnections among their members,
whereas false-positive links to outside proteins tend to occur randomly, without a
coherent pattern.
Improve ratio TP / FP
Observation: Increasing the minimum number of links raises TP/FP
by preserving the interactions among proteins in large complexes,
while filtering out false-positive interactions with heterogeneous
groups of proteins outside the complexes.
Jansen et al. Science 302, 449 (2003)
TP/FP for subsets of the
thresholded PIP that only include
proteins with a minimum number
of links. Requiring a minimum
number of links isolates large
complexes in the thresholded PIP
graph (Fig. 3B).
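The effect of requiring a minimum number of links can be sketched on a toy graph. The function name, the protein labels and the single-pass degree count are assumptions for illustration; the lecture does not specify whether degrees are recomputed after filtering.

```python
from collections import Counter


def filter_by_min_links(edges, min_links):
    """Keep only edges whose two proteins each have at least `min_links` links.

    Degrees are counted once on the full graph (single pass, an assumption).
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [(a, b) for a, b in edges
            if degree[a] >= min_links and degree[b] >= min_links]


# Toy graph: a fully connected 4-protein complex plus one stray link to X.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D"),
             ("D", "X")]

kept = filter_by_min_links(toy_edges, min_links=3)
print(kept)
```

The stray link to X is dropped because X has only one link, while all links inside the densely connected complex survive.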
Summary
The Bayesian approach allows reliable predictions of protein-protein interactions by
combining weakly predictive genomic features.
The de novo prediction of complexes replicated interactions found in the
gold-standard positives and PIE.
Also, several predictions were confirmed by new TAP experiments.
The accuracy of the PIP was comparable to that of the PIE while simultaneously
achieving greater coverage.
In a similar manner, the approach could have been extended to a number of other
features related to interactions (e.g. phylogenetic co-occurrence, gene fusions,
gene neighborhood).
As a word of caution: Bayesian approaches don't work everywhere.
Jansen et al. Science 302, 449 (2003)