integrating protein-protein interactions: bayesian networks

25. Lecture WS 2003/04

Bioinformatics III 1

Integrating Protein-Protein Interactions: Bayesian Networks

- A large amount of direct experimental data on protein-protein interactions is becoming available (Y2H, MS).

Jansen et al. Science 302, 449 (2003)

- Genomic information also provides indirect evidence:

- interacting proteins are often significantly coexpressed (microarrays)

- interacting proteins are often colocalized in the same subcellular compartment

Problems

Unfortunately, interaction data sets are often incomplete and contradictory (von

Mering et al. 2002)

In the context of genome-wide analyses, these inaccuracies are greatly magnified

because the protein pairs that do not interact (negatives) by far outnumber those

that do interact (positives).

E.g. in yeast, the ~6000 proteins allow for N (N-1) / 2 ≈ 18 million potential interactions, but the estimated number of actual interactions is < 100,000.

Therefore, even reliable techniques can generate many false positives when

applied genome-wide.

Think of a diagnostic with a 1% false-positive rate for a rare disease occurring in

0.1% of the population. This would roughly produce 1 true positive for every 10

false ones.
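The arithmetic behind this analogy can be checked with a short sketch. It assumes the diagnostic detects every true case (sensitivity = 1.0), which the slide does not state; the function name is illustrative only.

```python
def true_to_false_positive_ratio(prevalence, sensitivity, false_positive_rate):
    """Expected number of true positives per false positive when screening."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / false_pos

# Numbers from the slide: 0.1% prevalence, 1% false-positive rate.
# Assumption (not in the slide): perfect sensitivity.
ratio = true_to_false_positive_ratio(prevalence=0.001, sensitivity=1.0,
                                     false_positive_rate=0.01)
print(f"true positives per false positive: {ratio:.3f}")
```

The ratio comes out close to 0.1, i.e. roughly 1 true positive for every 10 false ones. The same base-rate effect applies to interaction screens, where negatives outnumber positives by a similar margin.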

Integrative Approach

One would like to integrate evidence from many different sources to better discriminate true from false protein-protein interaction predictions.

Here, a Bayesian approach is used to integrate interaction information; it allows the probabilistic combination of multiple data sets and is applied to yeast.

Input: Approach can be used for combining noisy genomic interaction data sets.

Normalization: Each source of evidence for interactions is compared against

samples of known positives and negatives (“gold-standard”).

Output: predict for every possible protein pair the likelihood of interaction.

Verification: test on experimental interaction data not included in the gold-standard, plus new TAP (tandem affinity purification) experiments.

Integration of various information sources

The three different types of data used:

(i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) (Uetz et al., Ito et al.) and in vivo pull-down experiments (Gavin et al., Ho et al.).

(ii) Other genomic features. We considered expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential.

(iii) Gold-standards of known interactions and noninteracting protein pairs.

Combination of data sets into probabilistic interactomes

(B) The 4 interaction data sets from HT experiments were combined into 1 PIE (probabilistic interactome, experimental).

The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists in a complex.

Because the 4 experimental interaction data sets contain correlated evidence, a fully connected Bayesian network is used for the PIE.

A „naïve“ Bayesian network is used to model the PIP (probabilistic interactome, predicted) data, because these information sets hardly overlap.

Bayesian Networks

Bayesian networks are probabilistic models that graphically encode probabilistic dependencies between random variables.

[Figure: node Y with directed arcs to nodes E1, E2 and E3]

A directed arc between variables Y and E1 denotes conditional dependency of E1 on Y, as determined by the direction of the arc.

Bayesian networks also include a quantitative measure of dependency. For each variable and its parents this measure is defined using a conditional probability function or a table.

Here, one such measure is the probability Pr(E1|Y).

Bayesian Networks

Together, the graphical structure and the conditional probability functions/tables completely specify a Bayesian network probabilistic model.

This model, in turn, specifies a particular factorization of the joint probability distribution function over the variables in the network.

Here, Pr(Y,E1,E2,E3) = Pr(E1|Y) Pr(E2|Y) Pr(E3|Y) Pr(Y)
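As a minimal sketch of this factorization, the joint probability for one parent Y and three binary evidence nodes can be evaluated directly from the conditional probability tables. All numbers below are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical tables (illustrative numbers, not from the lecture).
p_y = {True: 0.01, False: 0.99}      # Pr(Y)
p_e_given_y = [                      # Pr(Ei = 1 | Y) for i = 1..3
    {True: 0.60, False: 0.05},
    {True: 0.40, False: 0.10},
    {True: 0.70, False: 0.20},
]

def joint(y, evidence):
    """Pr(Y, E1, E2, E3) = Pr(Y) * Pr(E1|Y) * Pr(E2|Y) * Pr(E3|Y)."""
    prob = p_y[y]
    for table, e in zip(p_e_given_y, evidence):
        p_one = table[y]
        prob *= p_one if e else 1.0 - p_one
    return prob

# Sanity check: the joint distribution sums to 1 over all 2 * 2^3 states.
total = sum(joint(y, ev)
            for y in (True, False)
            for ev in product((0, 1), repeat=3))
print(round(total, 12))
```

Only 1 + 3 * 2 numbers are needed to specify the model, instead of one probability for each of the 16 joint states.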

Gold-Standard

should be

(i) independent from the data sources serving as evidence

(ii) sufficiently large for reliable statistics

(iii) free of systematic bias (e.g. towards certain types of interactions).

Positives: use MIPS (Munich Information Center for Protein Sequences, HW

Mewes) complexes catalog: hand-curated list of complexes (8250 protein pairs that

are within the same complex) from biomedical literature.

Negatives:

- harder to define

- essential for successful training

Assume that proteins in different compartments do not interact.

Synthesize “negatives” from lists of proteins in separate subcellular compartments.

Measure of reliability: likelihood ratio

Consider a genomic feature f expressed in binary terms (i.e. „absent“ or „present“).

The likelihood ratio L(f) is defined as:

$$L(f) = \frac{\text{fraction of gold-standard positives having feature } f}{\text{fraction of gold-standard negatives having feature } f}$$

L(f) = 1 means that the feature has no predictive power: the same fraction of positives and negatives have feature f.

The larger L(f), the better its predictive power.
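A minimal sketch of the likelihood ratio as a ratio of two fractions. The gold-standard sizes are the ones quoted in the lecture (8,250 positives, 2,708,746 negatives); the feature counts are hypothetical.

```python
def likelihood_ratio(pos_with_f, n_pos, neg_with_f, n_neg):
    """L(f) = fraction of positives with f / fraction of negatives with f."""
    frac_pos = pos_with_f / n_pos
    frac_neg = neg_with_f / n_neg
    return frac_pos / frac_neg

N_POS = 8250        # gold-standard positives (lecture value)
N_NEG = 2708746     # gold-standard negatives (lecture value)

# Hypothetical counts of gold-standard pairs that show the feature.
L_f = likelihood_ratio(pos_with_f=1000, n_pos=N_POS,
                       neg_with_f=100000, n_neg=N_NEG)
print(f"L(f) = {L_f:.2f}")
```

The function divides by the number of negatives with the feature, so a feature value that no gold-standard negative carries needs separate handling.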

Combination of features

For two features f1 and f2 with uncorrelated evidence,

the likelihood ratio of the combined evidence is simply the product:

L(f1,f2) = L(f1) L(f2)

For correlated evidence L(f1,f2) cannot be factorized in this way.

Bayesian networks are a formal representation of such relationships between

features.

The combined likelihood ratio is proportional to the estimated odds that two

proteins are in the same complex, given multiple sources of information.

Prior and posterior odds

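„Positive“: a pair of proteins that are in the same complex. Given the number of positives among the total number of protein pairs, the „prior“ odds of finding a positive are:

$$O_{\text{prior}} = \frac{P(\text{pos})}{P(\text{neg})} = \frac{P(\text{pos})}{1 - P(\text{pos})}$$

„Posterior“ odds: odds of finding a positive after considering N datasets with values f1 ... fN:

$$O_{\text{post}} = \frac{P(\text{pos} \mid f_1 \ldots f_N)}{P(\text{neg} \mid f_1 \ldots f_N)}$$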

The terms „prior“ and „posterior“ refer to the situation before and after knowing the

information in the N datasets.

Static naive Bayesian Networks

In the case of protein-protein interaction data, the posterior odds describe the

odds of having a protein-protein interaction given that we have the information from

the N experiments,

whereas the prior odds are related to the chance of randomly finding a protein-

protein interaction when no experimental data is known.

If Opost > 1, the chances of having an interaction are

higher than having no interaction.

Static naive Bayesian Networks

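The likelihood ratio L, defined as

$$L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid \text{pos})}{P(f_1 \ldots f_N \mid \text{neg})}$$

relates prior and posterior odds according to Bayes' rule:

$$O_{\text{post}} = L(f_1 \ldots f_N) \, O_{\text{prior}}$$

In the special case that the N features are conditionally independent (i.e. they provide uncorrelated evidence) the Bayesian network is a so-called „naïve“ network, and L can be simplified to:

$$L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid \text{pos})}{P(f_i \mid \text{neg})}$$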
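The naïve combination can be sketched in a few lines: multiply the per-feature likelihood ratios, then scale the prior odds. The per-feature ratios below are hypothetical; the prior odds of 1/600 correspond to the threshold L > 600 used in the lecture.

```python
import math

def combined_likelihood_ratio(ratios):
    """Naive Bayes: L(f1..fN) is the product of the per-feature ratios."""
    return math.prod(ratios)

def posterior_odds(prior_odds, ratios):
    """Bayes' rule in odds form: O_post = L(f1..fN) * O_prior."""
    return combined_likelihood_ratio(ratios) * prior_odds

# Hypothetical per-feature likelihood ratios for one protein pair.
feature_ratios = [10.0, 8.0, 12.0]
o_prior = 1.0 / 600.0          # prior odds assumed for yeast
o_post = posterior_odds(o_prior, feature_ratios)
p_post = o_post / (1.0 + o_post)   # convert odds to a probability
print(f"O_post = {o_post:.2f}, P(pos | features) = {p_post:.2f}")
```

None of the three hypothetical features reaches the threshold alone, but their product (960) does. The product form is only valid when the features are conditionally independent.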

Computation of prior and posterior odds

L can be computed from contingency tables relating positive and negative examples with the N features (by binning the feature values f1 ... fN into discrete intervals); examples follow.

Determining the prior odds Oprior is somewhat arbitrary in that it requires an assumption about the number of positives.

Jansen et al. believe that 30,000 is a conservative lower bound for the number of positives (i.e. pairs of proteins that are in the same complex).

Considering that there are ca. 18 million = 0.5 * N (N - 1) possible protein pairs in total (with N = 6000 for yeast),

$$O_{\text{prior}} \approx \frac{3 \times 10^{4}}{1.8 \times 10^{7}} = \frac{1}{600}$$

Opost > 1 can therefore be achieved with L > 600.
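The prior odds and the resulting likelihood-ratio threshold follow from the numbers on this slide; a short check (variable names are illustrative):

```python
n_proteins = 6000                               # yeast, from the slide
n_pairs = n_proteins * (n_proteins - 1) // 2    # all unordered protein pairs
n_positives = 30000                             # lower bound from Jansen et al.

# Odds = positives : negatives among all possible pairs.
o_prior = n_positives / (n_pairs - n_positives)

# O_post = L * O_prior exceeds 1 once L exceeds 1 / O_prior.
l_needed = 1.0 / o_prior
print(n_pairs, round(l_needed, 1))
```

The exact value is slightly below 600 because odds divide by the negatives rather than by all pairs; rounding to 600 gives the cutoff quoted in the lecture.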

Essentiality (PIP)

Consider whether proteins are essential or non-essential, i.e. does a deletion mutant in which the gene for this protein is knocked out still have the same phenotype?

It should be more likely that the 2 proteins in a complex are either both essential or both non-essential than that they differ in this attribute.

Deleting either one of the two proteins should impair the function of the same complex.

Parameters of the naïve Bayesian Networks (PIP)

Column 1 describes the genomic feature. In the „essentiality data“ protein pairs can take on 3 discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Column 2 gives the number of protein pairs with a particular feature value (e.g. „EE“) drawn from the whole yeast interactome (~18M pairs).

Columns „pos“ and „neg“ give the overlap of these pairs with the 8,250 gold-standard positives and the 2,708,746 gold-standard negatives.

Columns „sum(pos)“ and „sum(neg)“ show how many gold-standard positives (negatives) are among the protein pairs with a likelihood ratio of at least L, computed by summing up the values in the „pos“ (or „neg“) column.

P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values, and L is the ratio of these two conditional probabilities.

mRNA expression data

Proteins in the same complex tend to have correlated expression profiles.

Although large differences can exist between the mRNA and protein abundance, protein abundance can be indirectly and quite crudely measured by the presence or absence of the corresponding mRNA transcript.

Experimental data source:

- time course of expression fluctuations during the yeast cell cycle

- Rosetta compendium: expression profiles of 300 deletion mutants and cells under

chemical treatments.

Problem: both data sets are strongly correlated.

Compute the first principal component of the vector of the 2 correlations.

Use this as an independent source of evidence for the P-P interaction prediction.

The first principal component is a stronger predictor of P-P interactions than either of the 2 expression correlation datasets by themselves.
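The lecture does not say how the principal component was computed. The following is one possible sketch for the two-variable case, using the closed-form orientation of the leading eigenvector of a 2x2 covariance matrix. The correlation values are hypothetical.

```python
import math
from statistics import variance

def first_principal_component(xs, ys):
    """Project centred (x, y) points onto their axis of largest variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    return [a * ux + b * uy for a, b in zip(dx, dy)]

# Hypothetical expression correlations of five protein pairs
# in the cell-cycle and Rosetta data sets.
cell_cycle_corr = [0.9, 0.1, -0.3, 0.5, 0.7]
rosetta_corr = [0.8, 0.2, -0.4, 0.4, 0.6]
pc1 = first_principal_component(cell_cycle_corr, rosetta_corr)
print([round(v, 3) for v in pc1])
```

Each protein pair gets a single score that captures the shared variation of the two correlated measurements, so the pair of data sets enters the naïve network as one feature instead of two.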

mRNA expression data

The values for mRNA expression correlation (first principal component) range on a continuous scale from -1.0 to +1.0 (fully anticorrelated to fully correlated).

This range was binned into 19 intervals.

PIP – Functional similarity

Quantify functional similarity between two proteins:

- consider which set of functional classes two proteins share, given either the MIPS or Gene Ontology (GO) classification system.

- Then count how many of the ~18 million protein pairs in yeast share the exact same functional classes as well (yielding integer counts between 1 and ~18 million). These counts were binned into 5 intervals.

- In general, the smaller this count, the more similar and specific is the functional description of the two proteins.

PIP – Functional similarity

Observation: low counts correlate with a higher chance of two proteins being in

the same complex. But signal (L) is quite weak.

Calculation of the fully connected Bayesian network (PIE)

The 4 binary experimental interaction datasets can be combined in at most 2^4 = 16 different ways (subsets). For each of these 16 subsets, one can compute a likelihood ratio from the overlap with the gold-standard positives („pos“) and negatives („neg“):

$$L = \frac{\text{pos} / 8250}{\text{neg} / 2708746}$$
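A sketch of the fully connected case: each protein pair is described by its presence/absence pattern across the four datasets (G, H, U, I), and one likelihood ratio is estimated per pattern rather than per dataset. The toy gold-standard patterns below are hypothetical, and the function name is illustrative.

```python
from collections import Counter
from itertools import product

DATASETS = ("G", "H", "U", "I")   # Gavin, Ho, Uetz, Ito

def pattern_likelihood_ratios(pos_patterns, neg_patterns):
    """One likelihood ratio per joint presence/absence pattern."""
    n_pos, n_neg = len(pos_patterns), len(neg_patterns)
    pos_counts, neg_counts = Counter(pos_patterns), Counter(neg_patterns)
    ratios = {}
    for pattern in product((0, 1), repeat=len(DATASETS)):
        pos, neg = pos_counts[pattern], neg_counts[pattern]
        if neg == 0:
            ratios[pattern] = None   # undefined without any negatives
        else:
            ratios[pattern] = (pos / n_pos) / (neg / n_neg)
    return ratios

# Hypothetical toy gold standard: 10 positives and 100 negatives.
pos_patterns = [(1, 1, 0, 0)] * 3 + [(0, 0, 0, 0)] * 7
neg_patterns = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 99
ratios = pattern_likelihood_ratios(pos_patterns, neg_patterns)
print(len(ratios), ratios[(1, 1, 0, 0)])
```

Because every joint pattern gets its own ratio, no independence assumption between the datasets is needed. The price is that the number of patterns grows as 2^N, which is manageable for N = 4.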

Distribution of likelihood ratios

Number of protein pairs in the individual datasets and the probabilistic interactomes

as a function of the likelihood ratio.

There are many more protein pairs with high

likelihood ratios in the probabilistic interactomes

(PIE) than in the individual datasets G,H,U,I.

Protein pairs with high likelihood ratios provide

leads for further experimental investigation of

proteins that potentially form complexes.

Overview

PIP and PIE are separately tested against the

gold-standard.

PIP vs. the information sources

Ratio of true to false positives (TP/FP) increases

monotonically with Lcut, confirming L as an

appropriate measure of the odds of a real

interaction.

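The ratio is computed as:

$$\frac{TP}{FP}(L_{\text{cut}}) = \frac{\sum_{L \ge L_{\text{cut}}} \text{pos}(L)}{\sum_{L \ge L_{\text{cut}}} \text{neg}(L)}$$

where pos(L) and neg(L) are the numbers of gold-standard positives and negatives with likelihood ratio L.

Protein pairs with L > 600 have a > 50% chance of being in the same complex.

Jansen et al. Science 302, 449 (2003)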
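A minimal sketch of how TP/FP can be evaluated against the gold standard for a given cutoff, assuming true and false positives are counted as the gold-standard positives and negatives with a likelihood ratio at or above Lcut. The bins are hypothetical.

```python
def tp_fp_ratio(bins, l_cut):
    """bins: iterable of (likelihood_ratio, pos, neg) tuples."""
    tp = sum(pos for lr, pos, neg in bins if lr >= l_cut)
    fp = sum(neg for lr, pos, neg in bins if lr >= l_cut)
    return tp / fp if fp else float("inf")

# Hypothetical bins: (L, gold-standard positives, gold-standard negatives).
bins = [(1000.0, 50, 10), (100.0, 200, 400), (1.0, 1000, 50000)]

for l_cut in (1.0, 100.0, 600.0):
    print(l_cut, round(tp_fp_ratio(bins, l_cut), 3))
```

In this toy example the ratio rises as the cutoff is raised, which is the monotonic behaviour the slide describes.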

PIE vs. the information sources

9897 interactions are predicted from PIP and

163 from PIE.

In contrast, likelihood ratios derived from single genomic factors (e.g. mRNA coexpression) or from individual interaction experiments (e.g. the Ho data set) did not exceed the cutoff when used alone.

This demonstrates that information sources that,

taken alone, are only weak predictors of

interactions can yield reliable predictions when

combined.

parts of PIP graph

To test whether the thresholded PIP was biased toward certain complexes, compare the distribution of predictions among gold-standard positives.

(A) The complete set of gold-standard positives and their overlap with the PIP. The PIP (green) covers 27% of the gold-standard positives (yellow).

The predictions are roughly equally apportioned among the different complexes, indicating no bias.

Jansen et al. Science 302, 449 (2003)

parts of PIP graph

Graph of the largest complexes in PIP, i.e. only those proteins having ≥ 20 links.

(Left) overlapping gold-standard positives are

shown in green, PIE links in blue, and overlaps with

both PIE and gold-standard positives in black.

(Right) Overlapping gold-standard negatives are

shown in red. Regions with many red links indicate

potential false-positive predictions.

experimental verification

TAP-tagging experiments (Cellzome) were conducted for 98 proteins.

These produced 424 experimental interactions overlapping with the PIP thresholded at Lcut = 300.

Of these, 185 overlapped with gold-standard positives and 16 with negatives.

Concentrate on large complexes

So far, all interactions were treated as independent.

However, the joint distribution of interactions in the PIs can help identify large

complexes: an ideal complex should be a fully connected „clique“ in an

interaction graph.

In practice, this rarely happens because of incorrect or missing links.

Yet large complexes tend to have many interconnections between them,

whereas false-positive links to outside proteins tend to occur randomly, without a

coherent pattern.

Improve ratio TP / FP

Observation: Increasing the minimum number of links raises TP/FP

by preserving the interactions among proteins in large complexes,

while filtering out false-positive interactions with heterogeneous

groups of proteins outside the complexes.

TP/FP for subsets of the

thresholded PIP that only include

proteins with a minimum number

of links. Requiring a minimum

number of links isolates large

complexes in the thresholded PIP

graph (Fig. 3B).
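A sketch of the link-count filter on a thresholded interaction graph. It applies a single pass (degrees are counted once on the full graph), which is an assumption; the lecture does not say whether the filter was applied once or iteratively. The edge list is hypothetical.

```python
from collections import Counter

def filter_by_min_links(edges, min_links):
    """Keep only edges whose two proteins both have at least min_links links."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    keep = {protein for protein, d in degree.items() if d >= min_links}
    return [(a, b) for a, b in edges if a in keep and b in keep]

# Hypothetical predicted interactions: A, B, C form a small complex,
# D is attached by a single (possibly false-positive) link.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
print(filter_by_min_links(edges, min_links=2))
```

The densely connected triangle survives, while the isolated link to D is dropped. This mirrors the idea that false-positive links tend to be sparse and lack a coherent pattern.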

Summary

In a similar manner, the approach could have been extended to a number of other

features related to interactions (e.g. phylogenetic co-occurrence, gene fusions,

gene neighborhood).

Bayesian approach allows reliable predictions of protein-protein interactions by

combining weakly predictive genomic features.

The de novo prediction of complexes replicated interactions found in the gold-

standard positives and PIE.

Also, several predictions were confirmed by new TAP experiments.

The accuracy of the PIP was comparable to that of the PIE while simultaneously

achieving greater coverage.

As a word of caution: Bayesian approaches don't work everywhere.
