comp598 f2016 lecture 19 - cs.mcgill.cajeromew/docs/comp598_f2016_lecture_19.pdf · reduc:onism vs....

16-11-09


COMP598: Advanced Computational Biology Methods & Research

System Biology

Jérôme Waldispühl School of Computer Science, McGill

(includes slides from M.Lavallée-Adam & B. Berger)

What is it?

“The science of integrating genetic, genomic, biochemical, cellular, physiological and clinical data to create a system network that can be used to predictively model a biological event(s).”

Morel, N.M., et al. Mayo Clin Proc, 2004. 79(5): p. 651-8.

Reductionism vs. Systems Biology

In the 20th century…

(1928) Ludwig von Bertalanffy: General Systems Theory, a “general science of wholeness”

(1948) Norbert Wiener: Cybernetics, “The science of communications and automatic control systems in both machine and living things.”

(1952) Hodgkin & Huxley: Math model explaining the action potential propagating along the axon of a neuronal cell.

(1960) Denis Noble: Math model of cardiac cells.

(1968) Mihajlo Mesarovic: Organized the 1st “System theory and Biology” symposium. Launch of a new scientific discipline!

(1990s) “-omics revolution”!

“-omics revolution”

Genomics, Proteomics, Metabolomics, Transcriptomics, Functional proteomics/genomics

SYSTEMS BIOLOGY

Morel,N.M.,etal.MayoClinProc,2004.79(5):p.651-8.

Back to the car analogy

•  How would you use a systems approach to understand how a car functions?

1.  Preliminary understanding -> formulate a simple model
2.  Define all the components: mechanical, electrical, and control.
3.  Perturb the car and compare to a normal car
4.  Integrate data and compare to your model
5.  Discrepancies? -> new hypothesis -> repeat steps 3-5.


A test system: galactose utilization in S. cerevisiae

•  9 elements:
–  4 enzymes catalyze conversion of galactose (gal) to glucose-6-P
–  1 transporter molecule
   •  Sets the state of the system
–  4 transcription factors (TFs)
   •  Turn system on/off depending on galactose presence/absence

Ideker, T., et al., Science, 2001. 292(5518): p. 929-34.

Perturb the system and compare

•  Yeast strains used:
–  9 knock-out (KO)
–  1 wild-type (WT)

DNA Microarray

Experimental data = model?


Decontaminator: Modeling contaminants in AP-MS/MS experiments

Protein interactions obtained by Tandem Affinity Purification

[Figure: a tagged bait with its preys, contaminants, and background proteins. Pipeline: Cell Culture -> TAP -> SDS-PAGE -> 2D-LC -> MS/MS -> Database search]

False positive sources

•  Non-specificity of TAG antibody
•  Faulty purification
•  Misidentification
•  Carry-over
•  Over-expression
•  Gel contamination

Experimental improvements

•  In-cell normal expression
•  Additional purifications
•  LC column washing
•  Robot gel band cutting

Computational filtering

•  Krogan et al., 2006; Chua et al., 2006; Ewing et al., 2007; Cloutier et al., 2009; Collins et al., 2007
•  Peptide/ProteinProphet (Keller et al., Nesvizhskii et al.); Percolator (Kall et al.)
•  DeContaminator (Lavallee-Adam et al., JPR)

Simple contaminant detection

Manually label all contaminants and systematically remove all interactions with these proteins.

Limitations:
•  A contaminant for one bait might be a true interaction for another.
•  Could not detect sporadic contamination.

[Figure: baits connected to their preys]

Two experiments for a given bait b

•  Induced experiment: expression of the bait vector is induced.
•  Control experiment: expression of the bait vector is not induced.

MRatio method: For a prey p:
    If MS_Score(b_induced, p) < 5 * MS_Score(b_control, p)
        p is a contaminant
    Else
        p is truly interacting with b

[Jeronimo et al., 2007]

Limitations:
•  Expensive both in terms of time and resources
•  One-to-one comparisons of noisy low abundance prey MS/MS results
•  Control might show leaky expression of the bait
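The MRatio rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `mratio_classify`, the dictionary layout, the treatment of preys missing from the control (score 0), and the example scores are all hypothetical. Only the factor of 5 comes from the slide.

```python
def mratio_classify(induced, control, ratio=5.0):
    """Label each prey of one bait using the MRatio rule.

    induced, control: dicts mapping prey name -> MS score in the induced
    and control (non-induced) experiments of the same bait.
    Assumption: a prey absent from the control gets a control score of 0.
    """
    labels = {}
    for prey, induced_score in induced.items():
        control_score = control.get(prey, 0.0)
        if induced_score < ratio * control_score:
            labels[prey] = "contaminant"
        else:
            labels[prey] = "interaction"
    return labels


# Hypothetical scores for a single bait.
induced = {"P1": 300.0, "P2": 80.0, "P3": 45.0}
control = {"P1": 20.0, "P2": 40.0}
labels = mratio_classify(induced, control)
```

Each prey needs its own matched control score, which is why the method doubles the number of experiments.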

Alternate contaminant detection method

•  Goal: Use a limited number of controls for the proper identification of contaminants in TAP-MS/MS PPI data.
•  Advantages:
   •  No one-to-one comparisons of MS scores have to be performed.
   •  Accurate modeling with limited resource usage.
   •  Using a limited number of high-quality controls avoids expression leakiness issues.

DeContaminator (Lavallee-Adam et al., Journal of Proteome Research)

Objective: Compute the posterior distribution of $MCM_p$ given all MS score observations $CM_{b,p}$ $\forall b \in B$

[Figure: distributions over Mascot score. From the non-induced observations $M^{NI}_{b_i,p}$, the posterior $\Pr[\bar{M}^{NI}_p \mid M^{NI}_{b_1,p}, \ldots, M^{NI}_{b_{14},p}]$; from the induced observation $M^{I}_{b,p}$, the posterior $\Pr[\bar{M}^{I}_{b,p} \mid M^{I}_{b,p}]$; the two are compared to obtain the p-value of $M^{I}_{b,p}$.]

Discussion

Our proposed Bayesian method shows an improvement in accuracy for the detection of contaminant PPIs in our dataset when compared to currently used alternate approaches. We expect that this decrease in false positive interactions will facilitate the analysis of PPI networks and ease the characterization of novel biological pathways. At the same time, our approach will greatly reduce experimental costs by cutting the number of most experimental manipulations almost in half. This expense reduction is due to the much smaller number of control experiments needed by our algorithm compared to the methods described in Jeronimo et al. [16], where each induced experiment requires a matched non-induced experiment for its interactions to be classified. It is also worth noting that in theory, the control experiments provided as input to the algorithm could all be performed with the same bait protein. However, we used non-induced experiments produced from different baits, by different experimentalists at different time periods. These biological and technical replicates allow us to factor in the noise resulting from the change of baits in TAP-MS/MS experiments and technical variation.

Advantages

PPIs are often viewed and studied as a network. Several algorithms (e.g.13,14,16) use the topology

of this network to determine whether an interaction is a likely true or a false positive. The reasoning

is based on the fact that if two putatively interacting proteins also share similar sets of interacting

partners, they are more likely to form a complex and therefore to be truly interacting. However, this


3. For each pair $(b, p) \in B \times P$, assign a p-value to $M^{I}_{b,p}$:

$$\text{p-value}(M^{I}_{b,p}) = \Pr[\bar{M}^{NI}_p \ge \bar{M}^{I}_{b,p} \mid M^{I}_{b,p}, M^{NI}_{b_1,p}, M^{NI}_{b_2,p}, \ldots, M^{NI}_{b_{14},p}]$$

4. Using the non-induced and full induced data sets, assign a false discovery rate to each p-value.

Slide notation (CM: control MS score, IM: induced MS score; MCM and MIM: the corresponding means):

$$\Pr[MCM_p \mid CM_{b_1,p}, \ldots, CM_{b_{14},p}]$$

$$\Pr[MIM_{b,p} \mid IM_{b,p}]$$

$$\text{p-value}(IM_{b,p}) = \Pr[MCM_p \ge MIM_{b,p} \mid IM_{b,p}, CM_{b_1,p}, CM_{b_2,p}, \ldots, CM_{b_{14},p}]$$

Each step is detailed further below and illustrated in Figure 2.

Step 1: Building a noise model from non-induced experiments

The set of 14 TAP-MS/MS experiments where the bait’s expression was not induced can be seen as a set of biological replicates of the null condition. We use these measurements to assess the amount of noise in each replicate, i.e. to estimate $\Pr[M^{NI}_{b,p} \mid \bar{M}^{NI}_p]$, the probability of a given observation $M^{NI}_{b,p}$ given its true mean Mascot score $\bar{M}^{NI}_p$.




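[Figure labels: Control MS score mean distribution; Induced MS score mean distribution]

Modeling Contaminants

Non-induced results from the set of baits B are pooled. Using a weighted k-nearest neighbours smoothing of the frequency of each $MCM_p$ value, conditional on $CM_{b,p}$ values $\forall b \in B$, we obtain an estimate of:

$$\Pr[MCM_p = mcm \mid CM_{b,p} = cm]$$

After Bayes rule:

$$\Pr[CM_{b,p} = cm \mid MCM_p = mcm]$$

The posterior distribution of $MCM_p$ scores is then:

$$\Pr[MCM_p \mid CM_{b_1,p}, \ldots, CM_{b_{14},p}] = \frac{\Pr[MCM_p] \prod_{i=1}^{14} \Pr[CM_{b_i,p} \mid MCM_p]}{Z}$$

where $Z$ is a normalizing constant.

The posterior distribution of $MIM_{b,p}$ is computed in a similar fashion:

$$\Pr[MIM_{b,p} \mid IM_{b,p}]$$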
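The posterior computation can be illustrated on a tiny discretized example. This is a sketch under stated assumptions, not the DeContaminator implementation: scores are binned into three hypothetical levels, the likelihood table is made up, observations are treated as independent given the mean, and the control and induced posteriors are treated as independent when computing Pr[MCM ≥ MIM].

```python
def posterior(prior, likelihood, observations):
    """Posterior over a discretized mean score.

    prior[m]         = Pr[mean = m]
    likelihood[m][x] = Pr[observation = x | mean = m]
    observations     = observed (discretized) scores, assumed independent
                       given the mean
    """
    unnormalized = list(prior)
    for x in observations:
        unnormalized = [w * likelihood[m][x] for m, w in enumerate(unnormalized)]
    z = sum(unnormalized)  # normalizing constant
    return [w / z for w in unnormalized]


def contaminant_p_value(post_control, post_induced):
    """Pr[control mean >= induced mean], treating both posteriors as independent."""
    p = 0.0
    for m_induced, weight in enumerate(post_induced):
        p += weight * sum(post_control[m_induced:])
    return p


# Hypothetical 3-level model: level 0 = low score, level 2 = high score.
prior = [1 / 3, 1 / 3, 1 / 3]
likelihood = [
    [0.80, 0.15, 0.05],  # true mean at level 0
    [0.15, 0.70, 0.15],  # true mean at level 1
    [0.05, 0.15, 0.80],  # true mean at level 2
]

post_control = posterior(prior, likelihood, [0, 0, 0])  # prey low in all controls
p_strong = contaminant_p_value(post_control, posterior(prior, likelihood, [2]))
p_weak = contaminant_p_value(post_control, posterior(prior, likelihood, [0]))
```

A prey that scores high only when the bait is induced gets a small p-value (`p_strong`), while a prey that looks the same in both conditions gets a large one (`p_weak`).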





This distribution, $\Pr[M^{NI}_{b,p} \mid \bar{M}^{NI}_p]$, is estimated using a leave-one-out cross-validation approach on the set of 14 non-induced experiments. Specifically, for each bait $b \in B$, we compare $M^{NI}_{b,p}$ to $\mu_{\neq b,p}$, the corrected average (see Supplementary Information) of the 13 Mascot scores of $p$ in all non-induced experiments except where bait $b$ was used. $\mu_{\neq b,p}$ provides a good estimate of $\bar{M}^{NI}_p$. Let $C(i, j)$ be the number of bait-prey pairs for which $\lfloor M^{NI}_{b,p} \rfloor = i$ and $\lfloor \mu_{\neq b,p} \rfloor = j$. Then, a straightforward estimator is

$$\Pr[M^{NI}_{b,p} = x \mid \bar{M}^{NI}_p = y] = \frac{C(x,y)}{\sum_{y'} C(x,y')}.$$

Note that $C$ is a fairly large matrix (the number of rows and columns is set to 1000; larger Mascot scores are capped at 1000). In addition, aside from the zero-th column $C(\cdot,0)$, it is quite sparsely populated, as the sum of all entries is 40306. Thus, the above formula yields a very poor estimator. Matrix $C$ therefore needs to be smoothed to matrix $C_s$ using a k-nearest neighbors smoothing algorithm. Specifically, let $N_\delta(i, j) = \{(i', j') : |i - i'| \le \delta, |j - j'| \le \delta\}$ be the set of neighboring matrix entries.


Contaminants Assessment

p-value that prey p is a contaminant for bait b:

$$\text{p-value}(IM_{b,p}) = \Pr[MCM_p \ge MIM_{b,p} \mid IM_{b,p}, CM_{b_1,p}, CM_{b_2,p}, \ldots, CM_{b_{14},p}]$$

False Discovery Rate (FDR) for an interaction with a given p-value:

$$\text{FDR}(\text{p-value}) = \frac{\sum_{b \in B} |\{np \in NP_b \mid np \le \text{p-value}\}| \,/\, |NP_b|}{\sum_{b \in B} |\{ip \in IP_b \mid ip \le \text{p-value}\}| \,/\, |IP_b|}$$

$NP_b$ and $IP_b$ are the sets of non-induced and induced interaction p-values.
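The FDR for a p-value threshold can be computed directly from the two collections of p-values. The sketch below is illustrative only: the function name `fdr`, the dictionary layout (bait -> list of p-values), and the numbers are hypothetical, and it assumes both dictionaries are keyed by the same set of baits B.

```python
def fdr(threshold, non_induced, induced):
    """Ratio of the per-bait fractions of p-values at or below the threshold.

    non_induced[b] plays the role of NP_b and induced[b] the role of IP_b.
    Assumption: both dicts share the same baits and no list is empty.
    """
    numerator = sum(
        sum(1 for np in non_induced[b] if np <= threshold) / len(non_induced[b])
        for b in non_induced
    )
    denominator = sum(
        sum(1 for ip in induced[b] if ip <= threshold) / len(induced[b])
        for b in induced
    )
    if denominator == 0:
        return float("nan")  # nothing is called at this threshold
    return numerator / denominator


# Hypothetical p-values for two baits.
non_induced = {"b1": [0.01, 0.5, 0.9, 0.7], "b2": [0.2, 0.6, 0.8, 0.95]}
induced = {"b1": [0.001, 0.01, 0.02, 0.6], "b2": [0.005, 0.03, 0.4, 0.9]}
rate = fdr(0.05, non_induced, induced)
```

Here one of eight control p-values passes the threshold against five of eight induced ones, so the estimated FDR at 0.05 is 0.25 / 1.25 = 0.2.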

Protein-Protein Interaction Network
•  89 baits and 11894 interactions [Fortier, Lacombe et al., 2010]
•  Human cell line: HEK293
•  Proteins in the network are mainly involved in transcription and RNA processing.
•  14 representative baits out of the 89 have been selected for control experiments.

False Discovery Rates

[Figure: number of predicted interactions (0 to 8000) versus FDR (0 to 0.5), for the Z-score approach and DeContaminator]

FDR 1%:
•  DeContaminator: 2430 interactions
•  Z-score approach: 1011 interactions

IsoRank: Comparison of PPI networks

Comparative Genomics

Look at the same kind of data across species with the hope that areas of high correlation correspond to functional parts or modules of the genome.

Why understanding function-level differences is important

•  Increased complexity (function) is not explained simply by variations in gene (or protein) count

Estimated Number of Genes:     6600   21000   14000   24500   23000
Estimated Number of Proteins:  6600   27000   19000   32000   49000

Numbers from http://www.ensembl.org


Protein-Protein Interactions (PPIs)
•  Often, proteins interact with other proteins to perform their functions
•  Many cellular activities are a result of protein interactions

[Figure: MAPK Signaling Cascade. Image from: http://focosi.altervista.org/mapkmap2.html]

Modeling PPIs
•  Traditional perspective: low-throughput, structural
•  New perspective: high-throughput, network-based

[Figure: G-protein complex (Gα, Gβ, Gγ, GDP) shown from the traditional structural perspective (image from www.rcsb.org) and from the new systems-level perspective]

Protein-Protein Interaction (PPI) Network

[Figure: Yeast PPI Network. http://internal.binf.ku.dk]

[Figure: Yeast 2-Hybrid method, X + Y = ? Cusick et al., Hum Med Gen, 05]

Motivation behind Network Comparison

•  Compare PPI networks at the species level

•  Transfer annotation from one species to another

–  More feasible, cheaper and easier than in humans

–  Error detection

•  Compute functional orthologs

–  Functional orthologs: proteins which perform the same function across species

The Problem

Given two protein-protein interaction networks, find, for a piece of one network, something that has a comparable structure in the other network.

Our approach: match neighborhood topologies

Algorithm: IsoRank

[Figure: network A (nodes a1 to a8) and network B (nodes b1 to b9)]

Sequence similarity:

3e-9   b6   a3
5e-4   b1   a3
1e-4   b9   a5
1e-7   b3   a5
…
2e-8   b1   a5
1e-2   b7   a5

Functional similarity for each possible node pairing:

a5   b7   2.1
a5   b9   1.5
a3   b2   3.4


Functional Similarity Score: Intuition

•  Compute pairwise scores Rij
•  Goal: “high Rij” ⇒ “i and j are a good match”
•  Intuition: i and j are a good match if their sequences align and their neighbors are a good match

[Figure: example graphs with nodes a1 to a5 and b1 to b5. Ra5,b1 = ?]

Computing Rij
•  Combine both sequence and network data

Rij = (1-α)Eij + αNij

where Rij is the functional similarity, Eij the sequence similarity, and Nij the network similarity.

Simple Case: α=1 (no Eij)

•  Rij = Nij. Rij depends on neighborhoods of i and j
•  N(a) is the set of neighbors of a

$$R_{ij} = N_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)||N(v)|} R_{uv}$$

[Figure: example graphs with nodes a1 to a5 and b1 to b5]

Example with a single pair of neighbors (a1, a2, b3, b4 highlighted):

$$R_{a_1,b_4} = \frac{1}{2 \times 3} R_{a_2,b_3}$$

Example with several pairs of neighbors (a1, a2, a3, b1, b2, b3 highlighted):

$$R_{a_2,b_2} = \frac{1}{1 \times 1} R_{a_1,b_1} + \frac{1}{1 \times 3} R_{a_1,b_3} + \frac{1}{3 \times 1} R_{a_3,b_1} + \frac{1}{3 \times 3} R_{a_3,b_3}$$

Example: Computed Rij values

[Figure: example graphs with nodes a1 to a5 and b1 to b5]

R       b1       b2       b3       b4       b5
a1    0.0312            0.0937
a2             0.1250            0.0625   0.0625
a3    0.0937            0.2813
a4             0.0625            0.0312   0.0312
a5             0.0625            0.0312   0.0312

Empty cell indicates Rij = 0
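The computed values can be reproduced with a short iteration of the α=1 update. This is a sketch under assumptions rather than the IsoRank implementation: the two edge lists are inferred from the neighbor counts in the worked equations (they are not stated on the slides), the function names are hypothetical, each update is averaged with the previous R, and the iteration is seeded on the pair (a1, b1). The averaging and the seed are added here because, on these small bipartite graphs, the plain update oscillates and has more than one fixed point.

```python
def neighbors(edges):
    """Build an adjacency map from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def isorank_simple(adj1, adj2, seed, iterations=300):
    """Iterate R_ij = sum_{u in N(i), v in N(j)} R_uv / (|N(u)| |N(v)|).

    Assumption: each update is averaged with the previous R (a "lazy" step)
    so that the iteration settles instead of oscillating.
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    R = {pair: 0.0 for pair in pairs}
    R[seed] = 1.0
    for _ in range(iterations):
        new_R = {}
        for i, j in pairs:
            total = 0.0
            for u in adj1[i]:
                for v in adj2[j]:
                    total += R[(u, v)] / (len(adj1[u]) * len(adj2[v]))
            new_R[(i, j)] = 0.5 * R[(i, j)] + 0.5 * total
        R = new_R
    return R


# Edge lists inferred from the worked equations (an assumption).
g1 = neighbors([("a1", "a2"), ("a2", "a3"), ("a3", "a4"), ("a3", "a5")])
g2 = neighbors([("b1", "b2"), ("b2", "b3"), ("b3", "b4"), ("b3", "b5")])
R = isorank_simple(g1, g2, seed=("a1", "b1"))
```

With these graphs the fixed point is proportional to the product of the two node degrees, which gives 1/32 for (a1, b1), 4/32 for (a2, b2) and 9/32 for (a3, b3), matching the table up to rounding.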



Capturing non-local effects?

•  The algorithm can resolve between p-r vs. p-q

[Figure: nodes p, q and r] Rpr = 8.12e-3, Rpq = 8.64e-3

Computing R: an eigenvalue problem

•  The equations for R describe an eigenvalue problem

$$R_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)||N(v)|} R_{uv}$$

$$R = AR, \qquad A[i,j][u,v] = \frac{1}{|N(u)||N(v)|} \text{ if } u \in N(i) \text{ and } v \in N(j), \text{ and } 0 \text{ otherwise}$$

A is of size N1N2 × N1N2, where N1 = # nodes in Graph 1 and N2 = # nodes in Graph 2.

R is the principal eigenvector of A

•  A is about 10^8 × 10^8 when aligning yeast and fly networks
   –  However, both A and R are very sparse
   –  We use the Power method to efficiently compute R
•  Extension to weighted edges is straightforward

A Random Walk Interpretation

[Figure: Tensor product G1 × G2. Each node of the product graph is a pair such as (i,j) or (u,v), with one node from G1 and one from G2; edges carry weights $\frac{1}{|N(u)||N(v)|}$ and $\frac{1}{|N(i)||N(j)|}$.]

General Case: 0 ≤ α ≤ 1

•  Let Bij = sequence similarity score between i (from graph #1) and j (from graph #2)
•  Eij = Bij/|B|_1

Simple case: $R = AR$

General case: $R = \alpha AR + (1-\alpha)E, \quad 0 \le \alpha \le 1$
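The general update can be run as a plain power iteration, since for α < 1 the sequence term keeps the iteration from oscillating. The following is a minimal sketch, not the IsoRank code: the function names, the toy graphs, and the scores in `B` are hypothetical, and `B` is assumed to hold non-negative similarity scores so that dividing by their sum gives E = B/|B|_1. The default α = 0.6 is the value reported for the yeast-fly alignment.

```python
def neighbors(edges):
    """Build an adjacency map from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def isorank_step(R, adj1, adj2, E, alpha):
    """One update R <- alpha * A R + (1 - alpha) * E."""
    new_R = {}
    for i, j in R:
        network = sum(
            R[(u, v)] / (len(adj1[u]) * len(adj2[v]))
            for u in adj1[i]
            for v in adj2[j]
        )
        new_R[(i, j)] = alpha * network + (1 - alpha) * E.get((i, j), 0.0)
    return new_R


def isorank(adj1, adj2, B, alpha=0.6, iterations=200):
    """Power iteration from a uniform start; returns (R, E)."""
    total = sum(B.values())
    E = {pair: score / total for pair, score in B.items()}  # E = B / |B|_1
    R = {(i, j): 1.0 / (len(adj1) * len(adj2)) for i in adj1 for j in adj2}
    for _ in range(iterations):
        R = isorank_step(R, adj1, adj2, E, alpha)
    return R, E


# Hypothetical toy input: two 3-node paths and three similarity scores.
g1 = neighbors([("x1", "x2"), ("x2", "x3")])
g2 = neighbors([("y1", "y2"), ("y2", "y3")])
B = {("x1", "y1"): 2.0, ("x2", "y2"): 1.0, ("x3", "y3"): 1.0}
R, E = isorank(g1, g2, B)
```

Setting `alpha` closer to 1 lets network topology dominate, while values closer to 0 make R follow the sequence scores.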

Results: Yeast-Fly Global Alignment
•  # of edges in the common subgraph: 1420
   •  Implies about 5% overlap! Why so low?
   •  PPI data currently is noisy and low-coverage
•  # of edges in the largest component: 35
•  The value of α used: 0.6
   •  Provided best overall agreement with previous gene correspondence predictions


Various Topologies Are Found

Existing local alignment methods (PathBlast; Kelley et al.) often find only specific topologies

Role of α: why the dip?

Robustness to Error in PPI data

[Figure: example network with nodes a1 to a11, shown twice; the accompanying plot is annotated “True curve somewhere around here”]

Functional Orthologs •  Genes that perform similar functions

–  “functional orthologs” vs “plain old orthologs”

–  distinguish between orthologs and paralogs

•  Bandyopadhyay et al. [Genome Res. ’06]

–  Use local network alignment results

–  Then use a MRF to partially resolve ambiguities

•  We compared our results with theirs

Functional Orthologs: IsoRank Pairwise Alignment Predictions

Protein    Functional Ortholog
           IsoRank    Bandyopadhyay et al.
Gid8       CG6617     CG6617 76%, CG18467 ---
Gpa1       Goα47a     Goα47a 41%, Giα65a ---
Kap104     Trn        Trn 41%, CG8219 47%
CG18617    Vph1       Vph1 43%, Stv1 48%
Egd1       Bic        Bcd 47%
