hugh shanahan association talk
DESCRIPTION
Talk I gave at this year's CCC Summer School at Zhejiang University, Hangzhou
TRANSCRIPT
Associative methods in Systems Biology
Hugh Shanahan
Outline
Gene Ontologies
  Over-representation
  Semantic similarity
Associative Measures
  Hypotheses
  Linear Correlation
  Partial Correlation
  Non-linear measures
Validation
  DREAM
Associative methods in Systems Biology
Hugh Shanahan
Department of Computer Science
Royal Holloway, University of London
September 22, 2009
Hugh Shanahan Associative methods in Systems Biology
Outline
1 Outline
2 Gene Ontologies
    Over-representation
    Semantic similarity
3 Associative Measures
    Hypotheses
    Linear Correlation
    Partial Correlation
    Non-linear measures
4 Validation
    DREAM
Gene Ontologies
Before finding interactions, need to be able
  to systematically annotate all genes
  to determine which functional groupings are over-represented
  to measure objectively the “functional similarity” of two genes.
Gene Ontology (GO) is a means to do this.
Ontologies
Abstract method for expressing structured data.
Annotation of any gene can be expressed in terms of increasingly specific descriptions, e.g.
  This gene is involved in transport.
  This gene is involved in vesicle-mediated transport.
  This gene is involved in vesicle fusion.
Genes may not have a very specific annotation, so the definition can stop higher up in this hierarchy.
More complexity required
Annotation is not a simple chain.
A single gene can have a very specific annotation, which comes from two (or more) more general descriptions.
Different types of annotation as well: location, biochemistry, part of organism expressed in, and so on.
An Ontology is a Directed Acyclic Graph (DAG), not a Tree (this means a lot to Graph Theorists).
Each node in the DAG is an annotation term.
Each “child” node can have more than one “parent” node.
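The multiple-parent structure can be sketched in a few lines of Python. This is an illustration only: the term names follow the vesicle fusion example used in this talk, the intermediate terms between them and the root are left out, and `PARENTS` and `ancestors` are hypothetical names, not part of any GO library.

```python
# Each GO term maps to the list of its parent terms (child -> parents).
# A term with two parents is what makes this a DAG rather than a tree.
PARENTS = {
    "biological process": [],
    "transport": ["biological process"],
    "membrane organization and biogenesis": ["biological process"],
    "vesicle-mediated transport": ["transport"],
    "membrane fusion": ["membrane organization and biogenesis"],
    "vesicle fusion": ["vesicle-mediated transport", "membrane fusion"],
}


def ancestors(term, parents=PARENTS):
    """Return the set containing `term` and every term above it in the DAG."""
    found = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("vesicle fusion")))
```

A gene annotated with "vesicle fusion" is implicitly annotated with every term this function returns, which is why an annotation can stop higher up in the hierarchy without contradicting a more specific one.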
GO’s visualised
[Figure 1 of Rhee et al., as shown on the slide. Panel a: a simple tree with parent and child nodes. Panel b: a directed acyclic graph. Panel c: the term "vesicle fusion" in the biological process ontology, which is part_of "vesicle-mediated transport" (is_a "transport") and is_a "membrane fusion" (is_a "membrane organization and biogenesis"), both paths leading up to the root "biological process". An arrow marks increasing specificity and/or granularity from parent to child.]

... not been shown robustly. As of October 2007, there are over 16 million GO annotations. Strikingly, over 95% of these annotations are computationally derived and have not been manually curated; these are associated with the ‘inferred from electronic annotation’ (IEA) evidence code. Most of these annotations come from the GO annotation project at the European Bioinformatics Institute (GOA; ref. 9). In addition to the GOA set, individual model organisms also have a substantial portion of annotations not derived from direct experimental evidence (TABLE 2). Among the 27 organisms with more than 5,000 annotations, the portion of genes with at least one experimentally derived annotation varies widely from 1.1% to 90.9%. Although computational and indirectly derived annotations increase coverage significantly, they probably contain a higher portion of false positives. Researchers who use GO annotations should be cognizant of the differences between annotations associated with different evidence codes.

Figure 1 | Simple trees versus directed acyclic graphs. Boxes represent nodes and arrows represent edges. a | An example of a simple tree, in which each child has only one parent and the edges are directed, that is, there is a source (parent) and a destination (child) for each edge. b | A directed acyclic graph (DAG), in which each child can have one or more parents. The node with multiple parents is coloured red and the additional edge is coloured grey. c | An example of a node, vesicle fusion, in the biological process ontology with multiple parentage. The dashed edges indicate that there are other nodes not shown between the nodes and the root node (biological process). A root is a node with no incoming edges, and at least one leaf (also called a sink). A leaf node is a node with no outgoing edges, that is, a terminal node with no children (vesicle fusion). Similar to a simple tree, a DAG has directed edges and does not have cycles, that is, no path starts and ends at the same node, and will always have at least one root node. The depth of a node is the length of the longest path from the root to that node, whereas the height is the length of the longest path from that node to a leaf (ref. 41). is_a and part_of are types of relationships that link the terms in the GO ontology. More information about the relationships between GO terms is found online (An Introduction to the Gene Ontology).

Table 1 | Evidence codes used by GO
Evidence code | Evidence code description | Source of evidence | Manually checked | Current number of annotations*
IDA | Inferred from direct assay | Experimental | Yes | 71,050
IEP | Inferred from expression pattern | Experimental | Yes | 4,598
IGI | Inferred from genetic interaction | Experimental | Yes | 8,311
IMP | Inferred from mutant phenotype | Experimental | Yes | 61,549
IPI | Inferred from physical interaction | Experimental | Yes | 17,043
ISS | Inferred from sequence or structural similarity | Computational | Yes | 196,643
RCA | Inferred from reviewed computational analysis | Computational | Yes | 103,792
IGC | Inferred from genomic context | Computational | Yes | 4
IEA | Inferred from electronic annotation | Computational | No | 15,687,382
IC | Inferred by curator | Indirectly derived from experimental or computational evidence made by a curator | Yes | 5,167
TAS | Traceable author statement | Indirectly derived from experimental or computational evidence made by the author of the published article | Yes | 44,564
NAS | Non-traceable author statement | No ‘source of evidence’ statement given | Yes | 25,656
ND | No biological data available | No information available | Yes | 132,192
NR | Not recorded | Unknown | Yes | 1,185
*October 2007 release

Rhee et al., Nature Reviews Genetics, (2008) (page 510, July 2008, Volume 9, www.nature.com/reviews/genetics)
GO’s visualised
http://amigo.geneontology.org/
Different types of Annotation
Typically, there are three distinct ontologies (overwhelmingly used).
  Cellular Component
  Biological Process
  Molecular Function
Many other ontologies have been constructed, e.g. Plant Organ for Arabidopsis.
Caveat
The annotation of most genes (90%) has been carried out computationally. The annotations usually work pretty well, though they have a tendency not to be as accurate as those obtained by direct assay.
All annotated genes have an evidence code (IED) associated with them in order to demonstrate how much we can rely on it.
Increasingly, co-expression is being used as a means to annotate genes, so one should be careful not to use this information in constructing annotations !
Over-representation
One of the most useful tools to hand when one analyses micro-array data is to ask what functional groupings occur more often than one expects.
Notation
  N number of genes in the genome.
  n number of genes which have been found to be differentially expressed.
  m number of genes in the genome with a specific annotation.
  k number of genes which are differentially expressed with the same annotation.
Probabilities
One can derive the probability $P_k$ that k genes would be found by chance amongst n genes using the hypergeometric probability distribution and the above parameters.
For the record
$$P_k = \frac{{}^{m}C_{k}\;{}^{N-m}C_{n-k}}{{}^{N}C_{n}} , \qquad (1)$$
$${}^{N}C_{n} = \frac{N!}{(N-n)!\,n!} . \qquad (2)$$
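Equation (1) can be evaluated directly with exact integer binomial coefficients. The sketch below is illustrative: the function names `prob_k` and `p_value` and the example numbers are made up, and the p-value is taken to be the chance of seeing k or more annotated genes, which is the usual one-sided over-representation test.

```python
from math import comb


def prob_k(k, N, n, m):
    """Equation (1): chance of exactly k annotated genes among n selected."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def p_value(k, N, n, m):
    """Probability of seeing k or more annotated genes by chance."""
    return sum(prob_k(i, N, n, m) for i in range(k, min(n, m) + 1))


# Made-up numbers: 20000 genes, 200 differentially expressed,
# 100 genes carrying the annotation, 8 of them differentially expressed.
print(p_value(8, N=20000, n=200, m=100))
```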
Difficulties
There are thousands of possible GO terms and one should adjust the probabilities to deal with multiple hypotheses.
Applying Bonferroni correction using all GO terms gives a p-value of $10^{-7}$ equivalent to 1% significance.
Because of the structure of the GO terms these probabilities are highly correlated with each other.
In all these cases focussing on as short a list of possible biological processes as possible will minimise the above difficulties.
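The Bonferroni arithmetic is simple enough to write down. In this sketch the function names are hypothetical, and the figure of 100000 tests is an assumed round number chosen only because it reproduces the threshold quoted above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value needed for an overall significance level alpha."""
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# 1% overall significance spread over 100000 tests (an assumed round number).
print(bonferroni_threshold(0.01, 100_000))
print(bonferroni_adjust([1e-8, 0.002, 0.5]))
```

Testing a shorter list of candidate processes raises the per-test threshold in direct proportion, which is the practical reason for narrowing the list first.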
What genes match
In benchmarking methods to infer interactions between gene products, we expect genes that interact to have similar GO terms, though perhaps not entirely the same.
Semantic Similarity is a means to measure how similar the annotations of two genes are (0 being no similarity, 1 meaning total similarity).
GO provides us with a means to do this in a quantitative fashion.
Simple implementation
Determine the ratio of the number of nodes two genes share with the total number of nodes they have between them.
$$\mathrm{GOsim}_{UI} = \frac{\#\{N(G_1) \cap N(G_2)\}}{\#\{N(G_1) \cup N(G_2)\}} \qquad (3)$$
$N(G_1)$ being the set of nodes associated with $G_1$'s annotation.
More elaborate methods are available.
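Equation (3) is a ratio of set sizes, so Python sets express it directly. The function name `gosim_ui` and the two example genes are hypothetical; each gene's node set is assumed to already include all ancestor terms of its annotation.

```python
def gosim_ui(nodes_1, nodes_2):
    """Equation (3): shared nodes divided by all nodes of the two genes."""
    union = nodes_1 | nodes_2
    if not union:
        return 0.0  # neither gene is annotated
    return len(nodes_1 & nodes_2) / len(union)


# Hypothetical annotations, each written out with all of its ancestor terms.
gene_1 = {"biological process", "transport", "vesicle-mediated transport",
          "vesicle fusion"}
gene_2 = {"biological process", "transport", "vesicle-mediated transport"}

print(gosim_ui(gene_1, gene_2))  # 3 shared nodes out of 4
```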
Motivation
Yesterday, encountered clustering.
Hypothesis 1 (weak) :- coexpression implies involvement in the same process.
Expand to many different experiments.
Hypothesis 2 (strong) :- the greater the level of association, the greater the chance of interaction.
Hypothesis 2 is often referred to as “guilt by association”.
Association may tell us about interactions between gene products. It does not tell us anything about the regulation mechanism.
http://www.arabidopsis.leeds.ac.uk/act/index.php
266841_at AT2G26150 heat shock transcription factor family protein, contains Pfam profile: PF00447 HSF-type DNA-binding domain
260978_at AT1G53540 17.6 kDa class I small heat shock protein
What do we mean by association ?
Knowing something about the expression level of one gene (over many different experiments) means we know something about the expression level of the other.
Replotting the above
Linear Correlation
coexpression
Simplest form of association.
Assume that there is a linear relationship between genes.
Formally :-
$$y_1 = a_{12} + c_{12} y_2 + \eta_{12} , \qquad (4)$$
  $y_1$, $y_2$ are (log) expression levels.
  $\eta_{12}$ noise term.
  $a_{12}$, $c_{12}$ parameters to be determined.
But we’re not interested in that !
We are interested in asking how good a model this is for this pair of genes.
Covariance
Can estimate how good the linear model is by computing
$$E((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)) ,$$
where $\bar{y}_1 = E(y_1)$ and $\bar{y}_2 = E(y_2)$ are the means of $y_1$ and $y_2$.
E means the expectation value of the above (think of it for now as taking the average over all the points in the previous figure).
Can prove to oneself (exercise) that the magnitude of the covariance is largest when $y_1$ can be perfectly expressed as a linear function of $y_2$.
The covariance is zero when there is no relationship at all between $y_1$ and $y_2$.
[Figure: two scatter plots with axes y1 and y2, one titled "Maximum covariance" and the other "Zero covariance".]
Correlation
We want to compare every possible pair of genes, so using the covariance is not very practical since the maximum covariance will vary from one pair of genes to another.
However,
$$\rho_{12} = \frac{E((y_1 - \bar{y}_1)(y_2 - \bar{y}_2))}{\sqrt{E((y_1 - \bar{y}_1)^2)\,E((y_2 - \bar{y}_2)^2)}} , \qquad (5)$$
is bounded: $-1 \le \rho_{12} \le 1$.
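The covariance and its normalised form can be computed by replacing each expectation with an average over the experiments. This is a minimal sketch with made-up expression values; the function names are illustrative and the third gene is constructed by hand so that its covariance with the first is zero.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(y1, y2):
    """E((y1 - mean(y1)) (y2 - mean(y2))), averaged over the experiments."""
    m1, m2 = mean(y1), mean(y2)
    return mean([(a - m1) * (b - m2) for a, b in zip(y1, y2)])


def correlation(y1, y2):
    """Equation (5): covariance scaled so that the result lies in [-1, 1]."""
    return covariance(y1, y2) / sqrt(covariance(y1, y1) * covariance(y2, y2))


# Made-up log expression levels for two genes over five experiments.
gene_a = [0.6, 0.9, 1.2, 1.5, 1.8]
gene_b = [2.0 * y + 0.3 for y in gene_a]   # exact linear function of gene_a
gene_c = [1.0, -1.0, 0.0, -1.0, 1.0]       # constructed to be uncorrelated

print(correlation(gene_a, gene_b))
print(correlation(gene_a, gene_c))
```

Rescaling `gene_b` changes its covariance with `gene_a` but not the correlation, which is what makes the correlation comparable across pairs of genes.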
How well does it work ?
Number of examples of improved functional annotation.
An unannotated gene which is highly correlated with a gene in a known response is likely to be in the same response.
Difficulty : genes correlate with many other genes, not just a few.
Why ?
Suggestion : correlations do not distinguish between potential direct interactions and indirect interactions between gene products.
Example
[Diagram: network of genes A, B, C, D, E and F, with "Other interactions" indicated.]
B directly interacts with three other genes, but could be highly correlated with others.
C and D would be highly correlated with each other even though they are not directly interacting.
Artificial example: create randomised data to represent expression of B.
Generate two other sets of data (C and D) that are by construction highly correlated to the original data set, but are not set out to have a relationship with each other.
[Figure: three scatter plots. B and C: ρ = 0.98. B and D: ρ = 0.98. C and D: ρ = 0.96 (!!!)]
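The artificial example can be repeated with a few lines of Python. The sketch below is not the code used for the slides: the noise level, the number of points and the seed are assumptions chosen so that the correlations come out close to the values quoted above.

```python
import random
from math import sqrt


def correlation(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)          # fixed seed so the run is repeatable
n_points = 1000
noise = 0.2                     # assumed noise level, chosen for illustration

b = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
# C and D each copy B plus their own independent noise;
# nothing links C to D except their shared dependence on B.
c = [value + rng.gauss(0.0, noise) for value in b]
d = [value + rng.gauss(0.0, noise) for value in b]

print("rho(B, C) =", round(correlation(b, c), 2))
print("rho(B, D) =", round(correlation(b, d), 2))
print("rho(C, D) =", round(correlation(c, d), 2))
```

C and D end up strongly correlated although neither was generated from the other, which is the indirect effect that plain correlation cannot separate from a direct interaction.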
Extending correlations :- partial correlations
Correlations only take pairs of genes into consideration.
Partial correlations extend the initial pairwise regression model introduced in equation 4.
$$y_1 = a_1 + b_{12} y_2 + b_{13} y_3 + \cdots + b_{1p} y_p + \eta_1 . \qquad (6)$$
Again, we are not interested in solving this explicitly.
We want to understand the correlation that each one of the genes $y_2 \dots y_p$ has with $y_1$ once we have removed the effect of all the other genes.
We will use the notation $PC_{ij}$ to refer to this partial correlation.
Derivation
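Computed easily once you have all the correlations between all the genes.
$$R = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} & \dots \\ \rho_{12} & 1 & \rho_{23} & \dots \\ \rho_{13} & \rho_{23} & 1 & \dots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} , \qquad (7)$$
Covariance matrix C is defined similarly.
$\rho_{ij}$ is the correlation between gene i and gene j.
$$PC_{ij} = -\frac{R^{-1}_{ij}}{\sqrt{R^{-1}_{ii}\,R^{-1}_{jj}}} \qquad (8)$$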
Questions
$\rho_{ij} = \rho_{ji}$ - why ?
Diagonals are 1 - why ?
Exercise :- compute PC using the covariance matrix.
Artificial example
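$$R_{BCD} = \begin{pmatrix} 1.0 & 0.96 & 0.98 \\ 0.96 & 1.0 & 0.98 \\ 0.98 & 0.98 & 1.0 \end{pmatrix} , \qquad (9)$$
$$PC_{BCD} = \begin{pmatrix} -1.0 & -0.01 & 0.70 \\ -0.01 & -1.0 & 0.70 \\ 0.70 & 0.70 & -1.0 \end{pmatrix} . \qquad (10)$$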
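Equation (8) applied to the matrix of equation (9) can be checked numerically. This sketch assumes NumPy is available; the function name `partial_correlations` is illustrative.

```python
import numpy as np


def partial_correlations(R):
    """Equation (8): PC_ij = -R^-1_ij / sqrt(R^-1_ii * R^-1_jj)."""
    R_inv = np.linalg.inv(R)
    scale = np.sqrt(np.diag(R_inv))
    return -R_inv / np.outer(scale, scale)


# Correlation matrix of equation (9).
R = np.array([[1.00, 0.96, 0.98],
              [0.96, 1.00, 0.98],
              [0.98, 0.98, 1.00]])

print(np.round(partial_correlations(R), 2))
```

The pair with correlation 0.96 drops to a partial correlation close to zero, while the two pairs with correlation 0.98 keep a partial correlation near 0.70. The diagonal entries are -1 simply because of the sign convention in equation (8).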
Disadvantages of using Partial correlations
High partial correlations no longer tend to go to 1 (or −1).
Dependent on ranking.
How large should/can p (the number of genes examined) be ?
Taking inverses of matrices should make us jumpy - especially when there is limited data.
Problem also dates to computing correlations.
“Large p, small n”
Notation
p :- number of variables (in this case, expression of a gene)
n :- number of measurements (total number of affy slides)
R has of the order $p^2$ ($p(p-1)/2$ to be exact) potentially interesting correlations.
Could be dealing with of the order $10{,}000^2$ variables.
Have at best a few thousand measurements per gene :- $n \sim 1000$.
If $p \ll n$, then the definition of equation (5) gives a robust estimate of all those correlations, but that is not where we are !
Artificial example
[Figure: four panels of ordered eigenvalues (vertical axis "eigenvalue", horizontal axis 0 to 100), for p/n = 0.1, p/n = 0.5, p/n = 2 and p/n = 10.]
Figure 1: Ordered eigenvalues of the sample covariance matrix S (thin black line) and that of an alternative estimator S* (fat green line, for definition see Tab. 1), calculated from simulated data with underlying p-variate normal distribution, for p = 100 and various ratios p/n. The true eigenvalues are indicated by a thin black dashed line.
Schäfer and Strimmer, Statistical Applications in Genetics and Molecular Biology, 4, 1, (2005)
Explanation
Spectrum of eigen-values.
Any eigen-value equal to zero means matrix is non-invertible.
Dashed lines - actual eigenvalues.
Thin black lines - estimated eigenvalues using equation (5).
Green line - improved estimator.
In general, if n < p then the correlation/covariance matrix is non-invertible.
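The loss of invertibility when n < p is easy to see in a small simulation. This sketch assumes NumPy is available and uses independent normal data with made-up sizes; `covariance_rank` is a hypothetical helper, not taken from the cited paper.

```python
import numpy as np


def covariance_rank(n, p, seed=0):
    """Rank and smallest eigenvalue of a sample covariance matrix.

    n is the number of measurements, p the number of variables (genes).
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n, p))          # rows are measurements
    S = np.cov(data, rowvar=False)          # p x p sample covariance matrix
    return np.linalg.matrix_rank(S), np.linalg.eigvalsh(S)[0]


for n in (500, 20):
    rank, smallest = covariance_rank(n, p=50)
    print(f"n={n}: rank {rank} of 50, smallest eigenvalue {smallest:.2e}")
```

With fewer measurements than variables the rank cannot exceed n - 1, so some eigenvalues are zero up to rounding error and the inverse needed for the partial correlations does not exist.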
Strategies
Reduce p - cluster data initially, then perform analysis on each cluster. (Toh and Horimoto, 2002).
Compute lower order partial correlations - compute first order partial correlations (Magwene and Kim, 2004).
Employ improved estimator of correlations (Schäfer and Strimmer, 2005).
These options are not necessarily exclusive.
Shrinkage estimate - wordy explanation
What is computed in equation (5) is an estimate of the correlation based on the available data, not the actual correlation we would obtain if we knew the underlying multivariate distribution. The two would coincide if we had many more samples. That said, we can use other estimators of correlation. Statisticians have pointed out that many other estimators are possible, and some work better in the regime we are in (large p, small n).
Shrinkage estimates attempt to combine different naive estimates to get an improved estimate. The principle has been around for some time (Stein, 1956), though its use has increased significantly in the last ten years.
Shrinkage estimates - the details
Notation:
$C$ is the actual covariance matrix.
$\hat{C}$ is an estimate of the covariance matrix.
In computing $\hat{C}$ we could either attempt to compute it using the standard definition (“full”) or assume (for example) that all the off-diagonal entries are zero (“reduced”).
$\hat{C}_F$ for the “full” estimate of the covariance matrix.
$\hat{C}_R$ for the “reduced” estimate of the covariance matrix.
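Mean Square Error
Defining
\begin{align}
\mathrm{MSE}(\hat{C}) &= E\big((\hat{C} - C)^2\big) \tag{11} \\
&= \mathrm{Var}(\hat{C}) + \mathrm{Bias}^2(\hat{C}) \tag{12} \\
\mathrm{Var}(\hat{C}) &= E\big((\hat{C} - E(\hat{C}))^2\big) \tag{13} \\
\mathrm{Bias}(\hat{C}) &= E(\hat{C}) - C \tag{14}
\end{align}
(The expectation operator is over the data that we have.)
Bias($\hat{C}_F$) is small but Var($\hat{C}_F$) will be large.
Var($\hat{C}_R$) will be small but Bias($\hat{C}_R$) will be large.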
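The step from the definition of the mean square error to its split into variance and squared bias can be spelled out for a single matrix entry. Here $\hat{C}$ denotes the estimate and $C$ the true value; this is the standard argument, written out as a sketch rather than taken from the slides.

```latex
\begin{align*}
\mathrm{MSE}(\hat{C})
  &= E\big((\hat{C} - C)^2\big) \\
  &= E\big((\hat{C} - E(\hat{C}) + E(\hat{C}) - C)^2\big) \\
  &= E\big((\hat{C} - E(\hat{C}))^2\big)
     + 2\,\big(E(\hat{C}) - C\big)\,E\big(\hat{C} - E(\hat{C})\big)
     + \big(E(\hat{C}) - C\big)^2 \\
  &= \mathrm{Var}(\hat{C}) + \mathrm{Bias}^2(\hat{C}) .
\end{align*}
```

The cross term vanishes because $E\big(\hat{C} - E(\hat{C})\big) = 0$, and $E(\hat{C}) - C$ is a constant with respect to the expectation.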
The problem
Depending on the assumptions we use to estimate the correlation/covariance matrix
we can either compute a very poor estimate of the parameters in a very accurate model,
or compute a good estimate of the parameters for a very inaccurate model (!)
But maybe we can reconcile the two...
Combining the two
One can combine these two estimates:
\begin{align}
C^{*} &= \lambda \hat{C}_R + (1 - \lambda)\,\hat{C}_F \tag{15} \\
0 &\le \lambda \le 1 \tag{16}
\end{align}
choosing a $\lambda$ such that $\mathrm{MSE}(C^{*})$ is minimised.
Computing $\lambda$ is normally very expensive.
Ledoit and Wolf (2003) came up with a short analytical way of computing $\lambda$; Schäfer and Strimmer modified this for genomic data.
We have an R package to do this.
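The effect of the convex combination can be seen in a small sketch on synthetic data. Note the assumptions: the shrinkage intensity `lam` is hand-picked for illustration and is not the analytic optimum of Ledoit and Wolf or Schäfer and Strimmer, the "reduced" estimate is taken to be the diagonal of the sample covariance, and the function name is hypothetical.

```python
import numpy as np

def shrink_covariance(X, lam):
    """Convex combination of a diagonal ("reduced") and sample ("full") covariance."""
    C_full = np.cov(X, rowvar=False)
    C_reduced = np.diag(np.diag(C_full))   # all off-diagonal entries set to zero
    return lam * C_reduced + (1.0 - lam) * C_full

rng = np.random.default_rng(2)
p, n = 100, 20                             # large p, small n
X = rng.standard_normal((n, p))

lam = 0.3                                  # illustrative value, NOT an optimised one
C_full = np.cov(X, rowvar=False)
C_star = shrink_covariance(X, lam)

min_eig_full = np.linalg.eigvalsh(C_full).min()   # numerically zero: singular
min_eig_star = np.linalg.eigvalsh(C_star).min()   # strictly positive: invertible
print(f"smallest eigenvalue: full = {min_eig_full:.2e}, shrunk = {min_eig_star:.2e}")
```

Any lam > 0 lifts the zero eigenvalues as long as the sample variances are positive, so the combined estimate can be inverted; choosing lam well is what the analytic formulas address.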
Final Note
The Schäfer and Strimmer estimate uses the “zero-covariance” low-dimensional model for its estimate, but this isn’t necessarily the best choice.
Notably, while shrinkage estimates make much of incorporating information, they don’t explicitly include biological information.
Results
Using E. coli time series data (8 time slices), Schäfer and Strimmer examined 102 genes using correlations and partial correlations.
[Figure: network diagrams with the genes as nodes; panels (a) Shrinkage GGM network, (b) Lasso GGM network, (c) Relevance network.]
Figure 5: Gene networks inferred from the E. coli data by (a) the shrinkage GGM approach presented in this paper (Tab. 1), (b) the lasso GGM approach by Meinshausen and Bühlmann (2005), and (c) the relevance network with abs(r) > 0.8. Black and grey edges indicate positive and negative (partial) correlation, respectively.
Schäfer and Strimmer: Large-Scale Covariance Matrix Estimation, The Berkeley Electronic Press, 2005
Panels shown on the slide: Correlations (Relevance Network) and Partial Correlations (Graphical Gaussian Network).
Results of comparison
Recovers the centrality of the sucA gene.
The lacA, lacZ and lacY genes have the largest absolute partial correlations.
So far, we have concerned ourselves with linear relationships.
However, such an approximation may not be valid.
Naively, one expects a more non-linear relationship between gene products.
For example, Transcription Factor - target interactions are typically modelled using Michaelis-Menten kinetics.
Furthermore, expression levels are derived after a number of transformations.
One approach: Spearman correlation
Basic idea: use ranks rather than raw data.
Then apply nearly the same definition as for linear (Pearson) correlation.
Must be careful about ties, i.e. raw data having precisely the same value (unlikely for floating point data).
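The idea can be sketched in a few lines: replace each value by its rank and compute the ordinary Pearson correlation of the ranks. The example below assumes there are no ties and uses a synthetic saturating (Michaelis-Menten-like) response; the constant `K`, the data and the function names are all hypothetical.

```python
import numpy as np

def ranks(v):
    """Ranks 0..n-1 of the entries of v (assumes no ties)."""
    order = np.argsort(v)
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(len(v))
    return r

def pearson(a, b):
    """Ordinary linear correlation coefficient."""
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    """Pearson correlation of the ranks (valid when there are no ties)."""
    return pearson(ranks(a), ranks(b))

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=500)   # synthetic regulator level
K = 1.0                                # hypothetical half-saturation constant
y = x / (K + x)                        # monotone but strongly non-linear response

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson = {r_pearson:.3f}, Spearman = {r_spearman:.3f}")
```

Because y increases monotonically with x, the ranks agree exactly and the Spearman correlation is 1, while the Pearson correlation is noticeably lower. With tied values the ranks would need to be averaged, which this sketch does not do.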
Comparison
Comparison of different measures.
Many other non-linear measures are possible, the best known being Mutual Information.
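To make the contrast concrete, here is a rough sketch of a histogram-based ("plug-in") mutual information estimate on synthetic data. The bin count, sample size and function name are assumptions for illustration; practical estimators need more care with binning and bias.

```python
import numpy as np

def mutual_information(a, b, bins=10):
    """Plug-in estimate of mutual information (in nats) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = counts / counts.sum()                 # joint probabilities
    p_a = p_ab.sum(axis=1, keepdims=True)        # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)        # marginal of b, shape (1, bins)
    nz = p_ab > 0                                # skip empty cells (0 log 0 = 0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=5000)
y_dep = x ** 2                                   # depends on x, but not monotonically
y_ind = rng.uniform(0.0, 1.0, size=5000)         # independent of x

r_dep = np.corrcoef(x, y_dep)[0, 1]              # near zero despite the dependence
mi_dep = mutual_information(x, y_dep)            # clearly positive
mi_ind = mutual_information(x, y_ind)            # near zero (small positive bias)
print(f"r = {r_dep:.3f}, MI(dependent) = {mi_dep:.2f}, MI(independent) = {mi_ind:.3f}")
```

Linear and rank correlations both miss a symmetric relationship such as y = x^2, whereas mutual information picks it up.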
Modelling of gene interactions
We have only just touched upon methods for inferring interactions between gene products using transcriptomic data. Some of the others include the use of
Mutual Information/Spearman Correlations - address non-linearities.
Kinetic models - attempt to infer interactions.
Boolean Networks - model interactions as circuitry.
Petri Nets - Prof. Ming Chen.
Bayesian Networks - Dr. Chris Needham.
Machine Learning methods - unsupervised/semi-supervised/supervised learning.
Integration of other data sources.
. . .
Explosion of methods
[Figure: page 160 of Annals of the New York Academy of Sciences, reproduced on the slide.]
FIGURE 1. The number of publications retrieved from a PubMed search of “Pathway Inference” or “Reverse Engineering.” On average, this number has been doubling every two years since about 1995. (Color is shown in online version.)
reverse-engineering methods was going to be a key prerequisite to their increasing value to biology. Indeed, at that time, the field of reverse engineering biological networks was beginning to experience considerable expansion, which generated much confusion about which methods were truly valuable from a practical perspective. Evidence of this trend is shown in Figure 1, where we report the number of publications retrieved from a PubMed search of “Pathway Inference” or “Reverse Engineering.” Figure 1 shows what appears to be an exponential growth, in which citations to these key words have been roughly doubling every two years for the last decade or so.
While such growth in the number of reverse engineering-related publications was fueled by very innovative and elegant computational methods for network reconstruction (arising from physics, computer science, mathematics, and engineering), the group shared the feeling that, ultimately, an algorithm’s worth was to be found in the experimentally validated quality of its predictions. A key problem is that computational methods can, in the blink of an eye, generate large numbers of predictions, from a few hundred to hundreds of thousands, most (if not all) of which usually go untested. Even worse, and this would be a best case scenario, a very small subsample of predictions (usually three or more, but rarely more than ten) would be validated using sound experimental assays and then presented as valuable criteria for the soundness of the entire set of predictions. Thus, a clear characterization of the relative strengths and weaknesses of the algorithms on an objective basis was usually missing. It should be noted that the same could be said for high-throughput (and even low-throughput) experimental approaches, whose false positive and, equally importantly, false negative rates are rarely considered a requisite for publication. This generated, for instance, more than a few puzzled looks when the first experimentally generated, genome-wide interactomes in yeast [4–6] showed minimal overlap.
At the spring 2006 meeting, it was agreed that while the obstacles for the creation of a
Stolovitzky et al., Ann. N.Y. Acad. Sci. (2009)
Validation - DREAM
While there are a colossal number of methods out there, their validation is very much in its infancy.
DREAM (Dialogue for Reverse Engineering Assessments and Methods) is an attempt to deal with this question.
Features: Unseen experimental data of (for example)
Transcription Factor binding sites,
artificial data (in silico),
genome-wide interactions,
is gathered and groups are invited to reproduce the interactions. The different groups' results are then compared against the data to determine how well they did. New challenges are presented on an annual basis.
Some Results from DREAM 2 (2007)
Challenge 1: Identify targets of transcription factor BCL6.
53 genuine targets of BCL6 inferred from unpublished ChIP-chip data.
147 decoys added.
Task: identify the genuine targets among the decoys by picking genes with expression patterns similar to BCL6.
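A correlation-based baseline for this kind of task can be sketched as follows. Everything here is synthetic and hypothetical: the expression profiles are simulated, the number of samples (60) and the signal strength are assumptions, and only the counts of 53 targets and 147 decoys are taken from the challenge description. It is not a reconstruction of any submitted method.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_targets, n_decoys = 60, 53, 147    # sample count is an assumption

tf = rng.standard_normal(n_samples)             # synthetic regulator profile
# Synthetic targets follow the regulator plus noise; decoys are pure noise.
targets = 0.8 * tf + rng.standard_normal((n_targets, n_samples))
decoys = rng.standard_normal((n_decoys, n_samples))
candidates = np.vstack([targets, decoys])
is_target = np.array([True] * n_targets + [False] * n_decoys)

def abs_corr_with(profile, matrix):
    """|Pearson correlation| between one profile and each row of a matrix."""
    p = (profile - profile.mean()) / profile.std()
    m = (matrix - matrix.mean(axis=1, keepdims=True)) / matrix.std(axis=1, keepdims=True)
    return np.abs(m @ p) / len(profile)

scores = abs_corr_with(tf, candidates)
top = np.argsort(scores)[::-1][:n_targets]      # the 53 highest-scoring candidates
precision_at_k = is_target[top].mean()          # fraction of those that are true targets
print(f"fraction of true targets in the top {n_targets}: {precision_at_k:.2f}")
```

The absolute value is used so that both activated and repressed genes score highly. On real data the separation between targets and decoys is far less clean than in this simulation.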
Results
Best approaches were selective in the data sets employed. (Data sets where BCL6 was highly expressed or not expressed at all were used.)
Semi-supervised learning was employed - using known targets of BCL6 to train the best method.
Used correlations.
1st-order partial correlations did badly.
Basic correlations came close to the most sophisticated approaches.
More results from DREAM
Challenge 2: E. coli network.
RegulonDB provides good evidence for the targets of assorted Transcription Factors.
Task: identify targets.
Results
Best method used Mutual Information and the ideas behind 1st-order partial correlation.
Correlations and Partial Correlations were not too far behind.
Level of identification of targets was low - perhaps 5%.
Conclusions
GO terms allow us to handle large amounts of annotation in a structured fashion.
Associative measures are a first attempt at using the huge amounts of expression data that are out there.
Very simple ideas such as correlation work surprisingly well (or rather, more complicated methods of association don’t give orders of magnitude better performance).
A long way to go nonetheless.
The type of expression data we select and a clear understanding of the microarray/RNA-seq/... technology may be even more important.