netbiosig2012 chrisevelo

In modern systems biology we have three main data domains.

1) Experimental data from genomics-type experiments, such as the microarrays in
the example (bottom right). Note that this type requires intensive
precalculations (quality control, filtering, clustering, annotation), but that is
not enough to really understand the data. You see patterns in the data, but
you do not really know what they mean. Large-scale genomics data has
been available for the past 15 years or so, and although the technologies
used are now being replaced, that doesn’t really change this field.

2) Existing knowledge (see next slide), which can be used to better understand
the two other types of data.

3) Genetics (sequence-based) data, which is rapidly becoming more important as
sequencing costs decrease. The addition of the leftmost corner to the
triangle is relatively new, and I will only discuss it in the last few slides.

2

Huge amounts of existing knowledge can be found hidden in the literature or in
the heads of people. The hard task is to collect it from there and to make it
available for analysis. (People on the slide are Ben van Ommen, NuGO
director; Hannelore Daniel, nutrigenomics chair from Munich; and a Thai
princess and institute director.)

Note that a lot of information is also available in curated databases, but that
was left out of the talk for brevity. You could say that the other knowledge
needs to be structured to provide these databases, which can then be used
for analysis.

3

A historical example of a microarray result. Again, note the intensive
preprocessing done (clustering to the left, annotation to the right).
Nevertheless, the data is very hard to understand, especially if you take into
account that there are about 20,000 genes on a typical array: about as many
as there are words in a dictionary.

But if you are willing to make the effort, you can actually see meaningful groups
of genes within specific coexpression clusters, like the fatty acid degradation
genes shown here. But it is hard to find (or easy to miss) all relevant pathways.

Probably not an iPad; those microarrays were at least 10 years old.

6

The problem is not only the long list of resulting genes, but also the
oversampling that occurs. In genomics experiments you typically get large
numbers of false positives at useful levels of significance. Of course false
discovery rate corrections exist, but they usually also lose information.
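To make the scale of the problem concrete, the sketch below counts the false positives expected by chance and then applies a Benjamini-Hochberg false discovery rate correction. It is only an illustration: the p-values are made up, the 0.05 thresholds are assumptions, and Benjamini-Hochberg is one common correction, not necessarily the one used for the data in this talk.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of chance hits if no gene is truly changed."""
    return n_tests * alpha


def benjamini_hochberg(p_values: list[float], fdr: float = 0.05) -> list[bool]:
    """Flag, per input p-value, whether it passes the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose sorted p-value is <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    passed = [False] * m
    for index in order[:cutoff_rank]:
        passed[index] = True
    return passed


# About 20,000 genes tested at p < 0.05 gives roughly 1,000 chance hits.
print(expected_false_positives(20_000, 0.05))

# Made-up p-values: five are below 0.05, but only two survive the correction.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
uncorrected = sum(p < 0.05 for p in p_values)
corrected = sum(benjamini_hochberg(p_values))
print(uncorrected, corrected)
```

The three values that drop out after correction are the kind of lost information referred to above: some of them may well be real effects.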

Pathway or function group (ontology) analysis helps, since it is not likely that a
larger set of genes occurs as false positives within a smaller functional group.
On the other hand, the meaning of pathway statistics should not be
overestimated. There are many aspects in real biology and in the way the
groups are built that influence the statistical outcome.

For instance, take two metabolic reactions where one is catalyzed
by a single enzyme and the other by four. Are all enzymes of the same
importance? Or are the four together as important as the single one? Or are three
of the four not important in reality, while the other one is? All these situations can
occur and the statistics just do not know.

Also, suppose you add 10 non-regulated genes to a pathway. That will change
the significance of your result, but it doesn’t change the biology behind it.
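The effect of adding non-regulated genes can be shown with a small calculation. The sketch below uses a one-sided hypergeometric (over-representation) test, which is one common pathway statistic and is an assumption here, not necessarily the statistic used in the tools discussed. All numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_total: int, n_regulated: int,
                       pathway_size: int, pathway_hits: int) -> float:
    """One-sided hypergeometric p-value: P(at least pathway_hits regulated)."""
    max_hits = min(pathway_size, n_regulated)
    favourable = sum(
        comb(n_regulated, k) * comb(n_total - n_regulated, pathway_size - k)
        for k in range(pathway_hits, max_hits + 1)
    )
    return favourable / comb(n_total, pathway_size)


# Hypothetical experiment: 20,000 genes measured, 1,000 called regulated.
N_TOTAL, N_REGULATED = 20_000, 1_000

# A 20-gene pathway with 6 regulated genes ...
p_original = enrichment_p_value(N_TOTAL, N_REGULATED, 20, 6)
# ... and the same pathway after adding 10 non-regulated genes.
p_padded = enrichment_p_value(N_TOTAL, N_REGULATED, 30, 6)

print(f"original: {p_original:.2e}, padded: {p_padded:.2e}")
```

The same six regulated genes give a larger (less significant) p-value in the padded pathway, although nothing about the measured biology has changed.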

7

Example of a pathway that can be used for the purposes described.

A closer look at the same pathway.

Note that this uses MIM notation from the MIM PathVisio plugin.

In general, the connections between different genes and metabolites describe
the network underlying the pathway. Note that this is already quite complex,
since there are different ways to show what interacts with what.
Graphical notations that capture this, like MIM and SBGN, definitely help. The
result can be captured as descriptive relationships in BioPAX.

9

10

PathVisio can do a combined visualization of different omics results. Here
proteomics and transcriptomics are both shown on the same gene product boxes.
It can also show effects from metabolomics.

12

Examples of pathways as we have them on wikipathways.org.

13

This talk is not really about WikiPathways. Check out the information in the
paper or on the wiki itself (www.wikipathways.org). Developer
information is mainly on the www.pathvisio.org website.

http://www.wikipathways.org/

http://www.pathvisio.org/

14

You obtain microarray data (e.g. Affymetrix).

You can visualize microarray data.

Each color corresponds to a measured data point.

For example, green is up, red is down, grey is constant.

And now? How do you make sure the Affymetrix probeset IDs related to the

measurements can be mapped to the gene products in the pathway?

15

On WikiPathways (or in PathVisio) you can attach identifiers to each gene. A
click opens the corresponding page in (in this specific case) the worm
database.

You can download the corresponding transcript sequence in two clicks.

This makes it really easy to, for instance, design primers.

As soon as you have entered one (and only one) identifier to describe which
gene product or metabolite you really mean, this information is linked to many
other identifiers from other databases, and links to the respective pages are
shown in the so-called “backpage” (actually one of the pages under the tabs on
the right-hand side of the pathway).

16

BridgeDB (see www.bridgedb.org and the paper mentioned on the slide)

provides the mechanism needed for that identifier mapping.
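The idea behind that mapping step can be sketched in a few lines. This is a toy illustration and not the BridgeDB API: the mapping table, the probeset and gene identifiers, and the fold-change threshold of 1.0 are all made up for the example. It maps probeset-level measurements to gene products, averages multiple probesets per gene, and picks the color mentioned earlier (green is up, red is down, grey is constant).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping table (probeset ID -> gene ID); a real one would come
# from an identifier mapping resource. Several probesets may map to one gene.
PROBESET_TO_GENE = {
    "probe_001_at": "GENE_A",
    "probe_002_at": "GENE_A",
    "probe_003_at": "GENE_B",
}


def map_measurements(measurements, mapping):
    """Average log2 fold changes per gene; also return unmapped probesets."""
    values_per_gene = defaultdict(list)
    unmapped = []
    for probeset, value in measurements.items():
        gene = mapping.get(probeset)
        if gene is None:
            unmapped.append(probeset)
        else:
            values_per_gene[gene].append(value)
    per_gene = {gene: mean(values) for gene, values in values_per_gene.items()}
    return per_gene, unmapped


def expression_color(log2_fold_change, threshold=1.0):
    """Green is up, red is down, grey is constant (threshold is an assumption)."""
    if log2_fold_change >= threshold:
        return "green"
    if log2_fold_change <= -threshold:
        return "red"
    return "grey"


measurements = {
    "probe_001_at": 1.5,
    "probe_002_at": 2.5,
    "probe_003_at": -1.2,
    "probe_999_at": 0.3,  # not in the mapping table
}
per_gene, unmapped = map_measurements(measurements, PROBESET_TO_GENE)
colors = {gene: expression_color(value) for gene, value in per_gene.items()}
print(colors, unmapped)
```

Averaging is only one way to combine several probesets for the same gene product; keeping them separate and showing each one is an equally valid choice.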

17

http://www.bridgedb.org/

Pathways can be downloaded to be used in different tools.

There is also a WikiPathways web service. See:

http://www.wikipathways.org/index.php/Help:WikiPathways_Webservice

Thomas Kelder, Alexander R Pico, Kristina Hanspers, Chris Evelo & Bruce R
Conklin. Mining biological pathways using WikiPathways web services.
PLoS One (2009) 4(7): e6447. http://dx.doi.org/10.1371/journal.pone.0006447

We also have semantic output in RDF which can be queried through a

SPARQL endpoint described at semantics.bigcat.unimaas.nl.


Introducing a problem

19

And a solution that isn’t really a solution. There are just too many things you

could add.

20

The PathVisio Regulatory Interaction plugin (author Stefan van Helden) takes a
new approach in which information is not really added to a pathway, but shown
on a separate page upon request.

21

22

The GPML plugin for Cytoscape can be found here:

http://chianti.ucsd.edu/cyto_web/plugins/displayplugininfo.php?name=GPML-Plugin

It can be used in Cytoscape to read and write the GPML pathway files used by
WikiPathways and PathVisio.
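GPML is an XML format, so the gene products and their identifiers can also be read with a generic XML parser. The sketch below is not the plugin's code. It parses a small, simplified GPML-like fragment; the element and attribute names (DataNode, TextLabel, Type, Xref, Database, ID) and the namespace are written from memory of the GPML schema and should be checked against a real file, and the identifiers are placeholders.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical GPML-like fragment (real files carry much more,
# such as graphics, interactions and literature references).
SAMPLE_GPML = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Example pathway">
  <DataNode TextLabel="GeneA" GraphId="n1" Type="GeneProduct">
    <Xref Database="Entrez Gene" ID="1234"/>
  </DataNode>
  <DataNode TextLabel="MetaboliteB" GraphId="n2" Type="Metabolite">
    <Xref Database="HMDB" ID="HMDB00001"/>
  </DataNode>
</Pathway>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_data_nodes(gpml_text: str) -> list[dict]:
    """Collect label, type and cross-reference of every DataNode element."""
    root = ET.fromstring(gpml_text)
    nodes = []
    for element in root.iter():
        if local_name(element.tag) != "DataNode":
            continue
        xref = next((c for c in element if local_name(c.tag) == "Xref"), None)
        nodes.append({
            "label": element.get("TextLabel"),
            "type": element.get("Type"),
            "database": xref.get("Database") if xref is not None else None,
            "id": xref.get("ID") if xref is not None else None,
        })
    return nodes


nodes = read_data_nodes(SAMPLE_GPML)
for node in nodes:
    print(node)
```

Matching on the local tag name keeps the sketch independent of the exact GPML namespace version.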




23

Example showing some more advanced usage of the GPML plugin.

Data from the NuGO proof-of-principle study with dietary challenged mice.
Three tissues were sampled. In the two tissues other than liver, relatively many
genes showed expression changes on Affymetrix arrays, but not many
pathways were found.

For liver, the number of genes affected was lower, but the number of pathways
found to be affected was higher. How come?

The pathway-based network analysis showed that there was a set of more strongly
affected pathways (more regulated genes, large blue circles) that share
regulated genes (the red diamonds). When looking at the highlighted group of
pathways, it became clear that these all belong to the same superset of
biologically relevant pathways (fatty acid metabolism and inflammation).
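The core of such a pathway-gene network can be sketched as follows. This is a minimal illustration with placeholder pathway and gene names, not the data or the code of the study: each pathway is linked to its regulated genes, and pathway pairs are reported when they share at least one regulated gene.

```python
from itertools import combinations


def shared_regulated_genes(pathways, regulated):
    """Return regulated genes per pathway and those shared by pathway pairs."""
    hits = {name: genes & regulated for name, genes in pathways.items()}
    shared = {}
    for first, second in combinations(sorted(hits), 2):
        common = hits[first] & hits[second]
        if common:
            shared[(first, second)] = common
    return hits, shared


# Placeholder pathways and genes, not data from the study.
pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4"},
    "pathway_B": {"g1", "g2", "g5"},
    "pathway_C": {"g6", "g7"},
}
regulated = {"g1", "g2", "g6"}

hits, shared = shared_regulated_genes(pathways, regulated)
for (first, second), genes in shared.items():
    print(first, second, sorted(genes))
```

In a network view, the size of each pathway node would follow the number of hits, and the shared genes would be the nodes connecting the pathways.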

24

A paper that we published with a more extensive pathway relationship
approach. It takes into account relations between pathways through affected
genes that do not necessarily show up in either pathway.

The approach takes into account all data used (pathways, interactions and
experimentally determined weight). Check out the original paper for details.

26

Example result. Pathways with stronger interaction based on genes not
present in them.

27

And you can do the same for relatively large sets of pathways “driving” a

process like apoptosis.

28

CyTargetLinker is a Cytoscape plugin that can be used to extend a network
with information about things that target entities in that network. This
information comes from databases that are themselves created as networks. It
already provides a number of target relation databases, as mentioned on the
slide.

29

Example of a target network. (You will normally see this, it contains the

information that is used to extend your source network).

30

31

And a more detailed view.

You can drive it from a gene set that isn’t even a network at the start. But
when miRNAs are found to target more than one gene in the group, the
network is created on the fly.
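The logic of that step can be sketched like this. It is an illustration of the idea only, not CyTargetLinker's implementation: the regulator and gene names are placeholders, and the rule "keep a miRNA when it targets at least two genes in the set" is written as a parameter because the exact cut-off is an assumption.

```python
from collections import defaultdict


def extend_with_regulators(gene_set, target_relations, min_targets=2):
    """Return (regulator, gene) edges for regulators hitting enough genes."""
    targets_by_regulator = defaultdict(set)
    for regulator, gene in target_relations:
        if gene in gene_set:
            targets_by_regulator[regulator].add(gene)
    edges = []
    for regulator in sorted(targets_by_regulator):
        targets = targets_by_regulator[regulator]
        if len(targets) >= min_targets:
            edges.extend((regulator, gene) for gene in sorted(targets))
    return edges


# Placeholder gene set and target relations (regulator, target gene).
gene_set = {"g1", "g2", "g3"}
target_relations = [
    ("mir-x", "g1"),
    ("mir-x", "g2"),
    ("mir-y", "g3"),
    ("mir-y", "g8"),  # target outside the gene set
    ("mir-z", "g9"),
]

edges = extend_with_regulators(gene_set, target_relations)
print(edges)
```

Starting from a plain gene set, the returned edges are what turns it into a network: each regulator becomes a node connected to the genes it targets.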

32

Or you can bootstrap the approach from an existing network, which can be a
pathway-based one imported with the GPML plugin, as shown here.

33

An overview of the Open Phacts project, which pulls lots of information into a
semantic web triple store (including information from WikiPathways RDF) and
then provides that for use in other tools. In WikiPathways we use that to
suggest possible pathway extensions to curators.

34

This shows the PathVisio Loom plugin in action. A gene or metabolite in a
pathway under development (left side) is right-clicked and Loom is
activated to pull related genes or metabolites from another resource
(database, text mining result or Open Phacts API). The suggested interactions
are shown in the window on the right, and the entities are added to the pathway
(two are already shown on the left).

The talk so far focused on the genomics-knowledge relationship shown on the
right. So what about genetics?

36

38

This is the image that was shown to us by Jim Kaput (at that time NCTR, now
Nestle): “Look, people grouped those SNPs in gene groups, made sense of the
directions and showed them in a pathway. Can you do something like that?”

In principle? Yes.

39

There are just too many SNPs for any given gene.

40

So it would really look like a bunch of jellies if we showed all of these on the
genes in a pathway, and you would not know what they mean.

41

There are loads of bioinformatics tools out there (like SIFT and PolyPhen) that
allow us to estimate the functional effects of SNPs on the encoded protein
(activity or protein-protein interactions), on binding sites for transcription
factors in the DNA, or on binding sites for miRNAs in the RNA. Doing that, we
can decide which edges SNPs would affect (and how much, in what direction).
As soon as you do that, you can use the result to strengthen SNP statistics
(i.e. create groups that can be used for supervised types of group-based GWAS
analysis) or to build predictive models to estimate what specific (personal or
tissue/tumor-based) sets of variations would do. That creates a need to use the
pathways to link experimental (genomics) data not only to the genetic
variations occurring in them, but also to modeling results.

42

Showing the concept: integrating flux predictions from modelling (of course
that could also be real fluxomics data).

43

44

And showing “real” results from the new flux data representation plugin.
The plugin is functional, but we still need better mapping databases for
reaction identifiers.

Many people were involved in this work. (Really many, if you count associated
groups like the plugin developers, pathway curators etc.)

Most important:

SF group (Kristina Hanspers, Bruce Conklin and Alex Pico), collaborating on
many things but primarily WikiPathways.

Martijn van Iersel, top left (PathVisio, BridgeDB); Thomas Kelder, top middle
(WikiPathways including web services, pathway integration networks for
nutrigenomics); Martina Kutmon, top right (CyTargetLinker, PathVisio further
development); Andra Waagmeester, second row, right (WikiPathways RDF);
Anwesha Dutta, bottom, 2nd from the left (flux visualization); Stefan van
Helden, not in the picture (RI PathVisio plugin).

45
