Post on 06-Oct-2020

Medicel Integrator platform - views to current themes in systems biology

Tommi Aho

Computational Systems Biology 1

6.2.2008

(slides modified from material of Medicel)

Outline

•Part 1: Biology is complex
•Part 2: How to model biology – in theory
•Part 3: Is the data available?
•Part 4: Data integration

Part 1 – Biology is complex

http://www.studiodaily.com/main/technique/tprojects/6850.html

Part 2 – How to model biology – in theory

Biology

anatomy, biochemistry, botany, cell biology, ecology, evolution, genetics, immunology, histology, microbiology, parasitology, pathology, pharmacology, zoology

Modeling

Structural models, graph models, FBA, ODE

dx/dt = f(x(t), u(t), t)
y(t) = g(x(t), u(t), t)

Partial differential equations, statistics, optimization

Data Integration

XML, SBML, SQL, RDF, OGSA-DAI, etc.
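
As a minimal sketch of the ODE form listed above, the following Python code integrates dx/dt = f(x(t), u(t), t) with a forward Euler step and evaluates the output y(t) = g(x(t), u(t), t). The one-variable model (constant input, first-order degradation), the rate k and all function names are illustrative assumptions, not taken from the slides.

```python
def simulate(f, g, u, x0, t_end, dt):
    """Forward Euler integration of dx/dt = f(x, u(t), t).

    Returns a list of (t, y) pairs with y = g(x, u(t), t).
    """
    steps = int(round(t_end / dt))
    x = x0
    out = []
    for i in range(steps + 1):
        t = i * dt
        out.append((t, g(x, u(t), t)))
        x = x + dt * f(x, u(t), t)  # Euler update
    return out


# Hypothetical model: production driven by input u, degradation with rate k
k = 0.5


def f(x, u, t):
    return u - k * x


def g(x, u, t):
    return x  # the state itself is observed


def u(t):
    return 1.0  # constant input


trajectory = simulate(f, g, u, x0=0.0, t_end=20.0, dt=0.01)
final_t, final_y = trajectory[-1]
```

For this toy model the output approaches the steady state u/k. Forward Euler is chosen only for brevity; a stiff or larger model would need a proper ODE solver.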

Part 3 – Is the data available?

Part 3 – Long history of research

•1924: Spemann and Mangold reveal the phenomenon of primary embryonic induction

•1970: Nieuwkoop et al. show that animal hemisphere cells are induced to become mesoderm by signals from vegetal hemisphere cells (so it depends on primary polarity axis definition)

•1990: ...

Fig. from Scott Gilbert, Developmental Biology, Sinauer Press

Part 3 – Documentation of components

•1990: Asashima et al., Smith et al. and Sokol et al. show that activin A (TGF-β) induces mesoderm. Then noggin and Vg1 join the list. (The chicken ovalbumin gene was cloned 10 years earlier.)

•> 5000 references to text
•> 150 images

Part 3 – The components

•Once upon a time: One gene, one protein, one function

activin

Cell differentiation

Part 3 – More functions

•One gene, many functions...

activin

Gonadotropin release

Cell differentiation

Inflammation

Carbohydrate metabolism

Protein & steroid metabolism

Part 3 – More components

•Redundancy, overlapping, specificity, divergence...

activin

Gonadotropin release

Cell differentiation

Inflammation

Carbohydrate metabolism

Protein & steroid metabolism

noggin

Vg1

wnt

Part 3 – Store and share the data

•Scientific documentation today

The user hard disk

Part 3 – Data is far away

•typical

Part 3 – Data should be at hand

•integrated

Part 3 – Conclusions

•Most of the data is never shared
•No systematic data accumulation
•Lacking metadata: what parameter was measured, where did the sample come from, and when was the parameter measured?
•Seriously impairs our competitiveness
•IT solutions needed - biomedical researchers cannot resolve the problems alone

Part 4 – Data integration

“Integration is difficult”

Stein, L.D., Integrating Biological Databases. Nature Rev. Genet. 4, 337-345 (2003)

Part 4 – Integration example

Model
- as SBML file
- 612 compounds with IDs

Model
- as Excel file
- 1039 compounds with IDs somewhat similar to those of the SBML model
- 756 corresponding KEGG IDs

KEGG database
- 1843 compounds with KEGG IDs

Part 4 – Integration example

SBML model vs. Excel model: 538 same, 74 SBML compounds not found, 501 Excel compounds not found

Excel model vs. KEGG database: 588 same, 169 Excel KEGG IDs not found, 1255 KEGG compounds not found
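
The kind of pairwise comparison summarized in this example can be sketched with Python sets. This is a simplified illustration that assumes exact ID matches; the identifiers below are toy values, not the compounds of the models above, and `compare_ids` is a hypothetical helper.

```python
def compare_ids(left, right):
    """Count identifiers shared by two sources and those missing on either side."""
    left, right = set(left), set(right)
    return {
        "same": len(left & right),
        "left_not_found": len(left - right),   # in left, missing from right
        "right_not_found": len(right - left),  # in right, missing from left
    }


# Toy identifiers standing in for compound IDs of two models
sbml_ids = ["glc", "atp", "adp", "pyr"]
excel_ids = ["glc", "atp", "nadh", "nad", "co2"]

result = compare_ids(sbml_ids, excel_ids)
```

For each source, "same" plus its "not found" count adds up to the total number of IDs in that source. With only "somewhat similar" IDs, a normalization or mapping step would be needed before the set comparison.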

Part 4 – Integration difficulties

•Diversity of data
•Heterogeneity of available databases:
› Data stored in different formats
› Often no schema (i.e. structural definition) available

Part 4 – Integration difficulties

•Conflicts of terms: What is a gene?
•Namespace difficulties (1):

› One object, multiple names

e.g. P53_HUMAN: P04637, Cellular tumor antigen p53, Antigen NY-CO-13,Tumor suppressor p53, Phosphoprotein p53, p53, ...

P04637 = P53_HUMAN = Phosphoprotein p53 = Tumor suppressor p53 = ...

Part 4 – Integration difficulties

•Namespace difficulties (2):
› Multiple objects, one name

e.g. P53 refers to

• a set of proteins across different species

• a set of transcripts encoding those proteins

• a set of genes encoding those transcripts

Common name: Object 1 != Object 2 != Object 3 != Object 4 != ...
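
Both namespace problems can be illustrated with a small name index: many names resolve to one reference object, and one name may resolve to several objects of different kinds. This is a minimal sketch; the `NameIndex` class and the gene identifier `gene:TP53` are hypothetical and only serve the illustration.

```python
from collections import defaultdict


class NameIndex:
    """Maps case-insensitive names to (kind, reference_id) pairs."""

    def __init__(self):
        self._by_name = defaultdict(set)

    def add(self, reference_id, kind, names):
        for name in names:
            self._by_name[name.lower()].add((kind, reference_id))

    def lookup(self, name, kind=None):
        hits = self._by_name.get(name.lower(), set())
        return sorted(h for h in hits if kind is None or h[0] == kind)


index = NameIndex()
# One object, multiple names
index.add("P04637", "protein",
          ["P53_HUMAN", "Cellular tumor antigen p53", "Tumor suppressor p53", "p53"])
# Multiple objects, one name ("gene:TP53" is a made-up identifier)
index.add("gene:TP53", "gene", ["TP53", "p53"])

unique = index.lookup("P53_HUMAN")
ambiguous = index.lookup("p53")
resolved = index.lookup("p53", kind="protein")
```

Here `unique` has a single hit, `ambiguous` returns both the gene and the protein, and restricting the lookup by kind resolves the ambiguity.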

Part 4 – Technical difficulties

•Lack of metadata - or metadata exists, but in unstructured form (e.g. notes) that is not computer readable

•External databases: No standard accession method
•Database versions: Updated vs. old data
•Data model: No unified model available
•Amount of data

Part 4 – Databases in Integrator

•The system includes the following data sources:
› Ensembl
› NCBI Taxonomy
› NCBI RefSeq Proteins
› UniProt/Swiss-Prot
› UniProt/TrEMBL
› InterPro
› Mammalian Phenotype Ontology
› IntAct
› KEGG
› Human Disease Ontology
› GO (Gene Ontology)
› Cell Ontology
› ChEBI
› Cytomer
› BRENDA Tissue Ontology
› PDB
› PubMed

•Current database:
› 2.5 million proteins
› 75 000 genes
› 98 000 transcripts
› 10 million connections in 144 000 pathways
› 1200 different species

Part 4 – Database Schema - Medicel Infomodel

•Performing efficient searches across databases is a major problem because the database structures are not unified
•Answer -> structuring the data into a unified schema
•Medicel Infomodel is the framework of the platform
•Explains how data is organized into tables and fields of the database
•A unified schema is indispensable for bringing different experimental data together
•Data is worth much more when it is compatible -> more likely to give rise to new knowledge

•Schema to model biology
•Divided into biological data and metadata
•Biological systems consist of interacting components
•Interactions change the amounts of the components
•Amounts of the components give the state of the system
•Pathways model these systems

Part 4 – Database Schema - Medicel Infomodel

•About 200 data tables constitute a relational database
•Tables define the attributes of objects and the relations of the objects to each other
•E.g. a gene can be annotated to a category and the category annotated to be part of another category
•Data in the tables is structured in rows and columns
› Table -> Object Class
› Row -> Object
› Column -> Property of Object
•Knowledge of the Infomodel is not required of every user
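
The mapping Table -> Object Class, Row -> Object, Column -> Property of Object, and the gene/category example, can be sketched with an in-memory SQLite database. The table and column names and the sample rows are illustrative assumptions; they are not the actual Medicel Infomodel tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each table is an object class, each row an object, each column a property
CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE category (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES category(id)  -- category is part of another category
);
CREATE TABLE gene_category (                   -- gene annotated to a category
    gene_id INTEGER REFERENCES gene(id),
    category_id INTEGER REFERENCES category(id)
);
-- Made-up sample data
INSERT INTO gene VALUES (1, 'geneA');
INSERT INTO category VALUES (1, 'enzyme', NULL);
INSERT INTO category VALUES (2, 'kinase', 1);
INSERT INTO gene_category VALUES (1, 2);
""")

# Follow the annotation from gene to category to parent category
rows = conn.execute("""
SELECT g.name, c.name, p.name
FROM gene g
JOIN gene_category gc ON gc.gene_id = g.id
JOIN category c ON c.id = gc.category_id
LEFT JOIN category p ON p.id = c.parent_id
""").fetchall()
```

Because every source is loaded into the same tables, one query of this kind can span data that originally came from different databases.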

Part 4 – Medicel Infomodel at high level

Component Data | System Data | State Data | Laboratory Data

(This is an abstract representation showing only a fraction of the Medicel Infomodel.)

Part 4 – What is Component Data

•Definitions of quantifiable components (e.g. protein, genome, gene, macromolecular complex, organism)
› Name is not a real definition
› Structural facts are concrete definitions that
  • can be detected in laboratory
  • can be compared by computer algorithms
› Component list (formula)
  • implies molecular mass and charge
› Patterns
  • Bonds between components
› Sequence
› Features
•Useful definitions can explain system behaviour

Part 4 – Where does component data come from?

•Population from databases
› e.g. UniProt (proteins) and Ensembl (genomes)
› The key is to identify “reference objects” -> one unique name which may have many database references
•Own components given as
› Individuals, e.g. patients examined
› Populations, e.g. any group of individuals like ‘the Finns’
› Organisms, e.g. genetically engineered microbe strains

Part 4 – What is system data

A system is an assemblage of inter-related elements comprising a unified whole (Wikipedia)

Location
•a named real biological system that can be identified
•a unique location needs to be created for each distinct biologically interesting context
•locations are related through common components
•for each Location, information is recorded about
› Environment
› Population
› Individual
› Organism
› Organ
› Tissue
› Cell type
› Cellular compartment

Locations are related through common components

(figure labels: input, output)

L[location1]: En[fermentor]
L[location2]: En[fermentor] Po[population] O[Saccharomyces cerevisiae] Ct[yeast_cell]
L[location3]: En[fermentor] ... Cc[nucleus]

Components

•Various kinds of components
› Genes, Transcripts, Proteins, Compounds, Macromolecular complexes...

› but also, at a higher level: Cell types, Individuals, Populations, Environments

• not limited to molecular systems

Interaction
•an event (or a process)
› typically, a biochemical event
•Components are connected to Interactions via Connections
› Different types of connections:
  • substrate (is consumed)
  • product (is produced)
  • control (is neither consumed nor produced, but affects)
  • outcome (is neither consumed nor produced, but is affected)

Example: Transcription

(figure labels: gene, transcript, transcription, connections)
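
The relation between components, interactions and typed connections can be sketched with a few Python classes. This is a minimal illustration, not the Infomodel implementation: the class names are hypothetical, and the roles assigned to gene and transcript in the transcription example are an assumption (the gene is treated as a control because it is not consumed).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(Enum):
    SUBSTRATE = "substrate"  # is consumed
    PRODUCT = "product"      # is produced
    CONTROL = "control"      # neither consumed nor produced, but affects
    OUTCOME = "outcome"      # neither consumed nor produced, but is affected


@dataclass(frozen=True)
class Connection:
    component: str
    role: Role


@dataclass
class Interaction:
    name: str
    connections: List[Connection] = field(default_factory=list)

    def connect(self, component: str, role: Role) -> None:
        self.connections.append(Connection(component, role))

    def components(self, role: Role) -> List[str]:
        return [c.component for c in self.connections if c.role is role]


transcription = Interaction("transcription")
transcription.connect("gene", Role.CONTROL)        # assumed role
transcription.connect("transcript", Role.PRODUCT)  # assumed role
```

Keeping the role on the connection rather than on the component lets the same component act as a product in one interaction and a substrate or control in another.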

Pathway

•a network model of one location
› a container for the components and interactions
•there can be multiple pathways for one location
› at different abstraction levels
› alternative models from different origins, creators, evidence

Part 4 – What is state data?

•State data describes quantitatively the state that a location (system) is currently in

› May quantify something about the location itself or about a component in the location

•Non-state data can be derived from state data

› E.g. p-values are quantitative but not state data

Part 4 – Infomodel for state data

(Class diagram; labels: state variable, variable, unit, component, location, pathway, state data point, sample, index, time of observation, free text description, value, x-coordinate, y-coordinate, z-coordinate; multiplicities 1, 0..1 and 0..*; groupings: Quantitative information, Biological information, Storing information)

Several data points per state variable – one state variable per data point
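
The one-to-many relation between a state variable and its data points can be sketched with Python dataclasses. The field names follow the labels of the diagram, but which fields belong to which class, and which are optional, is an assumption made for illustration; the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateVariable:
    variable: str                    # what is measured
    unit: str
    location: str                    # the system being described
    component: Optional[str] = None  # set when a component in the location is quantified
    data_points: List["StateDataPoint"] = field(default_factory=list)

    def add_point(self, value: float, **kwargs) -> "StateDataPoint":
        point = StateDataPoint(state_variable=self, value=value, **kwargs)
        self.data_points.append(point)
        return point


@dataclass
class StateDataPoint:
    # Exactly one state variable per data point; excluded from repr and
    # comparison to avoid following the back-reference
    state_variable: StateVariable = field(repr=False, compare=False)
    value: float
    sample: Optional[str] = None
    time_of_observation: Optional[str] = None
    free_text_description: Optional[str] = None


glucose = StateVariable(variable="concentration", unit="mmol/l",
                        location="location1", component="glucose")
glucose.add_point(5.1, sample="s1", time_of_observation="t0")
glucose.add_point(4.8, sample="s2", time_of_observation="t1")
```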
