bits: overview of important biological databases beyond sequences

Basic bioinformatics concepts, databases and tools

Module 4

Beyond the sequences

Dr. Joachim Jacob

http://www.bits.vib.be

Updated Nov 2011

http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf

Module 4 broadens our view

To understand life, we need not only sequences, but many other concepts

Bioinformatics also means storing and analyzing:

− Gene information: variations, isoforms, ...

− Expression data

− 3D protein structure data

− Interaction data

− Pathways and networks

“Storing all relevant biological data”

Schematic view II

[Figure: GeneA, GeneB and GeneC, each with sequence annotations, gene expression, pathway, structure, ... Diagram labels: analysis, primary database, other sequence databases, additional information sources, results.]

The indispensable databases

− Gene Ontology – structuring

− KEGG – biochemical pathways

− PDB – structure of proteins

− IntAct – interaction data

− dbSNP – database of genomic variation

− Expression sources – microarray data

http://www.geneontology.org/

http://www.pdb.org/pdb/home/home.do

http://www.ebi.ac.uk/intact/main.xhtml

http://www.ncbi.nlm.nih.gov/projects/SNP/

Gene Ontology structures the way we communicate about life

http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax.pdf

http://www.arabidopsis.org/help/tutorials/go1.jsp

Gene translation / Protein synthesis / Protein production



Gene Ontology structures life


Agreement on standardized keywords (often referred to as 'controlled vocabularies'), describing all natural processes in a hierarchical way (ontology).

Keywords are assigned to genes based on different kinds of evidence

Keywords are ordered in a hierarchical tree-like structure ('directed acyclic graphs')

Three GO 'trees' exist, describing:

"Biological Process"

"Cellular Component"

"Molecular Function"






A gene can be given different GO terms

Example, cytochrome c:

molecular function: oxidoreductase activity,

biological process: oxidative phosphorylation and induction of cell death,

cellular component: mitochondrial matrix and mitochondrial inner membrane.

In each tree, the terms are organised in a directed acyclic graph: a network with parent and child terms as nodes and the relationships between them as edges. Unlike in a strict tree, a term can have more than one parent.
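The parent/child structure described above can be sketched in a few lines of Python. This is only a toy illustration: the term names and their relationships are made up for the example and do not reproduce the actual GO hierarchy, and `PARENTS` and `ancestors` are hypothetical names.

```python
# Toy ontology: each term maps to its direct parents (child -> parents).
# Term names are illustrative only, not the actual GO hierarchy.
PARENTS = {
    "process": [],
    "metabolic process": ["process"],
    "cellular process": ["process"],
    # Two parents: this is what makes the structure a DAG and not a tree.
    "cellular metabolic process": ["metabolic process", "cellular process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found
```

A gene annotated with a child term is implicitly associated with all ancestors of that term, which is why walking up the graph is the basic operation behind most GO queries.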

Different evidence codes can assign a degree of confidence to the assignment

http://www.geneontology.org/GO.evidence.shtml

Evidence codes can be grouped by:

− Experimental (e.g. IDA – inferred from direct assay)

− Computational analysis

− Author statement

− Curator statement

− Inferred from electronic annotation (IEA)

If available, each annotation also has a reference
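As a minimal sketch of how evidence codes can be used in practice, the snippet below keeps only annotations backed by experimental evidence. The gene names and the tuple layout are hypothetical; the set of experimental codes follows the GO evidence code documentation linked above.

```python
# Experimental evidence codes as listed in the GO evidence documentation.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotations: (gene, GO term name, evidence code).
ANNOTATIONS = [
    ("geneA", "oxidoreductase activity", "IDA"),
    ("geneA", "oxidative phosphorylation", "IEA"),
    ("geneB", "mitochondrial matrix", "IMP"),
    ("geneC", "oxidoreductase activity", "TAS"),
]


def filter_by_evidence(annotations, allowed_codes):
    """Keep only annotations whose evidence code is in allowed_codes."""
    return [entry for entry in annotations if entry[2] in allowed_codes]


experimental_only = filter_by_evidence(ANNOTATIONS, EXPERIMENTAL_CODES)
```

Filtering out IEA annotations is a common first step when a higher degree of confidence is wanted, at the cost of losing a large share of the annotations.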


Gene Ontology structures all genes according to their biological significance

The GO structure and its terms can be browsed with AmiGO.

QuickGO from EBI offers some nice visualisations

Excellent GO wiki for all your questions

http://www.ebi.ac.uk/QuickGO/

http://wiki.geneontology.org/index.php/GO_FAQ

GO can be used to retrieve all genes (gene products) related to one specific term

You can search broadly, e.g. an AmiGO search for 'diabetes' leads to the following GO term

http://amigo.geneontology.org/


GO is also useful to analyze and compare different gene lists

Many tools that work with GO are listed on the GO website.

http://www.geneontology.org/GO.tools.shtml


Some things to know about GO

For analyses, one can make use of reduced GO sets, the so-called GO slims

– GO slims are subsets of the biologically more relevant GO terms (available per species)

– GO ontologies can be downloaded in .obo format

Not all information is captured by GO; some needs to be retrieved from other databases

– Metabolic pathways: KEGG, …

– Phenotypes/diseases

– Mapping files exist, e.g. kegg2go
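To give an idea of what the .obo format mentioned above looks like, here is a minimal sketch of a reader for `[Term]` stanzas. It only handles the `id`, `name` and `is_a` tags; the embedded example uses made-up `EX:` identifiers (not real GO terms), and `parse_obo` is a hypothetical helper, not part of any GO tool.

```python
# Small made-up OBO fragment; the EX: identifiers are not real GO terms.
EXAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: process

[Term]
id: EX:0000002
name: metabolic process
is_a: EX:0000001 ! process

[Typedef]
id: part_of
name: part of
"""


def parse_obo(text):
    """Collect id, name and is_a parents for every [Term] stanza."""
    terms = {}
    current = None  # dict for the stanza being read, None outside [Term]
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop the trailing "! comment"
            if key == "id":
                terms[value] = current
            elif key == "name":
                current["name"] = value
            elif key == "is_a":
                current["is_a"].append(value)
    return terms


terms = parse_obo(EXAMPLE_OBO)
```

For real analyses a dedicated OBO library is the safer choice, since the full format has many more tags (relationships, synonyms, obsolete flags) than this sketch covers.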

http://www.geneontology.org/GO.slims.shtml

http://www.geneontology.org/external2go/kegg2go


Biological pathways databases organise genes by molecular reactions

3 important databases on biological pathways

http://www.kegg.jp/

http://www.reactome.org/ - EBI

http://metacyc.org


Proteins with enzymatic function receive an Enzyme Commission (EC) number

http://www.chem.qmul.ac.uk/iubmb/enzyme/

EC 1 Oxidoreductases

EC 2 Transferases

EC 3 Hydrolases

EC 4 Lyases

EC 5 Isomerases

EC 6 Ligases
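An EC number is a hierarchical code of four dot-separated fields, and its first field is the top-level class listed above. A minimal sketch of looking up that class, using only the six classes named on this slide (`EC_CLASSES` and `ec_class` are hypothetical names):

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class name for an EC number such as 'EC 1.1.1.1'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()  # drop the optional 'EC' prefix
    first_field = text.split(".")[0]
    return EC_CLASSES[int(first_field)]
```

For example, `ec_class("EC 1.1.1.1")` looks up class 1 and `ec_class("3.4.21.4")` looks up class 3; a class number outside the table raises a `KeyError`.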


IntAct database contains interaction information of proteins

http://www.ebi.ac.uk/intact

Three types of interactions are stored:

− Protein-protein

− Protein-DNA

− Protein-small molecule


IntAct database represents all interactions as binary: caution!
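Why the caution? When an experiment detects a complex of several proteins, turning it into binary pairs requires a modelling choice. The sketch below contrasts two common conventions, the spoke model (bait paired with each prey) and the matrix model (every member paired with every other). The slide does not say which convention applies to a given record, so treat this purely as an illustration; the protein names and function names are hypothetical.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair the bait with each prey: n preys give n binary interactions."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Pair every member with every other member of the complex."""
    return list(combinations(sorted(members), 2))


# Hypothetical complex: bait A pulled down together with B, C and D.
spoke_pairs = spoke_expansion("A", ["B", "C", "D"])
matrix_pairs = matrix_expansion(["A", "B", "C", "D"])
```

The same experiment yields 3 pairs under the spoke model and 6 under the matrix model, and neither list proves that any two proteins touch each other directly.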

Interaction networks can be analysed on your computer using Cytoscape

Cytoscape training material on the BITS website

http://www.bits.vib.be/index.php?option=com_content&view=article&id=17204093:python-training-material&catid=84&Itemid=610

PDB hosts 3-dimensional structural data on molecules


PDB = Protein Data Bank

http://www.pdb.org/pdb/home/home.do

Only structures resolved through NMR and X-ray (or other accurate techniques)

Proteins, DNA, RNA, ligands

Understanding PDB data: tutorial


http://www.rcsb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/intro.html

PDB files can be read by a lot of different tools to display the structure

Every entry in PDB has its own accession number (four characters: one digit followed by three letters or digits)

The PDB file contains the 3D coordinates of every single atom in the structure, together with a measure of the variability of that position (the last two numeric columns: occupancy and temperature factor)
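The coordinate records in a PDB file are fixed-width text lines, so they can be read by slicing column ranges. Below is a minimal sketch for a single ATOM line, assuming the standard PDB column layout; the sample line is made up for illustration and `parse_atom_line` is a hypothetical helper.

```python
def parse_atom_line(line):
    """Extract the main fields from one fixed-width ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "atom": line[12:16].strip(),        # columns 13-16: atom name
        "residue": line[17:20].strip(),     # columns 18-20: residue name
        "chain": line[21],                  # column 22: chain identifier
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "b_factor": float(line[60:66]),     # columns 61-66: temperature factor
    }


# Made-up example record (values are illustrative, not from a real entry).
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom_line(SAMPLE)
```

Splitting on whitespace is tempting but unreliable here, because adjacent fields can run together without a separating space; slicing by column avoids that problem.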

http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-analysis-training&catid=81:training-pages&Itemid=190



Tools to visualize (and some to analyze structures) (see BITS wiki)

http://www.bits.vib.be/wiki/index.php/Protein_structure


Finding a structure for your protein sequence means searching for similarity

Homology modeling

Similarity on sequence level, projected to a structure:

− BLAST your query against the PDB database with cblast, or at ExPASy

− PSI-BLAST can detect sequences with similar structures (twilight zone!)

− If still no success: 3D-Jury (a meta approach, including fold recognition and local structure prediction)

Similarity on structural level, by aligning structures:

− VAST (structure)

− DALI (Distance mAtrix aLIgnment)

BITS training on protein structure analysis

Tools at EBI

http://www.ncbi.nlm.nih.gov/Structure/cblast/cblast.cgi?

http://expasy.org/tools/blast/

http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome

http://meta.bioinfo.pl/submit_wizard.pl

http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

http://ekhidna.biocenter.helsinki.fi/dali_server/

http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf

http://consurf.tau.ac.il/pe/protexpl/psbiores.htm


http://www.ebi.ac.uk/Tools/structural.html

Structural information is used to classify proteins

SCOP

Groups proteins based on evolutionary relationships, domain architecture and structural information.

CATH

Manually curated classification of protein domains

Database cross-references in PDB entry


http://scop.mrc-lmb.cam.ac.uk/scop/

http://www.cathdb.info/

dbSNP is a public-domain archive for simple genetic polymorphisms

Single Nucleotide Polymorphism database (NCBI)

Each dbSNP entry has a code rsxx (RefSNP) or ssxx (submitted SNP)

Types of variation stored:

− single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs)

− small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs)

− retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs)

Synchronized with new genome builds
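The rs/ss prefix makes dbSNP identifiers easy to recognise in text or in result tables. A minimal sketch, assuming identifiers consist of the prefix followed only by digits (`classify_dbsnp_id` is a hypothetical helper and the numbers in the usage note are arbitrary):

```python
import re

# Prefix 'rs' or 'ss' followed by one or more digits.
_DBSNP_ID = re.compile(r"^(rs|ss)(\d+)$", re.IGNORECASE)


def classify_dbsnp_id(identifier):
    """Return (record kind, numeric part) or None if the text is not a dbSNP id."""
    match = _DBSNP_ID.match(identifier.strip())
    if match is None:
        return None
    kind = "RefSNP" if match.group(1).lower() == "rs" else "submitted SNP"
    return kind, int(match.group(2))
```

For instance, `classify_dbsnp_id("rs12345")` reports a RefSNP record and `classify_dbsnp_id("ss678")` a submitted SNP, while any other text returns `None`.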

http://www.ncbi.nlm.nih.gov/projects/SNP/

Expression data can be sequence-based or hybridisation-based

Sequence-based (ESTs - RNA seq - SAGE)

Digital gene expression/northern

Microarray databases (hybridisation-based):

GEO: Gene Expression Omnibus (NCBI)

− Platform: GPLxxxxxxx

− Experiment: GSExxxxxx (= several samples)

− Sample: GSMxxxxxxxx

− Some experiments are curated: GDSxxxxx (online analysis possible)

ArrayExpress (EBI)
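The GEO accession prefixes listed above tell you what kind of record you are looking at. A small sketch that maps an accession to its record type, assuming an accession is one of the four prefixes followed by digits (`geo_record_type` is a hypothetical helper and the accession numbers in the usage note are arbitrary):

```python
# Prefixes and record types as listed on the slide.
GEO_PREFIXES = {
    "GPL": "platform",
    "GSE": "experiment (several samples)",
    "GSM": "sample",
    "GDS": "curated experiment",
}


def geo_record_type(accession):
    """Return the GEO record type for an accession, or None if unrecognised."""
    text = accession.strip().upper()
    prefix, number = text[:3], text[3:]
    if prefix in GEO_PREFIXES and number.isdigit():
        return GEO_PREFIXES[prefix]
    return None
```

For example, `geo_record_type("GSM1234")` identifies a sample and `geo_record_type("GPL96")` a platform; anything else returns `None`.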

http://www.ncbi.nlm.nih.gov/geo/

http://www.ebi.ac.uk/arrayexpress/

Example of expression data at GEO

Example at ArrayExpress

Entrez interconnects the databases at NCBI for easy querying

− UniGene: sequences grouped by gene

− PopSet: sequence alignments for population studies and phylogeny

− Structure: 3D structures (PDB)

− Genome: genomic maps of chromosomes and plasmids

− UniSTS (Sequence Tagged Sites)

− PubMed: literature abstracts (MEDLINE, …)

− OMIM (Online Mendelian Inheritance in Man): literature reviews

− MeSH (Medical Subject Headings): keywords

− Taxonomy
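Besides the web interface, the Entrez databases can also be queried programmatically through NCBI's E-utilities. The slides do not cover this, so the sketch below is an assumption added for illustration: it only builds an ESearch query URL and does not send any request. The helper name `esearch_url` and the default `retmax` value are hypothetical choices.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service (assumption: not from the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database and one search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


url = esearch_url("pubmed", "cytochrome c", retmax=5)
```

Changing the `db` value switches the search to another Entrez database with the same query term, which is the programmatic counterpart of the cross-database search that Entrez offers on the web.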

Finding relevant data

Summary of the most important links to discover everything you need ...

Protein data

InterPro (heavily integrated with EBI resources)

http://www.interpro.org

Gene data

Entrez at NCBI: 'Entrez Gene'

http://www.ncbi.nlm.nih.gov/Entrez/

EB-eye Search at EBI: excellent for cross-species searches

http://www.ebi.ac.uk/ebisearch/


Hold your horses!

Phew, where do I put all this?

Bioinformatics is all about different data, as versatile as life itself

Due to the strong cross-references between different databases, new databases and relevant information are rapidly integrated into existing databases.

You can discover them by taking the time to read the entries.

New tools are emerging every day to enable you to browse all data sources...

BioGPS, all in one window!

http://biogps.gnf.org/#goto=welcome


Integrative resources are increasingly being organised on a species basis

EMAGE database of in situ gene expression in mouse

OMIM Database of diseases in man

Websites providing an interface to integrate all this data are increasingly important

Often organized on a species basis:

− TAIR

− FlyBase

− WormBase

http://www.emouseatlas.org/emage/home.php

http://www.ncbi.nlm.nih.gov/omim?term=smoking

http://www.arabidopsis.org/

http://flybase.org/

http://www.wormbase.org/

Organizing biological information by species

By species, why?

There is one biological information resource which stays more or less unchanged per species ...
