bits: overview of important biological databases beyond sequences
DESCRIPTION
Module 4: Other relevant biological data sources beyond sequences. Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
TRANSCRIPT
Basic bioinformatics concepts, databases and tools
Module 4
Beyond the sequences
Dr. Joachim Jacob
http://www.bits.vib.be
Updated Nov 2011
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf
Module 4 broadens our view
To understand life, we need not only sequences, but many other concepts
Bioinformatics is also about storing and analyzing:
− Gene information: variations, isoforms, ...
− Expression data
− 3D protein structure data
− Interaction data
− Pathways and networks
“Storing all relevant biological data”
Schematic view II
GeneA sequence annotations – gene expr – pathway – struct,...
GeneB sequence annotations – gene expr – pathway – struct,...
GeneC sequence annotations – gene expr – pathway – struct,...
analysis
Primary database
Other sequence databases
results
Additional information sources
results
The indispensable databases
− Gene Ontology – structuring
− KEGG – biochemical pathways
− PDB – structure of proteins
− IntAct – interaction data
− dbSNP – database of genomic variation
− Expression sources – microarray data
Gene Ontology structures the way we communicate about life
http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax.pdf
http://www.arabidopsis.org/help/tutorials/go1.jsp
Gene translation
Protein synthesis
Protein production
Gene Ontology structures life
http://www.geneontology.org/
Agreement on standardized keywords (often referred to as 'controlled vocabularies'), describing all natural processes in a hierarchical way (ontology).
Keywords are assigned to genes based on different types of evidence
Keywords are ordered in a hierarchical, tree-like structure ('directed acyclic graphs')
Three GO 'trees' exist, describing:
"Biological Process"
"Cellular Component"
"Molecular Function"
A gene can be given different GO terms
Example, cytochrome c:
molecular function: oxidoreductase activity,
biological process: oxidative phosphorylation and induction of cell death,
cellular component: mitochondrial matrix and mitochondrial inner membrane.
In each tree, the terms are organised in a directed acyclic graph: a network consisting of parents and child-terms (as nodes) and lines between them as relationships.
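A directed acyclic graph differs from a plain tree in that a term may have more than one parent. The sketch below stores a toy fragment as a mapping from each term to its parents and walks upward to collect all ancestors. The parent links are simplified for illustration and are not copied from the real ontology; `PARENTS` and `ancestors` are hypothetical names.

```python
# Toy fragment of a GO-like graph: each term maps to its parent terms.
# The links are simplified for illustration, not taken from the real ontology.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "oxidative phosphorylation": ["cellular metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("oxidative phosphorylation")))
```

A gene annotated with a child term is implicitly annotated with all of its ancestors, which is why this upward walk is the basic operation behind most GO queries.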
Different evidence codes can assign a degree of confidence to the assignment
http://www.geneontology.org/GO.evidence.shtml
Evidence codes can be grouped by:
− Experimental (e.g. IDA – inferred from direct assay)
− Computational analysis
− Author statement
− Curator statement
− Inferred from electronic annotation (IEA)
If available, each annotation also has a reference
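One practical use of evidence codes is filtering annotations, for instance keeping only experimentally supported ones. The sketch below groups a selection of codes into the categories listed above. The code lists are not exhaustive, and the annotations (`geneA`, `geneB`) are hypothetical.

```python
# A selection of GO evidence codes per group (not exhaustive).
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author": {"TAS", "NAS"},
    "curator": {"IC", "ND"},
    "electronic": {"IEA"},
}


def evidence_group(code):
    """Return the group an evidence code belongs to, or 'unknown'."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return "unknown"


# Hypothetical annotations: (gene, GO term name, evidence code).
annotations = [
    ("geneA", "oxidoreductase activity", "IDA"),
    ("geneA", "oxidative phosphorylation", "IEA"),
    ("geneB", "mitochondrial matrix", "TAS"),
]

# Keep only annotations backed by an experiment.
experimental_only = [a for a in annotations if evidence_group(a[2]) == "experimental"]
print(experimental_only)
```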
Gene Ontology structures all genes according to their biological significance
The GO structure and the terms can be browsed with AmiGO.
QuickGO from EBI offers some nice visualisations
Excellent GO-wiki for all your questions
GO can be used to retrieve all gene (products) related to one specific term
You can search broadly, e.g. an AmiGO search for 'diabetes' leads to the following GO term
http://amigo.geneontology.org/
GO is also useful to analyze and compare different gene lists
A lot of GO tools are available on the website.
http://www.geneontology.org/GO.tools.shtml
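Many of these tools compare a gene list against a background by asking whether a GO term occurs more often than expected by chance. A minimal sketch of one common approach, a one-sided hypergeometric test, is given below. It is not the method of any specific tool, the function name and the example numbers are hypothetical, and real tools also correct for multiple testing.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 1000 genes in the genome, 50 carry the term,
# and 5 of the 20 genes in our list carry it as well.
print(enrichment_p_value(population=1000, annotated=50, sample=20, hits=5))
```

A small p-value suggests the term is over-represented in the list compared with the background.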
Some things to know about GO
For analyses, one can make use of reduced GO sets, the so-called GO slims
– GO slims are a subset of biologically more relevant GO terms (available per species)
– GO ontologies can be downloaded in .obo format.
Not all information is captured by GO; some needs to be retrieved from other databases
– Metabolic pathways: KEGG, …
– Phenotype/diseases
• Mapping files exist, e.g. kegg2go
http://www.geneontology.org/GO.slims.shtml
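The .obo format is plain text: each term is a `[Term]` stanza of `key: value` lines, and `is_a` lines point to parent terms. The sketch below reads such stanzas into a dictionary. The two stanzas use made-up identifiers (`EX:...`) and names, and the parser handles only the few tags shown, so treat it as an illustration rather than a full OBO reader.

```python
# Two made-up stanzas in OBO style; identifiers and names are invented.
OBO_TEXT = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example parent process
namespace: biological_process

[Term]
id: EX:0000002
name: example child process
namespace: biological_process
is_a: EX:0000001 ! example parent process
"""


def parse_obo_terms(text):
    """Collect [Term] stanzas into {term_id: {tag: value, 'is_a': [...]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key == "id":
                terms[value] = current
            elif key == "is_a":
                # Drop the human-readable comment after " ! ".
                current["is_a"].append(value.split(" ! ")[0])
            else:
                current[key] = value
    return terms


terms = parse_obo_terms(OBO_TEXT)
print(terms["EX:0000002"]["name"], "is_a", terms["EX:0000002"]["is_a"])
```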
Biological pathways databases organise genes by molecular reactions
3 important databases on biological pathways
http://www.kegg.jp/
http://www.reactome.org/ - EBI
http://metacyc.org
Proteins with enzymatic function receive an Enzyme Commission (EC) number
http://www.chem.qmul.ac.uk/iubmb/enzyme/
EC 6 Ligases
EC 5 Isomerases
EC 4 Lyases
EC 3 Hydrolases
EC 2 Transferases
EC 1 Oxidoreductases
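An EC number has four dot-separated fields, and the first field is the top-level class listed above. A small sketch that maps an EC number to its class follows; the helper name `ec_class` is hypothetical.

```python
# The six top-level EC classes listed above.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class for an EC number such as 'EC 1.1.1.1'."""
    digits = ec_number.upper().replace("EC", "").strip()
    top_level = int(digits.split(".")[0])
    return EC_CLASSES.get(top_level, "unknown class")


print(ec_class("EC 1.1.1.1"))  # alcohol dehydrogenase
print(ec_class("3.4.21.4"))    # trypsin
```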
IntAct database contains interaction information of proteins
http://www.ebi.ac.uk/intact
Three types of interactions stored:
− Protein-protein
− Protein-DNA
− Protein-small molecule
IntAct database represents all interactions as binary: caution!
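One reason for caution: an experiment that detects a complex of several proteins has to be expanded into pairs before it can be shown as binary interactions, and the result depends on the expansion model. The sketch below contrasts two commonly described models, 'spoke' (bait to each prey) and 'matrix' (all against all). The protein names are hypothetical, and which expansion applies to a given record should be checked in the entry itself.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair the bait with each prey; preys are not linked to each other."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Pair every member of the complex with every other member."""
    return list(combinations(members, 2))


# Hypothetical pull-down: bait P1 co-purifies with P2, P3 and P4.
bait, preys = "P1", ["P2", "P3", "P4"]

print(spoke_expansion(bait, preys))
print(matrix_expansion([bait] + preys))
```

The same experiment yields 3 pairs under the spoke model and 6 under the matrix model, and not every pair necessarily reflects direct physical contact.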
Interaction networks can be analysed on your computer using Cytoscape
Cytoscape training material on the BITS website
PDB hosts 3-dimensional structural data on molecules
PDB = Protein Data Bank
http://www.pdb.org/pdb/home/home.do
Only structures resolved through NMR and X-ray (or other accurate techniques)
Proteins, DNA, RNA, ligands
Understanding PDB data: tutorial
PDB files can be read by a lot of different tools to display the structure
Every entry in PDB has its own PDB accession number (four characters: one digit followed by three letters or digits)
The PDB file contains 3D coordinates for every single atom in the structure, together with the variability of that position (the last two numeric columns: occupancy and temperature factor, or B-factor)
http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-analysis-training&catid=81:training-pages&Itemid=190
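Atom lines in a PDB file are fixed-width records, so each value is read from a fixed column range rather than by splitting on whitespace. The sketch below builds one ATOM record and slices out the coordinates, occupancy and B-factor. The atom values are invented for illustration, and `parse_atom_record` is a hypothetical helper that reads only a subset of the fields.

```python
# Build one ATOM record with the standard fixed-width layout.
# The atom itself (a nitrogen of a methionine in chain A) is invented.
sample_line = (
    f"{'ATOM':<6}{1:>5} {' N  '}{' '}{'MET'} {'A'}{1:>4}{' '}   "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
    f"{1.00:6.2f}{9.67:6.2f}{'':10}{'N':>2}"
)


def parse_atom_record(line):
    """Read selected fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


print(sample_line)
print(parse_atom_record(sample_line))
```

For real work, the visualisation and analysis tools mentioned in this module read these files for you; the sketch only shows where the numbers sit.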
Tools to visualize (and some to analyze) structures: see the BITS wiki
http://www.bits.vib.be/wiki/index.php/Protein_structure
Finding a structure for your protein sequence means searching for similarity
Homology modeling
Similarity on sequence level, projected onto a structure:
− BLAST your query against the PDB database with cblast, or at ExPASy
− PSI-BLAST: can detect sequences with similar structures (twilight zone!)
− If still no success: 3D-jury (a meta approach, including fold recognition and local structure prediction)
Similarity on structural level, aligning structures:
− VAST (structure)
− DALI (Distance mAtrix aLIgnment)
http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf
http://consurf.tau.ac.il/pe/protexpl/psbiores.htm
BITS training on protein structure analysis
Tools at EBI
Structural information is used to classify proteins
SCOP
Groups proteins based on evolutionary relationships, domain architecture and structural information.
CATH
Manually curated classification of protein domains
Database cross-references in PDB entry
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.cathdb.info/
dbSNP is a public-domain archive for simple genetic polymorphisms
Single Nucleotide Polymorphism database (NCBI)
Each dbSNP entry has a code rsxx (RefSNP) or ssxx (submitted SNP)
It contains:
− single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs),
− small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs),
− retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs).
Synchronized with new genome builds
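The rs/ss prefix tells you what kind of dbSNP record an identifier points to. A small sketch that recognises both forms follows; the identifiers in the example are made up and `classify_dbsnp_id` is a hypothetical helper.

```python
import re

# "rs" or "ss" followed by digits, case-insensitive.
ID_PATTERN = re.compile(r"^(rs|ss)(\d+)$", re.IGNORECASE)
KIND = {"rs": "RefSNP", "ss": "submitted SNP"}


def classify_dbsnp_id(identifier):
    """Return (record kind, number) or None if the text is not a dbSNP code."""
    match = ID_PATTERN.match(identifier.strip())
    if match is None:
        return None
    prefix, number = match.groups()
    return KIND[prefix.lower()], int(number)


# Made-up identifiers for illustration.
for text in ["rs123456", "ss7654321", "GSE100"]:
    print(text, "->", classify_dbsnp_id(text))
```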
Expression data can be sequence-based or hybridisation-based
Sequence-based (ESTs - RNA seq - SAGE)
Digital gene expression/northern
Microarray databases – hybridisation based:
GEO: Gene Expression Omnibus (NCBI)
− Platform: GPLxxxxxxx
− Experiment: GSExxxxxx (= several samples)
− Sample: GSMxxxxxxxx
− Some experiments are curated: GDSxxxxx (online analysis possible)
ArrayExpress (EBI)
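The GEO accession prefixes listed above make it easy to tell record types apart in a script. The sketch below maps a prefix to its record type; the accessions in the example are made up and `geo_record_type` is a hypothetical helper.

```python
import re

# Prefix to record type, following the list above.
GEO_TYPES = {
    "GPL": "platform",
    "GSE": "experiment (several samples)",
    "GSM": "sample",
    "GDS": "curated dataset",
}


def geo_record_type(accession):
    """Return the GEO record type for an accession, or None if unrecognised."""
    match = re.fullmatch(r"(GPL|GSE|GSM|GDS)(\d+)", accession.strip().upper())
    return GEO_TYPES[match.group(1)] if match else None


# Made-up accessions for illustration.
for accession in ["GPL570", "GSE12345", "GSM999", "GDS42", "E-MTAB-1"]:
    print(accession, "->", geo_record_type(accession))
```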
Example of expression data at GEO
Example at ArrayExpress
Entrez interconnects the databases at NCBI for easy querying
− UniGene: sequences grouped by gene
− PopSet: sequence alignments for population studies and phylogeny
− Structure: 3D structures (PDB)
− Genome: genomic maps of chromosomes and plasmids
− UniSTS (Sequence Tagged Sites)
− PubMed: literature abstracts (MEDLINE, ...)
− OMIM (Online Mendelian Inheritance in Man): literature reviews
− MeSH (Medical Subject Headings): keywords
− Taxonomy
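Besides the web pages, NCBI exposes Entrez to scripts through the E-utilities. The sketch below only builds an ESearch query URL and does not contact the server. The search term is an example, `esearch_url` is a hypothetical helper, and the NCBI usage guidelines apply before sending requests in bulk.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, retmax=20):
    """Build an ESearch URL for an Entrez database and query term."""
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}?{urlencode(params)}"


# Example query: the human cytochrome c gene in Entrez Gene.
url = esearch_url("gene", "CYCS[Gene Name] AND human[Organism]")
print(url)
```

Changing the `database` argument (for example to "pubmed" or "structure") queries another Entrez database with the same call.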
Finding relevant data
Summarizing most important links to discover everything you need ...
Protein data
InterPro (heavily integrated with EBI resources)
http://www.interpro.org
Gene data
Entrez at NCBI: 'Entrez Gene'
http://www.ncbi.nlm.nih.gov/Entrez/
EB-eye search at EBI: excellent for cross-species
http://www.ebi.ac.uk/ebisearch/
Hold back your horses!
Phew, where do I place this all?
Bioinformatics is all about different data, as versatile as life itself
Due to the strong cross-references between different databases, new databases and relevant info are rapidly integrated in existing databases.
You can discover them by taking time to read the entries.
New tools are emerging every day to enable you to browse all data sources...
BioGPS, all in one window!
Integrative resources are increasingly being organised on a species basis
EMAGE database of in situ gene expression in mouse
OMIM Database of diseases in man
Websites providing an interface to integrate all this data are increasingly important
Often organized on a species basis:
− TAIR
− Flybase
− Wormbase
Organizing biological information by species
By species, why?
There is one biological information resource which stays
more or less unchanged per species ...