Sequencing the World of Possibilities for Energy & Environment

MGM workshop. 19 Oct 2010

Information Sources for Genomics

Konstantinos Mavrommatis

Genome Biology Program

[email protected]

Databases

Databases used for the analysis of biological molecules.

Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.

Why bother?
  Store information.
  Organize data.
  Predict features (genes, functions ...).
  Predict the functional role of a feature (annotation).
  Understand relationships (metabolic reconstruction).

Overview

Sequence databases
  Primary (contain "raw" data)
    Nucleotide
    Protein
  Secondary (processed information)
    Genes
    Proteins

Classification databases
  Sequence classification
  Function classification
  Other methods

Other specialized databases

Primary nucleotide databases

EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)

Archive containing all sequences from:
  genome projects
  sequencing centers
  individual scientists
  patent offices

The sequences are exchanged between the three centers on a daily basis.

Database is doubling every 10 months.

Sequences from >140,000 different species.

1400 new species added every month.

Year  Base pairs      Sequences
2004  44,575,745,176  40,604,319
2005  56,037,734,462  52,016,762
2006  69,019,290,705  64,893,747
2007  83,874,179,730  80,388,382
2008  99,116,431,942  98,868,465
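The growth implied by the table can be checked with a little arithmetic. The sketch below is not from the slides: it assumes constant exponential growth between the first and last rows, and the function name is hypothetical.

```python
import math

# Base-pair totals copied from the first and last rows of the table.
base_pairs = {
    2004: 44_575_745_176,
    2008: 99_116_431_942,
}

def doubling_time_years(y0, n0, y1, n1):
    """Doubling time, assuming a constant exponential growth rate."""
    rate = math.log(n1 / n0) / (y1 - y0)  # growth rate per year
    return math.log(2) / rate

t = doubling_time_years(2004, base_pairs[2004], 2008, base_pairs[2008])
print(f"Implied doubling time: {t:.1f} years ({t * 12:.0f} months)")
```

On these rows alone the estimate comes out at roughly three and a half years, so the 10-month figure presumably refers to a different period or a broader set of records; that reading is an assumption, not something the slides state.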

Primary protein sequence databases

Contain coding sequences derived from the translation of nucleotide sequences.

GenBank
  Valid translations (CDS) from nucleotide GenBank entries.

UniProtKB/TrEMBL (1996)
  Automatic CDS translations from EMBL.
  TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.
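As a rough illustration of what "translation of a CDS" means, the sketch below translates a coding sequence with the standard genetic code. It is not the pipeline used by GenBank or TrEMBL; the function name and the example sequence are made up for illustration, and alternative genetic codes and partial codons are ignored.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate a CDS in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate_cds("ATGGCCATTGTAATGGGCCGCTGA"))
```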

Errors in databases

There are a lot of errors in the primary sequence databases:

In the sequences themselves:
  Sequencing errors.
  Cloning vector sequences.

In the annotations, the free submission of entries results in:
  Inaccuracies, omissions, and even mistakes.
  Inconsistencies between some fields.

Redundancy

Redundancy is a major problem.

Entries are partially or entirely duplicated:

e.g. 20% of vertebrate sequences in GenBank.

[Figure: partial and complete sequence duplications]
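A minimal sketch of how complete and contained duplicates might be collapsed. This is an illustration only: it compares exact strings, ignores reverse complements and near-identical sequences, and the function name and example entries are hypothetical. Real non-redundant sets are usually built with dedicated clustering tools.

```python
def remove_redundant(sequences):
    """Drop exact duplicates and sequences fully contained in a longer one."""
    # Longest first (ties broken alphabetically) so that a contained
    # fragment always meets its container in `kept`.
    unique = sorted({s.upper() for s in sequences}, key=lambda s: (-len(s), s))
    kept = []
    for seq in unique:
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept

entries = ["ATTGACTA", "TTGACA", "attgacta", "TGACTA", "TATAGCCG"]
print(remove_redundant(entries))
```

Here "attgacta" is a complete duplicate and "TGACTA" a partial one (it is contained in "ATTGACTA"), so both are dropped.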

NCBI Derivative Sequence Data

[Figure: diagram with the labels Labs, GenBank, Curators, Algorithms, UniGene, RefSeq, and Genome Assembly]

RefSeq

Curated transcripts and proteins, reviewed by NCBI staff.

Model transcripts and proteins, generated by computer algorithms.

Assembled genomic regions (contigs).

Chromosome records.
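These record types can usually be told apart by the accession prefix. The sketch below maps a few common RefSeq prefixes onto the categories listed above. The prefixes follow NCBI's accession conventions, but pairing them with these four categories is a simplification that does not come from the slides, and the function name is hypothetical.

```python
# A few common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NM_": "transcript (mRNA), curated category",
    "NP_": "protein, curated category",
    "XM_": "model transcript (mRNA), computationally predicted",
    "XP_": "model protein, computationally predicted",
    "NT_": "assembled genomic region (contig)",
    "NC_": "chromosome or other complete genomic molecule",
}

def describe_accession(accession):
    """Return the record category suggested by the accession prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not covered by this sketch")

print(describe_accession("NC_000913.3"))
```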

Secondary protein databases

UniProt/SWISS-PROT (1986) (http://ca.expasy.org/spro)

A curated protein sequence database with:
  a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.)
  a minimal level of redundancy
  a high level of integration with other databases

Classification databases

Groups (families/clusters) of proteins based on…
  Overall sequence similarity.
  Local sequence similarity.
  Presence / absence of specific features (active site, signal peptides…).
  Structural similarity.
  ...

These groups contain proteins with similar properties.
  Specific function, enzymatic activity.
  General function.
  Evolutionary relationship.

Overall sequence similarity

Clusters of orthologous groups (COGs)

COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages. Each cluster has representatives of at least 3 lineages.

A function (specific or broad) has been assigned to each COG.

http://www.ncbi.nlm.nih.gov/COG/
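The "at least 3 lineages" rule can be sketched as a simple filter over cluster membership. This is only an illustration of that one criterion, not the COG construction procedure; the cluster IDs, genome names, and lineage labels are hypothetical.

```python
from collections import defaultdict

def clusters_with_min_lineages(members, lineage_of, min_lineages=3):
    """Keep clusters whose member genomes span at least `min_lineages` lineages.

    members: iterable of (cluster_id, genome) pairs.
    lineage_of: dict mapping each genome to its phylogenetic lineage.
    """
    lineages = defaultdict(set)
    for cluster_id, genome in members:
        lineages[cluster_id].add(lineage_of[genome])
    return {c for c, seen in lineages.items() if len(seen) >= min_lineages}

# Hypothetical data: four genomes from three lineages.
lineage_of = {"g1": "lineageA", "g2": "lineageA", "g3": "lineageB", "g4": "lineageC"}
members = [
    ("cluster1", "g1"), ("cluster1", "g3"), ("cluster1", "g4"),  # 3 lineages
    ("cluster2", "g1"), ("cluster2", "g2"), ("cluster2", "g3"),  # 2 lineages
]
print(clusters_with_min_lineages(members, lineage_of))
```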

Profiles & Pfam

One method for classifying proteins into groups exploits similarities between regions of the sequences, which carry valuable information (domains/profiles).

These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.
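To make the idea of a profile concrete, the sketch below builds a simple position-specific log-odds matrix from an ungapped alignment and scores candidate segments against it. It is a deliberately simplified stand-in, not the HMM machinery used by Pfam: the uniform background, the pseudocount, the function names, and the aligned motif are all assumptions chosen for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one log-odds column per alignment position (uniform background)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def score(profile, segment):
    """Sum the per-position log-odds of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# Hypothetical ungapped alignment of a short conserved region.
alignment = ["GKSGT", "GKTGS", "GKSAT", "GRSGT"]
profile = build_profile(alignment)
print(score(profile, "GKSGT"), score(profile, "WWWWW"))
```

A segment that matches the conserved positions scores well above an unrelated one, even though no single aligned sequence has to be identical to it.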

Region similarity

Pfam

http://pfam.sanger.ac.uk

HMMs of protein alignments: local (for domains) or global (covering the whole protein).

TIGRfam

Full length alignments.

Domain alignments.

Equivalogs: families of proteins with specific function.

Superfamilies: families of homologous genes.

HMMs

http://www.tigr.org/TIGRFAMs/

KEGG orthology

Composite pattern databases

To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro.

Release 28.0 (Aug 10) contains 20,837 entries.

Central annotation resource, with pointers to its satellite dbs.

http://www.ebi.ac.uk/interpro/

* It is up to the user to decide if the annotation is correct *

ENZYME

http://ca.expasy.org/enzyme/

KEGG

Contains information about biochemical pathways and protein interactions.

http://www.kegg.com

Sequencing projects

GOLD

Information for ongoing and finished (meta)genomic projects.

Information about the metadata of genomes and metagenomic samples.

http://www.genomesonline.org

Literature search

PubMed

http://www.ncbi.nlm.nih.gov/Pubmed

Specialized databases

There is a large number of databases devoted to specific organisms.

For some model organisms, several systems often exist side by side.

These databases are associated with sequencing or mapping projects.

Other specialized databases

Signal transduction, regulation, protein-protein interactions
  TRANSFAC (Transcription Factor database)
  BRITE (Biomolecular Relations in Information Transmission and Expression database)
  DIP (Database of Interacting Proteins)
  BIND (Biomolecular Interaction Network database)
  BioCarta

Biochemical pathways
  KLOTHO (Biochemical Compounds Declarative database)
  BRENDA (enzyme information system)
  LIGAND (similar to Enzyme but with more information for substrates)

Gene order and co-occurrence
  STRING

3D structures
  PDB (Protein Data Bank)
  MMDB (Molecular Modelling Data Base)
  NRL_3D (Non-Redundant Library of 3D Structures)
  SCOP (Structural Classification of Proteins)

Polymorphism
  ALFRED (Allele Frequency Database)

Molecular interactions
  DIP (Database of Interacting Proteins)
  BIND (Biomolecular Interaction Network Database)

Gene expression
  GXD (Mouse Gene Expression Database)
  The Stanford Microarray Database

Mapping
  GDB (Genome Data Base)
  EMG (Encyclopedia of Mouse Genome)
  MGD (Mouse Genome Database)
  INE (Integrated Rice Genome Explorer)

Protein quantification
  SWISS-2DPAGE
  PDD (Protein Disease Database)
  Sub2D (B. subtilis 2D Protein Index)

List of databases

http://www.oxfordjournals.org/nar/database/c

Databanks interconnection

[Figure: network of cross-references among databanks, with nodes SWISS-PROT, ENZYME, PDB, HSSP, SWISSNEW, YPDREF, YPD, PDBFINDER, ALI, DSSP, FSSP, NRL_3D, PMD, PIR, ProtFam, FlyGene, TFSITE, TFACTOR, EMBL, TrEMBL, ECDC, TrEMBLNEW, EMNEW, EPD, GenBank, MOLPROBE, OMIM, MIMMAP, REBASE, PROSITE, ProDom, PROSITEDOC, Blocks, SWISSDOM]

Not all databases are updated regularly. Changes to an annotation in one database are not automatically reflected in the others.
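One practical consequence is that the same protein can carry different annotations in different resources. The sketch below shows one way such disagreements might be flagged once annotations have been collected locally. The record layout, database labels, protein IDs, and annotations are all hypothetical, and a real comparison would need to normalize synonyms rather than compare strings.

```python
def find_conflicts(records):
    """Return proteins whose annotation differs between databases.

    records: iterable of (database, protein_id, annotation) tuples.
    """
    by_protein = {}
    for database, protein_id, annotation in records:
        normalized = annotation.strip().lower()
        by_protein.setdefault(protein_id, {})[database] = normalized
    return {
        protein_id: annotations
        for protein_id, annotations in by_protein.items()
        if len(set(annotations.values())) > 1
    }

# Hypothetical records collected from two resources.
records = [
    ("dbA", "prot1", "Alcohol dehydrogenase"),
    ("dbB", "prot1", "alcohol dehydrogenase "),
    ("dbA", "prot2", "hypothetical protein"),
    ("dbB", "prot2", "ABC transporter permease"),
]
print(find_conflicts(records))
```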

Concluding remarks

We have main archives (GenBank), curated databases (RefSeq, SWISS-PROT), protein classification databases (COG, Pfam), and many, many more…

They help predict the function, or the network of functions.

Systems are required that integrate the information from several databases and allow data to be visualized and handled in an intuitive way.

Thank you for your attention.