Post on 21-Jan-2016

EBI is an Outstation of the European Molecular Biology Laboratory.

UniProtKB

Sandra Orchard

Importance of reference protein sequence databases

• Completeness and minimal redundancy

A non-redundant protein sequence database with maximal coverage, including splice isoforms, disease variants and PTMs.

Low degree of redundancy for facilitating peptide assignments

• Stability and consistency

Stable identifiers and consistent nomenclature

Databases change constantly because of the substantial work done to improve their completeness and the quality of sequence annotation

• High quality protein annotation

Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources

Summary of protein sequence databases

Database | Description | Species

UniProtKB | Expertly curated section (UniProtKB/Swiss-Prot) and computer-annotated section (UniProtKB/TrEMBL); minimum level of redundancy; high level of integration with other databases; stable identifiers; diversity of sources including large-scale genomics, small-scale cloning and sequencing, protein sequencing, PDB, predicted sequences from Ensembl and RefSeq | Many

UniRef100 | Assembled from UniProtKB, Ensembl and RefSeq; merges 100% identical sequences; stable identifiers | Many

Ensembl | Predictions using automated genome annotation pipeline; explicitly linked to nucleotide and protein sequences; stable reference; merges its annotations with Vega annotations at transcript level; extensive quality checks to remove erroneous gene models; high level of integration with other databases | Over 50 eukaryotic genomes; Ensembl Genomes: Metazoa, Plants and Fungi, Protists, Bacteria and Archaea

RefSeq | NCBI creates from existing data; ongoing curation; non-redundant; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases | Limited to fully sequenced organisms

Entrez protein (NCBInr) | Assembled from GenBank and RefSeq coding sequence translations and UniProtKB; annotations extracted from source curated databases; high degree of sequence redundancy | Many

Updated from Nesvizhskii, A. I., and Aebersold, R. (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4, 1419–1440.
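The "merges 100% identical sequences" step listed for UniRef100 can be pictured with a small sketch. The identifiers, sequences and function name below are invented for illustration, and this is not how UniRef is actually built; it only groups records whose sequences are exactly identical.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (identifier, sequence) pairs whose sequences are 100% identical."""
    clusters = defaultdict(list)
    for identifier, sequence in records:
        # Normalise case and surrounding whitespace so that only real
        # residue differences keep two records apart.
        clusters[sequence.strip().upper()].append(identifier)
    return dict(clusters)


# Hypothetical identifiers and toy sequences, not real database entries.
records = [
    ("entry_A", "MKTAYIAKQR"),
    ("entry_B", "MKTAYIAKQR"),  # exact duplicate of entry_A
    ("entry_C", "MKTAYIAKQK"),  # differs by one residue, so it stays separate
]

clusters = merge_identical(records)
for sequence, identifiers in clusters.items():
    print(sequence, identifiers)
```

Keying on the full sequence string means that a single-residue difference, such as a polymorphism, produces a separate cluster, which is why databases that only merge identical sequences can still be highly redundant.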

UniProtKB

UniProt Knowledgebase: 2 sections

1. UniProtKB/Swiss-Prot Non-redundant, high-quality manual annotation - reviewed

2. UniProtKB/TrEMBL Redundant, automatically annotated - unreviewed

www.uniprot.org

UniProtKB entry content: Sequence, Sequence features, Ontologies, References, Nomenclature, Splice variants, Annotations

Manual annotation of UniProtKB/Swiss-Prot

Sequence curation, stable identifiers, versioning and archiving

For example – erroneous gene model predictions, frameshifts, premature stop codons, read-throughs, erroneous initiator methionines…

Splice variants

Identification of amino acid variants

… and of PTMs

… and also

Domain annotation

Binding sites

Protein nomenclature

Controlled vocabularies used whenever possible…

Annotation: more than 30 defined fields

… and also imported from external resources

Binary interactions taken from the IntAct database

Interactors of human p53

Controlled vocabulary usage increasing – for example from the Gene Ontology

Annotation for human Rhodopsin

Sequence evidence

Type of evidence that supports the existence of a protein

1. Evidence at protein level: There is experimental evidence of the existence of a protein (e.g. Edman sequencing, MS, X-ray/NMR structure, good-quality protein-protein interaction, detection by antibodies).

2. Evidence at transcript level: The existence of a protein has not been proven, but there is expression data (e.g. existence of cDNAs, RT-PCR or Northern blots) that indicates the existence of a transcript.

3. Inferred from homology: The existence of a protein is likely because orthologs exist in closely related species.

4. Predicted

5. Uncertain
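One way to work with the five evidence levels is to treat them as an ordered scale, where a lower number means stronger evidence. The following is a minimal sketch under that assumption; the class name, the helper function and the entry names are hypothetical and are not part of any UniProt software.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The five evidence levels; a lower value means stronger evidence."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    INFERRED_FROM_HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def at_least(entries, weakest):
    """Return names of entries whose evidence is as strong as `weakest` or stronger."""
    return [name for name, level in entries if level <= weakest]


# Hypothetical entries, used only to show the filtering.
entries = [
    ("entry_A", ProteinExistence.PROTEIN_LEVEL),
    ("entry_B", ProteinExistence.PREDICTED),
    ("entry_C", ProteinExistence.TRANSCRIPT_LEVEL),
]

supported = at_least(entries, ProteinExistence.TRANSCRIPT_LEVEL)
print(supported)
```

Using an integer-backed enumeration keeps the numeric codes from the list while allowing comparisons such as "transcript level or better" to be written directly.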

Manual annotation of the human proteome (UniProtKB/Swiss-Prot)

• A draft of the complete human proteome has been available in UniProtKB/Swiss-Prot since 2008

• Manually annotated representation of 20,231 protein-coding genes with 36,865 protein sequences; an additional 33,243 UniProtKB/TrEMBL entries make up the rest of the complete proteome set

• Approximately 67,600 single amino acid polymorphisms (SAPs), mostly disease-linked

• ~75,500 post-translational modifications (PTMs)

• Close collaboration with NCBI, Ensembl, Sanger Institute and UCSC to provide the authoritative set to the user community

Searching UniProt – Simple Search

• Text-based searching

• Logical operators ‘&’ (and), ‘|’

Searching UniProt – Advanced Search

Searching UniProt – Search Results

Each linked to the UniProt entry

Searching UniProt – Search Results

Searching UniProt – Search Results

Searching UniProt – Blast Search

Searching UniProt – Blast Search

Searching UniProt – Blast Results

Alignment with query sequence

Searching UniProt – Blast Results

UniProtKB/TrEMBL

Multiple entries for the same protein (redundancy) can arise in UniProtKB/TrEMBL due to:

o Erroneous gene model predictions

o Sequence errors (frameshifts)

o Polymorphisms

o Alternative start sites

o Isoforms

Apart from 100% identical sequences, all merged sequences are analysed by a curator so that they can be annotated accordingly.

Why do we need predictive annotation tools?

[Chart: number of sequences over time for UniProtKB and UniProtKB/Swiss-Prot; x-axis: date; y-axis: number of sequences, 0 to 14,000,000]

• Automated clean-up of annotation from original nucleotide sequence entry

• Additional value added by using automatic annotation

• Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot

• Identifies all members of this family using pattern/motif/HMMs in InterPro

• Transfers common annotation to related family members in TrEMBL
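The three family-based steps in the list above can be sketched in a few lines. This is a toy illustration, assuming each entry is reduced to a set of signature identifiers and a set of annotation terms; the signature id, entry names, annotation terms and function name are all hypothetical, and the real pipeline is considerably more involved.

```python
def transfer_family_annotation(signature, reviewed, unreviewed):
    """Copy annotation shared by all reviewed family members to unreviewed members.

    `reviewed` and `unreviewed` map an entry name to a dict with the keys
    "signatures" (set of signature ids) and "annotation" (set of terms).
    """
    # Step 1: annotation common to every reviewed entry carrying the signature.
    family = [e for e in reviewed.values() if signature in e["signatures"]]
    if not family:
        return {}
    shared = set.intersection(*(e["annotation"] for e in family))

    # Steps 2 and 3: find unreviewed entries with the same signature and
    # add the shared annotation to whatever they already carry.
    return {
        name: entry["annotation"] | shared
        for name, entry in unreviewed.items()
        if signature in entry["signatures"]
    }


# Hypothetical data: "SIG0001" stands in for a family signature.
reviewed = {
    "reviewed_1": {"signatures": {"SIG0001"},
                   "annotation": {"kinase activity", "ATP binding", "nucleus"}},
    "reviewed_2": {"signatures": {"SIG0001"},
                   "annotation": {"kinase activity", "ATP binding"}},
}
unreviewed = {
    "unreviewed_1": {"signatures": {"SIG0001"}, "annotation": set()},
    "unreviewed_2": {"signatures": {"SIG0002"}, "annotation": set()},
}

result = transfer_family_annotation("SIG0001", reviewed, unreviewed)
print(result)
```

Only annotation present on every reviewed member is transferred, so a term seen in just one reviewed entry ("nucleus" here) is not propagated.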

Automatic Annotation

← Name (non-standard)

← Taxonomy

← Publication

← Sequence

InterPro

Finding a complete proteome in UniProtKB

Complete Proteomes

MS Proteomics

• Require each sequence (including isoforms) to be present in the dataset as a separate entity for search engines to access

• For higher organisms, with isoforms, an expanded set is made available on the FTP site

• FASTA files by FTP

• One file per species containing canonical + isoform sequences
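A file that mixes canonical and isoform sequences can be split again when reading it. The sketch below assumes headers of the form "db|accession|name description" and assumes that isoform accessions carry a "-number" suffix (for example P04637-2); the two records are modelled on the human p53 entry, but the sequences are short placeholders, not real data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession(header):
    """Return the accession from a 'db|accession|name description' header."""
    return header.split("|")[1]


def is_isoform(header):
    # Assumption: isoform accessions end in '-<number>', canonical ones do not.
    return "-" in accession(header)


# Placeholder sequences; only the header layout matters for this example.
fasta_text = """\
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens
MAAAKLLT
>sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53 OS=Homo sapiens
MAAAKL
LS
"""

entries = list(read_fasta(fasta_text.splitlines()))
canonical = [accession(h) for h, _ in entries if not is_isoform(h)]
isoforms = [accession(h) for h, _ in entries if is_isoform(h)]
print(canonical, isoforms)
```

Because every isoform is a separate record with its own accession, a search engine can report a peptide match against the specific isoform rather than only against the canonical sequence.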
