EBI is an Outstation of the European Molecular Biology Laboratory.
Protein Sequence Database: UniProt
Jennifer McDowall
Overview
1) The UniProt databases
2) UniProt/SwissProt annotation
3) UniProt/TrEMBL automatic annotation
4) Using the uniprot.org website
5) Computational access
1) The UniProt databases
Source of protein sequence data
Submitters: individual scientists, large-scale sequencing projects, patent offices
Nucleotide sequencing: submit to nucleotide sequence database
Protein sequencing: submit to protein sequence database
Nucleotide sequence database: derive protein sequence for protein sequence database
• Protein sequencing is rare
• Most protein sequence is derived from nucleotide data
Protein sequence is mainly derived data
DNA sequence: ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT
transcribe
Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
translate
Derived protein sequence: MRSDECCCAMSC
submit
Predicted start, predicted stop and predicted splice sites may not have direct evidence
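The transcribe-then-translate derivation can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a reading frame that starts at the first base; the function names are hypothetical, and real gene prediction (start, stop and splice sites) is far more involved.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Turn a coding-strand DNA sequence into mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Calling `translate(transcribe(dna))` on a coding sequence gives the derived protein. Nothing in this sketch checks whether the chosen start or frame is biologically correct.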
How to find the information you need?
[Figure: unordered fragments of raw nucleotide sequence]
High quality protein sequence
• Non-redundant data
• Splice isoforms, disease variants, PTMs
• Sequence archiving essential
Protein identification
• Stable identifiers
• Consistent nomenclature
Protein annotation
• Information on protein function, biological processes, molecular interactions, pathways
UniProt
Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD
Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database
http://www.uniprot.org/
UniProt Consortium
Where does the data come from?
Sequence sources: ENA
UniParc
Exchange data daily
Where does the data come from?
Sequence sources: ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…
UniParc: history of sequences
UniMES: metagenomic & environmental
UniProtKB/TrEMBL: taxonomy known
UniProtKB/SwissProt: remove redundancy, manual annotation, high quality annotation
Where does the data come from?
Sequence sources: ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…
UniParc
UniMES (metagenomic & environmental): UniMES Clusters
UniProtKB/TrEMBL and UniProtKB/SwissProt (taxonomy known): UniRef Clusters
4 components of UniProt
UniProtKB
• Swiss-Prot: non-redundant, manual annotation
• TrEMBL: redundant, automatic annotation
UniRef
• Combines sequences (speed searching)
• UniRef100, UniRef90, UniRef50
UniParc
• Complete history of sequences (no annotation)
• Cross-links to external sequence sources
UniMES
• Sequences from metagenomic projects
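The idea of combining sequences so that each distinct sequence is searched only once can be sketched as below. This is a simplified illustration, assuming exact string identity only; it does not reproduce how UniRef90 or UniRef50 cluster by sequence identity, and the entry IDs are made up.

```python
from collections import defaultdict


def group_identical(entries):
    """Map each distinct sequence to the list of entry IDs that carry it."""
    clusters = defaultdict(list)
    for entry_id, sequence in entries:
        clusters[sequence.upper()].append(entry_id)
    return dict(clusters)


# Hypothetical entries: two of the three share the same sequence.
entries = [
    ("entry_A", "MRSDECC"),
    ("entry_B", "MRSDECC"),
    ("entry_C", "MKTAYIA"),
]
clusters = group_identical(entries)
```

A search then only needs to visit the keys of `clusters`, while the member lists keep the link back to every original entry.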
Browsing a UniParc entry
• Accession
• Sequence
• List of databases containing sequence
• Navigate to individual entries
• Deleted entries identified (greyed out)
• Download data
Browsing a UniProtKB/SwissProt entry
• Names (synonyms) and taxonomy
• Protein attributes
• General information
• Annotation
• Ontologies
• Protein interactions
• Splice variants
• Sequence features
• Sequence
• References
• Navigate to external data sources, e.g. Ensembl
• Download data
Browsing a UniRef90 entry
• Cluster name
• List of entries in cluster
• Taxonomy of each entry
• % identity of sequences in cluster
• Status (SwissProt and/or TrEMBL)
Faster and more sensitive sequence search with no loss of information
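The "% identity" figure can be illustrated with a short calculation over two sequences that have already been aligned. This is a sketch under one stated assumption: identity is counted over columns where neither sequence has a gap. Other definitions use a different denominator, and the slides do not say which one UniRef reports.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither side is a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

For example, `percent_identity("MRSDE", "MRSNE")` compares five columns, four of which match.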
Taxonomic distribution of species
All kingdoms:
• Bacteria (61%)
• Eukaryota (32%)
• Archaea (4%)
• Viruses (3%)
Within Eukaryota:
• Other mammals (27%)
• Fungi (18%)
• Viridiplantae (18%)
• Homo (12%)
• Other Vertebrata (10%)
• Other (8%)
• Insecta (5%)
• Nematoda (2%)
SwissProt – most represented species
Mainly model organisms
Protein Existence tag
Protein existence level          Total
Evidence at protein level        13%
Evidence at transcript level     12%
Inferred from homology           70%
Predicted                        5%
Uncertain (mainly TrEMBL)        -
!! Not sequence validation !!
Protein existence categories
Protein existence level          Human
Evidence at protein level        59%
Evidence at transcript level     37.5%
Inferred from homology           1%
Predicted                        0.5%
Uncertain (mainly TrEMBL)        2%
!! Not sequence validation !!
2) UniProtKB/SwissProt annotation
Annotation sources for UniProtKB
UniProtKB
* Manual curation
* Literature-based annotation
* Sequence analysis
* Automated annotation
Sequence analysis
• Transmembrane prediction
• InterPro classification
• Signal prediction
• Protein classification
• Other predictions
Some data sources for annotation
• PRIDE: protein identification data
• GO: functional info
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications
Features of UniProtKB
Sequence
Annotations
Nomenclature
References
Ontologies
Splice variants
Sequence features
A wealth of external links
125 links!
SwissProt manual annotation
1. Protein sequence
• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...
2. Biological information
• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...
Problem #1: sequence correction
~20% of Swiss-Prot entries required correction
• Typical problems:
– Unresolved conflicts (sequencing errors)
– Erroneous gene model predictions
– Wrong initiation sites
– Frameshifts...
Sequence quality from genome projects
• Drosophila:
  – Well-curated
  – 1.8% of gene models incorrect
• Arabidopsis:
  – Annotated when sequenced, but no update
  – 19.5% of gene models incorrect
• Tetraodon nigroviridis:
  – Automatic run through (no manual intervention)
  – >90% of gene models incorrect
Sequence curation
Sequencing errors
Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines
Problem #2: proteome complexity
1 SwissProt entry = 1 gene (1 species)
Genome: ~20,000 human protein-coding genes
Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...)
Proteome: >1,000,000 human proteins (post-translational modification)
Annotation of sequence differences
Merging entries
Multiple entries for the same protein exist in TrEMBL (redundancy)
Because of:
1) Errors
• Erroneous gene model predictions; sequence errors
2) Natural variation
• Polymorphisms; alternative start sites; alternative splicing
Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.
Example
Multiple alignment of the end of the available GCR sequences:
Annotation of the sequence differences (protein diversity):
Merging entries
Sequence curation
Alternative Splicing
Sequence curation
Identification of amino acid variants
....and of PTMs
....and also
Sequence curation
Domain annotation
Binding sites
SwissProt manual annotation
1. Protein sequence
• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...
2. Biological information
• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...
Sources of annotated information
UniProtKB/SwissProt gathers information from multiple sources:
• Publications (literature/PubMed)
• Prediction programs (PROSITE, Anabelle)
• Contact with experts
• Other databases
• Nomenclature committees
Nomenclature
Synonyms useful for literature searching
Nomenclature
Provides synonyms and cleavage products of bifunctional proteins
Annotation comments
Controlled vocabularies used whenever possible…
>30 comment fields
Disease association
Mendelian Inheritance in Man provides information on genetic disease associations
Pharmacogenomics database
Sequence annotation (Features)
…enable researchers to obtain a summary of what is known about a protein…
Sequence annotation (Features)
Feature (e.g. domain) highlighted on sequence
Gene Ontology
1. Biological Process
A commonly recognized series of events
• Cell division
• Mitosis
• Organelle fission
2. Molecular Function
An elemental activity or task or job
• Protein kinase activity
• Insulin binding
• Insulin receptor activity
3. Cellular Component
Where a gene product is located
• Mitochondrion
• Mitochondrial matrix
• Mitochondrial membrane
Gene Ontology
Annotation for human Rhodopsin:
Imported annotation
Binary interactions are taken from the IntAct database
Interactors of human p53
Evidence for annotation
UniProtKB/Swiss-Prot distinguishes between experimental and predicted data
Type of evidence                                          Evidence tag
1st: Experimental evidence                                Reference provided
2nd: Light experimental evidence                          Probable
3rd: Inferred by similarity with homologous protein       By similarity
4th: Inferred by sequence prediction                      Potential
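When annotations carry these tags, a script can rank them by strength of evidence. The sketch below is illustrative only: the tag strings come from the table above, while the dictionary and function names are hypothetical and not part of any UniProt tool.

```python
# Rank taken from the slide: 1 is the strongest evidence, 4 the weakest.
EVIDENCE_RANK = {
    "Reference provided": 1,  # experimental evidence
    "Probable": 2,            # light experimental evidence
    "By similarity": 3,       # inferred from a homologous protein
    "Potential": 4,           # inferred by sequence prediction
}


def strongest_evidence(tags):
    """Return the tag with the best (lowest) rank from a non-empty list."""
    return min(tags, key=EVIDENCE_RANK.__getitem__)


def is_experimental(tag):
    """True for the two tags that the slide ties to experimental evidence."""
    return EVIDENCE_RANK[tag] <= 2
```

For instance, `strongest_evidence(["Potential", "By similarity"])` picks "By similarity", which is still a prediction rather than an experimental result.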
Evidence for annotation
Proven
Proven
Potential
Proven
By similarity
Source references included
Versioning and archiving
Able to compare versions directly
3) UniProtKB/TrEMBL automatic annotation
UniProtKB/TrEMBL
!! Caution !!
Quality of UniProtKB/TrEMBL entries depends upon quality of submissions in original EMBL/GenBank/DDBJ entry.
Annotated proteins guide TrEMBL entries
Example for rhodopsin:
• 379 annotated UniProtKB/Swiss-Prot entries
• 9,186 un-annotated UniProtKB/TrEMBL entries
Don’t want un-annotated TrEMBL entries to be skeleton entries with no information
Automatic annotation added using Swiss-Prot and InterPro (function prediction database)
Automatic annotation
UniProtKB uses 2 prediction programs (based on InterPro and Swiss-Prot):
UniRule: maintains a set of manual annotation rules.
SAAS: generates a set of decision trees using data mining (new set every UniProtKB release).
Automatic annotation - InterPro
Swiss-Prot: manually annotated sequence; groups of related proteins (same family or share domains)
InterPro: protein signatures
TrEMBL: uncharacterised sequence
Automatic annotation pipeline
Browsing a UniProtKB/TrEMBL entry
• Name (could be clone name)
• Taxonomy
• Automatic annotation (derived from InterPro)
• Ontologies (both automatic and manual curation)
4) Using the www.uniprot.org website
www.uniprot.org
Useful Features
Integrated BLAST and Alignments
Batch retrieval in a variety of formats
Simple and modular advanced searching
uniprot.org: anatomy of an entry
Entry Info
Link to UniSave
Link to UniRef
Variety of formats
Navigation bar
Customize order
Searching UniProt
Search tools include:
• Text Search
• Blast sequence search
• Additional search engines through EBI (e.g. SSearch and FASTA)
http://www.uniprot.org/
Search
Powerful text search tool with autocompletion and refinement options.
Look for UniProt entries and documentation using biological information.
Search sequence database, literature, taxonomy…
More search options
Refine search
Search results
• Define type and order of search results
• Each result linked to the UniProt entry (SwissProt or TrEMBL)
• Select specific entries
• Can retrieve or BLAST sequence
• Keeps selected entries throughout session
• Can retrieve or align >2 sequences
BLAST
A tool with standard options for sequence similarity searches against UniProt databases
Search refinement (change parameters)
Can query using protein or nucleotide sequences
Can query using identifier (e.g. P00750):
• UniProtKB accession (P00750)
• Specific version (P00750:2)
• Splice variant (P00750-2)
• Name (A4_HUMAN)
• UniParc accession (UPI0000000001)
• UniRef accession (UniRef100_P00750)
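The identifier styles listed above can be told apart with regular expressions. The sketch below is an approximation written for illustration: the accession pattern follows the format UniProt documents for accession numbers, while the other patterns, the labels and the function name are simplifying assumptions rather than the rules the website itself applies.

```python
import re

# UniProtKB accession pattern (6 or 10 characters).
ACC = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Checked in order; the first pattern that matches the whole string wins.
PATTERNS = [
    ("UniParc accession", re.compile(r"UPI[0-9A-F]{10}")),
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_\S+")),
    ("specific version", re.compile(ACC + r":\d+")),
    ("splice variant", re.compile(ACC + r"-\d+")),
    ("UniProtKB accession", re.compile(ACC)),
    ("entry name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify_identifier(text: str) -> str:
    """Return a label for the identifier style, or 'unknown'."""
    candidate = text.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(candidate):
            return label
    return "unknown"
```

A classifier like this only recognises the shape of an identifier; it says nothing about whether the entry actually exists.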
Threshold = expectation (E) value
Hit categories: best; should verify; biological significance less likely
Provides cut-off between good and poor hits
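The effect of an E-value threshold, together with a cap on the number of hits, can be sketched on a list of results. This is a hedged illustration: hits are assumed to be simple (identifier, E-value) pairs, the default cut-off of 10 is a common BLAST default rather than a value taken from the slides, and the hit names are invented.

```python
def filter_hits(hits, max_evalue=10.0, max_hits=None):
    """Keep hits at or below the E-value cut-off, best (lowest) first."""
    kept = sorted(
        (hit for hit in hits if hit[1] <= max_evalue),
        key=lambda hit: hit[1],
    )
    return kept if max_hits is None else kept[:max_hits]


# Hypothetical results: (identifier, E-value)
hits = [("hit_b", 0.5), ("hit_c", 20.0), ("hit_a", 1e-50)]
strict = filter_hits(hits, max_evalue=1e-3)
```

Lowering `max_evalue` trades sensitivity for confidence: fewer hits survive, but those that do are less likely to be chance matches.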
Matrix = assigns probability score for each position
Controls sensitivity of search
Filtering = masks low complexity regions
Stretches of cysteines or hydrophobic regions can cause spurious matches
Replaces them with X’s
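Masking can be illustrated with a deliberately crude stand-in: replace any run of one repeated residue with X's. Production filters (SEG, for example) use compositional statistics instead, so the function below, its name and its `min_run` parameter are assumptions made only to show the idea.

```python
import re


def mask_homopolymer_runs(sequence: str, min_run: int = 5, mask: str = "X") -> str:
    """Replace runs of at least min_run identical residues with the mask character."""
    # (.) captures one residue; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda match: mask * len(match.group(0)), sequence)
```

The masked sequence keeps its original length, so alignment coordinates still map back to the unmasked protein.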
Gapped = allows gaps in sequence
• Yes = to find more distant homologues
• No = to find closest matches (strict)
Hits = limits number of results
BLAST results
Can filter or customize results
Shows length of query sequence aligned
Select match to see alignment
BLAST results – pairwise alignment
Alignment of selected sequence
Colour alignment by annotation or properties
BLAST results
Further down the results page… details about matching protein sequences
Can align checked sequences
BLAST results – multiple alignment
Alignment of selected sequence
Can add additional sequences to alignment
Colour alignment by annotation or properties
Align
ClustalW multiple alignment tool with options to highlight amino acid properties and feature annotation
Retrieve
UniProt-specific tool:
- Retrieve a list of entries in several standard formats.
- Then query retrieved sequences with UniProt search tool.
ID Mapping
Allows mapping between different databases for a given protein
Other tools
http://www.ebi.ac.uk/
Sequence Similarity & Analysis
• BLAST
• FASTA
• Specialized searches
http://www.ebi.ac.uk/Tools/sss/
5) Computational access
Computational access to UniProt
http://www.uniprot.org/
http://www.ebi.ac.uk/uniprot/
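Programmatic access usually means building an entry URL, downloading the text and parsing it. The sketch below assumes the REST-style address `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, which does not appear on the slides and may change; the function names are hypothetical. The FASTA parser works offline, and only `fetch_entry` needs a network connection.

```python
from urllib.request import urlopen

# Assumption: REST base address, not taken from the slides.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession: str, fmt: str = "fasta") -> str:
    """Build the download URL for one entry in the requested format."""
    return f"{BASE_URL}/{accession}.{fmt}"


def parse_fasta(text: str):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_entry(accession: str):
    """Download one entry as FASTA and parse it (requires network access)."""
    with urlopen(entry_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

For batch work, the Retrieve and ID Mapping tools described earlier are likely a better fit than fetching entries one at a time.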
Acknowledgements
Rolf Apweiler
Ioannis Xenarios
Cathy H Wu
+100 annotators