uniprot

Protein Sequence Database: UniProt. Jennifer McDowall. EBI is an Outstation of the European Molecular Biology Laboratory.

Upload: haines

Post on 11-Jan-2016

TRANSCRIPT

Page 1: UniProt

Protein Sequence Database: UniProt

Jennifer McDowall

EBI is an Outstation of the European Molecular Biology Laboratory.

Page 2: UniProt

Overview

1) The UniProt databases

2) UniProt/SwissProt annotation

3) UniProt/TrEMBL automatic annotation

4) Using the uniprot.org website

5) Computational access

Page 3: UniProt

1) The UniProt databases

Page 4: UniProt

Source of protein sequence data

Individual scientists, large-scale sequencing projects and patent offices submit:

• Nucleotide sequencing data to the nucleotide sequence database
• Protein sequencing data to the protein sequence database

Protein sequence derived from the nucleotide sequence database also enters the protein sequence database.

• Protein sequencing is rare
• Most protein sequence is derived from nucleotide data

Page 5: UniProt

Protein sequence is mainly derived data

DNA sequence: ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT

(transcribe)

Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC

(translate)

Derived protein sequence: MRSDECCCAMSC

(submit)

Page 6: UniProt

Protein sequence is mainly derived data

DNA sequence: ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT

(transcribe)

Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC

(translate)

Derived protein sequence: MRSDECCCAMSC

(submit)

• Predicted start
• Predicted splice sites
• Predicted stop

These predictions may not have direct evidence
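
The transcribe and translate steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a reading frame that starts at the first base; the function names are made up for this example. Applied to the mRNA shown on the slide it yields MRSDECCCAMSC (the fourth codon, GAU, encodes aspartate, D).

```python
# Standard genetic code, written in the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Turn a coding-strand DNA sequence into mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA reading frame, stopping at the first stop codon."""
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for pos in range(0, len(mrna) - len(mrna) % 3, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


MRNA = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"  # mRNA from the slide
print(translate(MRNA))
```

Real gene prediction also has to choose the start codon, the splice sites and the stop, which is why those features are flagged as predictions without direct evidence.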

Page 7: UniProt

How to find the information you need?

[Figure: a jumble of raw nucleotide sequence fragments]

High quality protein sequence

• Non-redundant data
• Splice isoforms, disease variants, PTMs
• Sequence archiving essential

Protein identification

• Stable identifiers
• Consistent nomenclature

Protein annotation

• Information on protein function, biological processes, molecular interactions, pathways

Page 8: UniProt

UniProt

Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD

Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database

http://www.uniprot.org/

Page 9: UniProt

UniProt Consortium

Page 10: UniProt

Where does the data come from?

Sequence sources

ENA

exchange data daily

UniParc

Page 11: UniProt

Where does the data come from?

Sequence sources: ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…

UniParc: history of sequences

• UniMES: metagenomic & environmental
• UniProtKB/TrEMBL: taxonomy known
• UniProtKB/SwissProt: high quality annotation (remove redundancy, manual annotation)

Page 12: UniProt

Where does the data come from?

Sequence sources: ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…

UniParc

• UniMES: metagenomic & environmental; grouped into UniMES Clusters
• UniProtKB/TrEMBL and UniProtKB/SwissProt: taxonomy known; grouped into UniRef Clusters

Page 13: UniProt

4 components of UniProt

UniProtKB

• Swiss-Prot: non-redundant, manual annotation
• TrEMBL: redundant, automatic annotation

UniRef

• Combines sequences (speed searching)
• UniRef100, UniRef90, UniRef50

UniParc

• Complete history of sequences (no annotation)
• Cross-links to external sequence sources

UniMES

• Sequences from metagenomic projects

Page 14: UniProt

Browsing a UniParc entry

Sequence

Navigate to individual entries

Download data

Deleted entries identified (greyed out)

Accession

List of databases containing sequence

Page 15: UniProt

Browsing a UniProtKB/SwissProt entry

References

Navigate to external data sources, e.g. Ensembl

Download data

Names (synonyms) and taxonomy

Ontologies

Protein attributes

Annotation

Protein interactions

Splice variants

Sequence features

General information

Sequence

Page 16: UniProt

Browsing a UniRef90 entry

Cluster name

List of entries in cluster

Taxonomy of each entry

% identity of sequences in cluster

Status (SwissProt and/or TrEMBL)

Faster and more sensitive sequence search with no loss of information
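
To make the "% identity" column concrete, here is a small sketch that computes percent identity between two already-aligned sequences and reports which UniRef identity levels that value would reach. It is a simplification under stated assumptions: the function names are hypothetical, identity is counted over ungapped columns only, and real UniRef clustering applies further criteria that are not modelled here.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def uniref_levels(identity: float) -> list[str]:
    """Names of the UniRef identity levels (100, 90, 50) that this value reaches."""
    thresholds = {"UniRef100": 100.0, "UniRef90": 90.0, "UniRef50": 50.0}
    return [name for name, cutoff in thresholds.items() if identity >= cutoff]


identity = percent_identity("MRSDECCCAM", "MRSNECCCAM")  # one mismatch in ten columns
print(identity, uniref_levels(identity))
```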

Page 17: UniProt

Taxonomic distribution of species

All kingdoms:

• Bacteria (61%)
• Eukaryota (32%)
• Archaea (4%)
• Viruses (3%)

Within Eukaryota:

• Other mammals (27%)
• Homo (12%)
• Other (8%)
• Nematoda (2%)
• Insecta (5%)
• Fungi (18%)
• Viridiplantae (18%)
• Other Vertebrata (10%)

Page 18: UniProt

SwissProt – most represented species

Mainly model organisms

Page 19: UniProt

Protein Existence tag

Protein existence level (total):

• Evidence at protein level: 13%
• Evidence at transcript level: 12%
• Inferred from homology: 70%
• Predicted: 5%
• Uncertain (mainly TrEMBL): -

!! Not sequence validation !!

Page 20: UniProt

Protein existence categories

Protein existence level (human):

• Evidence at protein level: 59%
• Evidence at transcript level: 37.5%
• Inferred from homology: 1%
• Predicted: 0.5%
• Uncertain (mainly TrEMBL): 2%

!! Not sequence validation !!

Page 21: UniProt

2) UniProtKB/SwissProt annotation

Page 22: UniProt

Annotation sources for UniProtKB

UniProtKB

* Manual curation

* Literature-based annotation

* Sequence analysis: transmembrane prediction, InterPro classification, signal prediction, other predictions

* Protein classification

* Automated annotation

Some data sources for annotation:

• PRIDE: protein identification data
• GO: functional info
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications

Page 23: UniProt

Features of UniProtKB

Sequence

Annotations

Nomenclature

References

Ontologies

Splice variants

Sequence features

Page 24: UniProt

A wealth of external links

125 links!

Page 25: UniProt

SwissProt manual annotation

1. Protein sequence

• Merge available CDS (coding sequence)

• Annotate sequence discrepancies

• Report sequencing errors...

2. Biological information

• Extract literature information

• Orthologue data propagation

• Protein sequence analysis...

Page 26: UniProt

Problem #1: sequence correction

~20% of Swiss-Prot entries required correction

• Typical problems:

– Unsolved conflicts (sequencing errors)

– Erroneous gene model predictions

– Wrong initiation sites

– Frameshifts...

Page 27: UniProt

Sequence quality from genome projects

• Drosophila:
  • Well-curated
  • 1.8% of gene models incorrect

• Arabidopsis:
  • Annotated when sequenced, but no update
  • 19.5% of gene models incorrect

• Tetraodon nigroviridis:
  • Automatic run through (no manual intervention)
  • >90% of gene models incorrect

Page 28: UniProt

Sequence curation

Sequencing errors

Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines

Page 29: UniProt

Problem #2: proteome complexity

1 SwissProt entry = 1 gene (1 species)

• Genome: ~20,000 human protein-coding genes
• Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...)
• Proteome: >1,000,000 human proteins (post-translational modification)

Annotation of sequence differences

Page 30: UniProt

Merging entries

Multiple entries for the same protein exist in TrEMBL (redundancy) because of:

1) Errors
• Erroneous gene model predictions; sequence errors

2) Natural variation
• Polymorphisms; alternative start sites; alternative splicing

Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.

Page 31: UniProt

Example

Multiple alignment of the end of the available GCR sequences:

Annotation of the sequence differences (protein diversity):

Page 32: UniProt

Merging entries

Page 33: UniProt

Sequence curation

Alternative Splicing

Page 34: UniProt

Sequence curation

Alternative Splicing

Page 35: UniProt

Sequence curation

Alternative Splicing

Page 36: UniProt

Sequence curation

Alternative Splicing

Page 37: UniProt

Sequence curation

Alternative Splicing

Page 38: UniProt

Sequence curation

Identification of amino acid variants

....and of PTMs

....and also

Page 39: UniProt

Sequence curation

Domain annotation

Binding sites

Page 40: UniProt

SwissProt manual annotation

1. Protein sequence

• Merge available CDS (coding sequence)

• Annotate sequence discrepancies

• Report sequencing errors...

2. Biological information

• Extract literature information

• Orthologue data propagation

• Protein sequence analysis...

Page 41: UniProt

Sources of annotated information

UniProtKB/SwissProt gathers information from multiple sources:

• Publications (literature/PubMed)

• Prediction programs (Prosite, Anabelle)

• Contact with experts

• Other databases

• Nomenclature committees

Page 42: UniProt

Nomenclature

Synonyms useful for literature searching

Page 43: UniProt

Nomenclature

Provides synonyms and cleavage products of bifunctional proteins

Page 44: UniProt

Annotation comments

Controlled vocabularies used whenever possible…

>30 comment fields

Page 45: UniProt

Disease association

Mendelian Inheritance in Man provides information on genetic disease associations

Pharmacogenomics database

Page 46: UniProt

Sequence annotation (Features)

…enable researchers to obtain a summary of what is known about a protein…

Page 47: UniProt

Sequence annotation (Features)

Feature (e.g. domain) highlighted on sequence

Page 48: UniProt

Gene Ontology

1. Biological Process

A commonly recognized series of events

• Cell division
• Mitosis
• Organelle fission

2. Molecular Function

An elemental activity or task or job

• Protein kinase activity
• Insulin binding
• Insulin receptor activity

3. Cellular Component

Where a gene product is located

• Mitochondrion
• Mitochondrial matrix
• Mitochondrial membrane

Page 49: UniProt

Gene Ontology

Annotation for human Rhodopsin:

Page 50: UniProt

Imported annotation

Binary interactions are taken from the database

Interactors of human p53

Page 51: UniProt

Evidence for annotation

UniProtKB/Swiss-Prot distinguishes between experimental and predicted data

Type of evidence | Evidence tag
1st: Experimental evidence | Reference provided
2nd: Light experimental evidence | Probable
3rd: Inferred by similarity with homologous protein | By similarity
4th: Inferred by sequence prediction | Potential

Page 52: UniProt

Evidence for annotation

Proven

Proven

Potential

Proven

By similarity

Page 53: UniProt

Source references included

Page 54: UniProt

Versioning and archiving

Page 55: UniProt

Versioning and archiving

Able to compare versions directly

Page 56: UniProt

Versioning and archiving

Page 57: UniProt

3) UniProtKB/TrEMBL automatic annotation

Page 58: UniProt

UniProtKB/TrEMBL

!! Caution !!

Quality of UniProtKB/TrEMBL entries depends upon quality of submissions in original EMBL/GenBank/DDBJ entry.

Page 59: UniProt

Annotated proteins guide TrEMBL entries

Example for rhodopsin:

• 379 annotated UniProtKB/Swiss-Prot entries

• 9,186 un-annotated UniProtKB/TrEMBL entries

Automatic annotation added using Swiss-Prot and InterPro (function prediction database)

Don't want un-annotated TrEMBL to be skeleton entries with no information

Page 60: UniProt

Automatic annotation

UniProtKB uses 2 prediction programs, working from InterPro and Swiss-Prot:

UniRule: maintains a set of manual annotation rules.

SAAS: generates a set of decision trees using data mining (new set every UniProtKB release).

Page 61: UniProt

Automatic annotation - InterPro

Swiss-Prot (manually annotated sequence): groups of related proteins (same family or share domains)

InterPro: protein signatures

TrEMBL (uncharacterised sequence): annotated through the automatic annotation pipeline

Page 62: UniProt

Browsing a UniProtKB/TrEMBL entry

Name (could be clone name)

Automatic annotation (derived from InterPro)

Ontologies (both automatic and manual curation)

Taxonomy

Page 63: UniProt

4) Using the www.uniprot.org website

Page 64: UniProt

www.uniprot.org

Useful Features

Integrated BLAST and Alignments

Batch retrieval in a variety of formats

Simple and modular advanced searching

Page 65: UniProt

uniprot.org: anatomy of an entry

Entry Info

Link to UniSave

Link to UniRef

Variety of formats

Navigation bar

Customize order

Page 66: UniProt

uniprot.org: anatomy of an entry

Entry Info

Link to UniSave

Link to UniRef

Variety of formats

Navigation bar

Customize order

Page 67: UniProt

Searching UniProt

Search tools include:

• Text Search

• BLAST sequence search

• Additional search engines through EBI (e.g. SSearch and FASTA)

http://www.uniprot.org/

Page 68: UniProt

Search

Powerful text search tool with autocompletion and refinement options

Look for UniProt entries and documentation using biological information

Page 69: UniProt

Search

Search sequence database, literature, taxonomy…

More search options

Page 70: UniProt

Search

Refine search

Page 71: UniProt

Search results

Page 72: UniProt

Search results

Define type and order of search results

Page 73: UniProt

Search results

Each result linked to the UniProt entry

SwissProt

TrEMBL

Select specific entries

Page 74: UniProt

Search results

Can retrieve or BLAST sequence

Keeps selected entries throughout session

Page 75: UniProt

Search results

Can retrieve or align >2 sequences

Page 76: UniProt

BLAST

A tool with standard options for searching the UniProt databases by sequence similarity

Search refinement (change parameters)

Page 77: UniProt

BLAST

Can query using protein or nucleotide sequences

Page 78: UniProt

BLAST

P00750

Can query using identifier:

• UniProtKB accession (P00750)

• Specific version (P00750:2)

• Splice variant (P00750-2)

• Name (A4_HUMAN)

• UniParc accession (UPI0000000001)

• UniRef accession (UniRef100_P00750)
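
The identifier forms listed above can be told apart by shape. The sketch below is illustrative only: the regular expressions are assumptions based on the examples on this slide and on the commonly published UniProtKB accession pattern, and `classify_identifier` is a hypothetical helper, not part of any UniProt tool.

```python
import re

# Assumed patterns; verify against current UniProt documentation before relying on them.
ACCESSION = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
PATTERNS = [
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_\S+")),
    ("UniParc accession", re.compile(r"UPI[0-9A-F]{10}")),
    ("Specific version", re.compile(ACCESSION + r":[0-9]+")),
    ("Splice variant", re.compile(ACCESSION + r"-[0-9]+")),
    ("UniProtKB accession", re.compile(ACCESSION)),
    ("Name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify_identifier(identifier: str) -> str:
    """Return the first label whose pattern matches the whole identifier."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for example in ["P00750", "P00750:2", "P00750-2", "A4_HUMAN",
                "UPI0000000001", "UniRef100_P00750"]:
    print(example, "->", classify_identifier(example))
```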

Page 79: UniProt

BLAST

Threshold = expectation (E) value

Provides cut-off between good and poor hits

Key to hits: best; should verify; biological significance less likely

Page 80: UniProt

BLAST

Matrix = assigns probability score for each position

Controls sensitivity of search

Page 81: UniProt

BLAST

Filtering = masks low complexity regions

Stretches of cysteines or hydrophobic regions can cause spurious matches

Replaces them with X's
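
As a rough picture of what masking does, the sketch below replaces long runs of a single residue with X's. This is not the filter BLAST actually applies (real low-complexity filters score compositional complexity over a window); it is a much simpler run-length rule, and both `mask_runs` and its `min_run` parameter are hypothetical names chosen for this example.

```python
import re


def mask_runs(sequence: str, min_run: int = 5) -> str:
    """Replace every run of min_run or more identical residues with X's."""
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # One residue followed by at least (min_run - 1) repeats of itself.
    run = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return run.sub(lambda match: "X" * len(match.group(0)), sequence)


print(mask_runs("MKCCCCCCAL"))  # the run of six cysteines is masked
```

Masked positions keep the sequence length unchanged, so alignment coordinates in the results still refer to the original query.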

Page 82: UniProt

BLAST

Gapped = allows gaps in sequence

• Yes = to find more distant homologues
• No = to find closest matches (strict)

Page 83: UniProt

BLAST

Hits = limits number of results

Page 84: UniProt

BLAST results

Can filter or customize results

Page 85: UniProt

BLAST results

Shows length of query sequence aligned

Select match to see alignment

Page 86: UniProt

BLAST results – pairwise alignment

Alignment of selected sequence

Page 87: UniProt

BLAST results – pairwise alignment

Colour alignment by annotation or properties

Page 88: UniProt

BLAST results

Further down the results page… details about matching protein sequences

Page 89: UniProt

BLAST results

Can align checked sequences

Page 90: UniProt

BLAST results – multiple alignment

Alignment of selected sequence

Can add additional sequences to alignment

Page 91: UniProt

BLAST results – multiple alignment

Colour alignment by annotation or properties

Page 92: UniProt

Align

ClustalW multiple alignment tool with options for highlighting amino acids and feature annotation

Page 93: UniProt

Retrieve

UniProt-specific tool:

- retrieve a list of entries in several standard formats.

- then query retrieved sequences with UniProt search tool.

Page 94: UniProt

ID Mapping

Allows mapping between different databases for a given protein

Page 95: UniProt

Other tools

http://www.ebi.ac.uk/

Sequence Similarity & Analysis

Page 96: UniProt

Other tools

BLAST

FASTA

specialized searches

http://www.ebi.ac.uk/Tools/sss/

Page 97: UniProt

5) Computational access

Page 98: UniProt

Computational access to UniProt

http://www.uniprot.org/

Page 99: UniProt

Computational access to UniProt

http://www.ebi.ac.uk/uniprot/
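
Computational access usually means fetching entries by URL rather than through the browser. The slides do not show a URL scheme, so the sketch below rests on an assumption: that entries are served from a REST endpoint of the form `https://rest.uniprot.org/uniprotkb/<accession>.fasta`. The base URL and every function name here are illustrative, and the service may change; check the current UniProt documentation before use.

```python
from urllib.request import urlopen

# Assumed base URL of the UniProt REST service; it is not given in the slides.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession: str, fmt: str = "fasta") -> str:
    """Build the URL for one entry in the requested format."""
    return f"{BASE_URL}/{accession}.{fmt}"


def parse_fasta(text: str) -> tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:])


def fetch_fasta(accession: str) -> tuple[str, str]:
    """Download and parse one entry; requires network access."""
    with urlopen(entry_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Example call (needs network access):
#   header, sequence = fetch_fasta("P00750")
print(entry_url("P00750"))
```

Keeping URL construction and parsing separate from the download makes both easy to check without touching the network.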

Page 100: UniProt

Acknowledgements

Rolf Apweiler

Ioannis Xenarios

Cathy H Wu

+100 annotators