uniprot

Post on 11-Jan-2016


DESCRIPTION

Protein Sequence Database: UniProt. Jennifer McDowall. Overview: 1) The UniProt databases; 2) UniProt/SwissProt annotation; 3) UniProt/TrEMBL automatic annotation; 4) Using the uniprot.org website; 5) Computational access.

TRANSCRIPT

EBI is an Outstation of the European Molecular Biology Laboratory.

Protein Sequence Database: UniProt

Jennifer McDowall

Overview

1) The UniProt databases

2) UniProt/SwissProt annotation

3) UniProt/TrEMBL automatic annotation

4) Using the uniprot.org website

5) Computational access

1) The UniProt databases

Source of protein sequence data

Submitters: individual scientists, large-scale sequencing projects, patent offices

Nucleotide sequencing → submit → nucleotide sequence database → derive protein sequence → protein sequence database

Protein sequencing → submit → protein sequence database

• Protein sequencing is rare
• Most protein sequence is derived from nucleotide data

Protein sequence is mainly derived data

DNA sequence (submitted): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT

transcribe

Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC

translate

Derived protein sequence: MRSDECCCAMSC

Predicted start, predicted stop and predicted splice sites may not have direct evidence
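
The translation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a reading frame that starts at the first base; the name `translate` is chosen here for illustration and is not part of any UniProt tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in UCAG order.
# Assumption: real pipelines must pick the code table that fits the organism.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"  # mRNA shown on the slide
print(translate(mrna))
```

A protein derived this way is only as reliable as the predicted start, stop and splice sites that define the coding sequence.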

How to find the information you need?

High quality protein sequence

• Non-redundant data
• Splice isoforms, disease variants, PTMs
• Sequence archiving essential

Protein identification

• Stable identifiers
• Consistent nomenclature

Protein annotation

• Information: protein function, biological processes, molecular interactions, pathways

UniProt

Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD

Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database

http://www.uniprot.org/


UniProt Consortium

Where does the data come from?

Sequence sources: ENA (exchange data daily), model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…

UniParc: history of sequences

UniMES: metagenomic & environmental

UniProtKB/TrEMBL: taxonomy known

UniProtKB/SwissProt: remove redundancy, manual annotation, high quality annotation

UniMES Clusters

UniRef Clusters

4 components of UniProt

UniParc: complete history of sequences (no annotation); cross-links to external sequence sources

UniMES: sequences from metagenomic projects

UniProtKB:

• Swiss-Prot: non-redundant, manual annotation
• TrEMBL: redundant, automatic annotation

UniRef: combines sequences (speed searching); UniRef100, UniRef90, UniRef50

Browsing a UniParc entry

Sequence

Navigate to individual entries

Download data

Deleted entries identified (greyed out)

Accession

List of databases containing sequence

Browsing a UniProtKB/SwissProt entry

References

Navigate to external data sources, e.g. Ensembl

Download data

Names (synonyms) and taxonomy

Ontologies

Protein attributes

Annotation

Protein interactions

Splice variants

Sequence features

General information

Sequence

Browsing a UniRef90 entry

Cluster name

List of entries in cluster

Taxonomy of each entry

% identity of sequences in cluster

Status (SwissProt and/or TrEMBL)

Faster and more sensitive sequence search with no loss of information
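
The "% identity" figure reported for cluster members can be illustrated with a small Python sketch. It assumes the two sequences are already aligned to the same length with `-` as the gap character, and it ignores gapped columns, which is only one of several conventions. The function names are hypothetical, and the actual UniRef clustering pipeline applies further criteria that are not modelled here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def uniref_level(identity: float) -> str:
    """Name the tightest UniRef threshold (100/90/50) that an identity value meets."""
    for threshold in (100, 90, 50):
        if identity >= threshold:
            return f"UniRef{threshold}"
    return "below UniRef50"


# Toy sequences for illustration: one mismatch in ten aligned residues.
identity = percent_identity("MRSDECCCAM", "MRSNECCCAM")
print(identity, uniref_level(identity))
```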

Taxonomic distribution of species

All kingdoms:

• Bacteria (61%)
• Eukaryota (32%)
• Archaea (4%)
• Viruses (3%)

Within Eukaryota:

• Other mammals (27%)
• Homo (12%)
• Other (8%)
• Nematoda (2%)
• Insecta (5%)
• Fungi (18%)
• Viridiplantae (18%)
• Other Vertebrata (10%)


SwissProt – most represented species

Mainly model organisms

Protein Existence tag

Protein existence level          Total
Evidence at protein level        13%
Evidence at transcript level     12%
Inferred from homology           70%
Predicted                        5%
Uncertain (mainly TrEMBL)        -

!! Not sequence validation !!

Protein existence categories

Protein existence level          Human
Evidence at protein level        59%
Evidence at transcript level     37.5%
Inferred from homology           1%
Predicted                        0.5%
Uncertain (mainly TrEMBL)        2%

!! Not sequence validation !!

2) UniProtKB/SwissProt annotation

Annotation sources for UniProtKB

UniProtKB

* Manual curation
* Literature-based annotation
* Sequence analysis: transmembrane prediction, InterPro classification, signal prediction, other predictions, protein classification
* Automated annotation

Some data sources for annotation:

• PRIDE: protein identification data
• GO: functional info
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications

Features of UniProtKB

Sequence

Annotations

Nomenclature

References

Ontologies

Splice variants

Sequence features


A wealth of external links

125 links!

SwissProt manual annotation

1. Protein sequence

• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...

2. Biological information

• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...


Problem #1: sequence correction

~20% of Swiss-Prot entries required correction

• Typical problems:

– Unsolved conflicts (sequencing errors)

– Erroneous gene model predictions

– Wrong initiation sites

– Frameshifts...

Sequence quality from genome projects

• Drosophila:
  • Well-curated
  • 1.8% of gene models incorrect

• Arabidopsis:
  • Annotated when sequenced, but no update
  • 19.5% of gene models incorrect

• Tetraodon nigroviridis:
  • Automatic run through (no manual intervention)
  • >90% of gene models incorrect

Sequence curation

Sequencing errors

Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines

Problem #2: proteome complexity

1 SwissProt entry = 1 gene (1 species)

genome: ~20,000 human protein-coding genes

→ alternative splicing, alternative initiation, mRNA editing...

transcriptome: ~100,000 human transcripts

→ post-translational modification

proteome: >1,000,000 human proteins

Annotation of sequence differences

Merging entries

Multiple entries for the same protein exist in TrEMBL (redundancy)

Because of:

1) Errors: erroneous gene model predictions; sequence errors
2) Natural variation: polymorphisms; alternative start sites; alternative splicing

Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.
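
The split between entries that are 100% identical and entries that need a curator can be sketched as a simple grouping step. This is an illustrative Python sketch under stated assumptions, not the curation pipeline itself: the accessions and sequences below are made up, and `group_identical` is a hypothetical helper name.

```python
from collections import defaultdict


def group_identical(entries):
    """Group entry accessions by exact sequence identity.

    `entries` is an iterable of (accession, sequence) pairs. Returns two
    lists: groups of accessions that share a 100% identical sequence, and
    accessions whose sequence is unique (candidates for curator review).
    """
    by_sequence = defaultdict(list)
    for accession, sequence in entries:
        by_sequence[sequence.upper()].append(accession)
    merged = [accs for accs in by_sequence.values() if len(accs) > 1]
    unique = [accs[0] for accs in by_sequence.values() if len(accs) == 1]
    return merged, unique


# Hypothetical accessions and toy sequences, for illustration only.
entries = [
    ("X00001", "MRSDECCCAMSC"),
    ("X00002", "MRSDECCCAMSC"),  # identical to X00001
    ("X00003", "MRSDECCCAMSS"),  # one residue differs: needs a curator
]
merged, unique = group_identical(entries)
print(merged)
print(unique)
```

Deciding whether a near-identical entry reflects an error or natural variation is exactly the part that cannot be reduced to a string comparison.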


Example

Multiple alignment of the end of the available GCR sequences:

Annotation of the sequence differences (protein diversity):


Merging entries


Sequence curation

Alternative Splicing


Sequence curation

Identification of amino acid variants

....and of PTMs

....and also


Sequence curation

Domain annotation

Binding sites

SwissProt manual annotation

1. Protein sequence

• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...

2. Biological information

• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...


Sources of annotated information

UniProtKB/SwissProt gathers information from multiple sources:

• Publications (literature/PubMed)

• Prediction programs (Prosite, Anabelle)

• Contact with experts

• Other databases

• Nomenclature committees

Nomenclature

Synonyms useful for literature searching

Nomenclature

Provides synonyms and cleavage products of bifunctional proteins


Annotation comments

Controlled vocabularies used whenever possible…

>30 comment fields

Disease association

Mendelian Inheritance in Man provides information on genetic disease associations

Pharmacogenomics database

Sequence annotation (Features)

…enable researchers to obtain a summary of what is known about a protein…

Sequence annotation (Features)

Feature (e.g. domain) highlighted on sequence

Gene Ontology

1. Biological Process

A commonly recognized series of events

• Cell division
• Mitosis
• Organelle fission

2. Molecular Function

An elemental activity or task or job

• Protein kinase activity
• Insulin binding
• Insulin receptor activity

3. Cellular Component

Where a gene product is located

• Mitochondrion
• Mitochondrial matrix
• Mitochondrial membrane


Gene Ontology

Annotation for human Rhodopsin:


Imported annotation

Binary interactions are taken from the database

Interactors of human p53

Evidence for annotation

UniProtKB/Swiss-Prot distinguishes between experimental and predicted data

Type of evidence                                        Evidence tag
1st: Experimental evidence                              Reference provided
2nd: Light experimental evidence                        Probable
3rd: Inferred by similarity with homologous protein     By similarity
4th: Inferred by sequence prediction                    Potential


Evidence for annotation

Proven

Proven

Potential

Proven

By similarity

Source references included

Versioning and archiving

Able to compare versions directly

3) UniProtKB/TrEMBL automatic annotation

UniProtKB/TrEMBL

!! Caution !!

Quality of UniProtKB/TrEMBL entries depends upon quality of submissions in original EMBL/GenBank/DDBJ entry.

Annotated proteins guide TrEMBL entries

Example for rhodopsin:

• 379 annotated UniProtKB/Swiss-Prot entries
• 9,186 un-annotated UniProtKB/TrEMBL entries

Don’t want un-annotated TrEMBL to be skeleton entries with no information

Automatic annotation added using Swiss-Prot and InterPro (function prediction database)

Automatic annotation

UniProtKB uses 2 prediction programs:

UniRule: maintains a set of manual annotation rules.

SAAS: generates a set of decision trees using data mining (new set every UniProtKB release).

Input data: InterPro, Swiss-Prot

Automatic annotation - InterPro

Swiss-Prot (manually annotated sequence) → groups of related proteins (same family or share domains) → InterPro protein signatures → automatic annotation pipeline → TrEMBL (uncharacterised sequence)

Browsing a UniProtKB/TrEMBL entry

Name (could be clone name)

Automatic annotation (derived from InterPro)

Ontologies (both automatic and manual curation)

Taxonomy

4) Using the www.uniprot.org website


www.uniprot.org

Useful Features

Integrated BLAST and Alignments

Batch retrieval in a variety of formats

Simple and modular advanced searching

uniprot.org: anatomy of an entry

Entry Info

Link to UniSave

Link to UniRef

Variety of formats

Navigation bar

Customize order

Searching UniProt

Search tools include:

• Text Search
• BLAST sequence search
• Additional search engines through EBI (e.g. SSearch and FASTA)

http://www.uniprot.org/

Search

Powerful text search tool with autocompletion and refinement options

Look for UniProt entries and documentation using biological information

Search

Search sequence database, literature, taxonomy…

More search options

Search

Refine search

Search results

Define type and order of search results

Each result linked to the UniProt entry (SwissProt or TrEMBL)

Select specific entries

Keeps selected entries throughout session

Can retrieve or BLAST sequence

Can retrieve or align >2 sequences

BLAST

A tool with standard options to search sequences in UniProt databases by sequence similarity

Search refinement (change parameters)

Can query using protein or nucleotide sequences

BLAST

Can query using identifier (e.g. P00750):

• UniProtKB accession (P00750)
• Specific version (P00750:2)
• Splice variant (P00750-2)
• Name (A4_HUMAN)
• UniParc accession (UPI0000000001)
• UniRef accession (UniRef100_P00750)
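
The identifier styles listed on this slide are distinct enough to tell apart with regular expressions. The Python sketch below uses the accession pattern documented by UniProt; the remaining patterns are assumptions inferred from the slide's examples, and `classify` is a hypothetical helper, not part of the website.

```python
import re

# UniProtKB accession pattern as documented by UniProt.
ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Assumption: these patterns are inferred from the examples on the slide.
# Order matters: more specific forms are tried before the bare accession.
PATTERNS = [
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_\w+")),
    ("UniParc accession", re.compile(r"UPI[0-9A-F]{10}")),
    ("Specific version", re.compile(ACCESSION + r":\d+")),
    ("Splice variant", re.compile(ACCESSION + r"-\d+")),
    ("UniProtKB accession", re.compile(ACCESSION)),
    ("Name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify(identifier: str) -> str:
    """Return the label of the first pattern that matches the whole identifier."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for example in ["P00750", "P00750:2", "P00750-2", "A4_HUMAN",
                "UPI0000000001", "UniRef100_P00750"]:
    print(example, "->", classify(example))
```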


BLAST

= best

= should verify

= biological significance less likely

Threshold = expectation (E) value

Provides cut-off between good and poor hits

BLAST

Matrix = assigns probability score for each position

Controls sensitivity of search

BLAST

Filtering = masks low complexity regions

Stretches of cysteines or hydrophobic regions can cause spurious matches

Replaces them with X’s
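
To make the idea of masking concrete, here is a deliberately simplified Python sketch that replaces residues in low-entropy windows with `X`. It is not the filter used by BLAST; the window size and entropy cut-off are illustrative assumptions, and both function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2) -> str:
    """Replace residues in low-entropy windows with 'X'.

    `window` and `min_entropy` are illustrative defaults, not the settings
    of any real BLAST service.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# Toy sequence: a run of cysteines flanked by varied residues.
seq = "ACDEFGHIKLMN" + "C" * 12 + "PQRSTVWYACDE"
print(mask_low_complexity(seq))
```

Masked positions no longer seed or extend matches, so a cysteine-rich query does not pull in every other cysteine-rich protein.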

BLAST

Gapped = allows gaps in sequence

• Yes = to find more distant homologues
• No = to find closest matches (strict)

BLAST

Hits = limits number of results

BLAST results

Can filter or customize results

BLAST results

Shows length of query sequence aligned

Select match to see alignment

BLAST results – pairwise alignment

Alignment of selected sequence

Colour alignment by annotation or properties

BLAST results

Further down the results page… details about matching protein sequences

BLAST results

Can align checked sequences

BLAST results – multiple alignment

Alignment of selected sequence

Can add additional sequences to alignment

Colour alignment by annotation or properties

Align

ClustalW multiple alignment tool with amino acid highlighting options and a feature annotation highlighting option

Retrieve

UniProt-specific tool:

- retrieve a list of entries in several standard formats.
- then query retrieved sequences with UniProt search tool.

ID Mapping

Allows mapping between different databases for a given protein

Other tools

http://www.ebi.ac.uk/

Sequence Similarity & Analysis: BLAST, FASTA, specialized searches

http://www.ebi.ac.uk/Tools/sss/

5) Computational access

Computational access to UniProt

http://www.uniprot.org/

http://www.ebi.ac.uk/uniprot/
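
The slides give no details of the programmatic interface, so the following Python sketch is only one plausible way to script entry retrieval. The base URL and the `accession.format` URL pattern are assumptions that should be checked against the current UniProt programmatic-access documentation; the helper names are hypothetical.

```python
from urllib.request import urlopen

# Assumed URL pattern for the UniProt REST service; verify before relying on it.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession: str, fmt: str = "fasta") -> str:
    """Build the URL for one UniProtKB entry in the given format."""
    return f"{BASE_URL}/{accession}.{fmt}"


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def fetch_entry(accession: str) -> dict:
    """Download one entry as FASTA and parse it (requires network access)."""
    with urlopen(entry_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Building the URL and parsing work offline; fetch_entry("P00750") needs a connection.
print(entry_url("P00750"))
print(parse_fasta(">example toy record\nMRSDEC\nCCAMSC\n"))
```

Keeping URL construction and parsing separate from the network call makes the script testable without a connection.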


Acknowledgements

Rolf Apweiler

Ioannis Xenarios

Cathy H Wu

+100 annotators
