
Classifying the protein universe

Synapse-Associated Protein 97

Wu et al, 2002. EMBO J 19:5740-5751

Domain Analysis and Protein Families

Introduction What are protein families?

Motifs and Profiles The modular architecture of proteins

Domain Properties and Classification

Protein Families

Protein families are defined by homology
In a family, everyone is related to everyone
Everybody in a family shares a common ancestor

Protein family 1 Protein family 2

Homology versus Similarity

Homologous proteins share common ancestry and (usually) have similar 3D structures

[Figure: structures 1chg and 1sgt]

Superfamily: Trypsin-like Serine Proteases

1chg and 1sgt 31% identity, 43% similarity

We can infer homology from similarity!
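As a rough illustration of how figures such as "31% identity, 43% similarity" can be derived from a pairwise alignment, the sketch below counts identical and similar columns. The residue groups in `SIMILAR_GROUPS` are a hypothetical simplification (real tools usually rely on a substitution matrix), and using the non-gap columns as the denominator is only one of several conventions. It is not the procedure used for the numbers on the slide.

```python
# Hypothetical groups of physico-chemically similar residues (an assumption,
# not the scoring scheme behind the percentages quoted on the slide).
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(aln_a, aln_b):
    """Return (% identity, % similarity) over columns where neither row has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


# Toy alignment: 6 comparable columns, 3 identical, 3 more that are similar.
pid, psim = identity_and_similarity("ACDE-KL", "ACEEGRV")
```

Similarity is always at least as large as identity under this definition, which matches the pattern of the numbers quoted for the structure pairs.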


But homologous proteins may not share sequence similarity

[Figure: structures 1chg and 1sgc]

Superfamily: Trypsin-like Serine Proteases

1chg and 1sgc 15% identity, 25% similarity
We cannot infer similarity from homology


Similar sequences may not have structural similarity

[Figure: structures 1chg and 2baa]

1chg and 2baa 30% similarity, 140/245 aa
We cannot assume homology from similarity!


Summary
Sequences can be similar without being homologous
Sequences can be homologous without being similar

[Diagram: Evolution / Homology; BLAST / Similarity; Families ??]



Motifs and Profiles The modular architecture of proteins

Domain Properties and Classification

Technique to identify protein family

Search for profiles/motifs of biological significance that categorize a protein into a family

Pattern (motif) - a deterministic syntax that describes multiple combinations of possible residues within a protein string

Profile - a probabilistic generalization that assigns, to every position of a segment, the probability that each of the 20 amino acids will occur there

Intermediate sequence search - link many profile searches

Automated Motif Discovery

Given a set of sequences:

GIBBS Sampler http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein

MEME - motif-based sequence analysis tools http://meme.sdsc.edu/meme/

PRATT - tool to discover patterns that are conserved in a set of protein sequences http://kr.expasy.org/tools/pratt/ http://www.ebi.ac.uk/pratt (advanced tool)

TEIRESIAS - combinatorial output http://cbcsrv.watson.ibm.com/Tspd.html


Motif Description of a Protein Family

Regular expressions:

........C.............S...L..I..DRY..I.......................W...
Alternative residues at the variable positions: I (for L), E (for D), W (for Y), V (for I)

/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10,12} W /
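A motif written this way maps almost directly onto a regular expression in most programming languages. The sketch below translates the slide's notation into Python's `re` syntax, reading the final gap as "between 10 and 12 residues" (an assumption about the slide's notation). The helper name `slide_pattern_to_regex` and the toy sequences are made up for illustration.

```python
import re


def slide_pattern_to_regex(pattern):
    """Translate the slide's notation into a Python regex ('x' = any residue)."""
    tokens = pattern.strip("/ ").split()
    return "".join(token.replace("x", ".") for token in tokens)


SLIDE_PATTERN = "C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10,12} W"
MOTIF = re.compile(slide_pattern_to_regex(SLIDE_PATTERN))

# Toy sequence constructed to contain the motif (not a real protein).
toy_hit = (
    "MKTA" + "C" + "A" * 13 + "S" + "GGG" + "L" + "GG" + "I" + "GG"
    + "DRY" + "GG" + "I" + "G" * 11 + "W" + "KK"
)
# Same sequence with the conserved R replaced by K: the motif no longer matches.
toy_miss = toy_hit.replace("DRY", "DKY")

match = MOTIF.search(toy_hit)
no_match = MOTIF.search(toy_miss)
```

Because the pattern is deterministic, a single substitution at a fixed position is enough to lose the hit, which is the main weakness of motifs compared with profiles.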

Automated Profile Generation

Any multiple alignment is a profile!
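To make the statement concrete, here is a minimal sketch of turning a multiple alignment into a profile: per-column residue probabilities with a pseudocount, plus a log-odds score for an ungapped segment. The uniform background, the pseudocount value and the toy alignment are assumptions chosen for brevity, not the scheme of any particular tool.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities; gap characters are ignored."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def score(profile, segment, background=1.0 / len(AMINO_ACIDS)):
    """Log-odds score (bits) of an ungapped segment as long as the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment and profile must have the same length")
    return sum(math.log2(col[aa] / background) for col, aa in zip(profile, segment))


# Toy alignment (made up); the last row has a gap in column 4.
toy_alignment = ["DRYAI", "DRYGV", "ERWAI", "DRY-I"]
profile = build_profile(toy_alignment)
family_like = score(profile, "DRYAI")
unrelated = score(profile, "KKKKK")
```

Unlike the regular expression, the profile still gives a graded score to a sequence that deviates at one position.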

PSI-BLAST Algorithm:

1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …

Very sensitive method for database search
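The iteration can be illustrated with a deliberately simplified toy: ungapped, equal-length sequences, a pseudocount profile and a fixed score threshold. Real PSI-BLAST uses gapped BLAST searches and E-value thresholds, so everything below (function names, threshold, sequences) is an assumption made for illustration only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_from(seqs, pseudocount=1.0):
    """Column-wise residue probabilities for equal-length, ungapped sequences."""
    columns = []
    for column in zip(*seqs):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        columns.append(
            {aa: (column.count(aa) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return columns


def log_odds(profile, seq):
    """Score against a uniform background (1/20 per residue), in bits."""
    return sum(math.log2(col[aa] * len(AMINO_ACIDS)) for col, aa in zip(profile, seq))


def iterative_search(query, database, threshold, max_iterations=5):
    """Search, rebuild the profile from the hits, and repeat until stable."""
    included = [query]
    for _ in range(max_iterations):
        profile = profile_from(included)
        hits = [s for s in database if log_odds(profile, s) >= threshold]
        new_set = [query] + [s for s in hits if s != query]
        if set(new_set) == set(included):
            break  # converged: no change in the included set
        included = new_set
    return included


toy_db = ["DRYGV", "ERWGV", "KKKKK"]
single_pass = iterative_search("DRYAI", toy_db, threshold=2.5, max_iterations=1)
converged = iterative_search("DRYAI", toy_db, threshold=2.5)
```

In this toy, ERWGV shares only one residue with the query and is missed in the first pass, but it is picked up once DRYGV has been added to the profile. The same mechanism explains why a badly chosen inclusion threshold can let unrelated sequences drift into the profile.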

PSI-BLAST

Position Specific Iterative Blast
PSI-Blast profile models only positions in the query sequence

[Figure: Query, Profile1, Profile2, ... after n iterations; threshold for inclusion in profile]

HMMs

Hidden Markov Models are statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)

Using HMMs

You can use an HMM to create a model profile/PSSM (Position Specific Scoring Matrix)
To create one, you need to have a multiple alignment
The more sequences in the multiple alignment, the better the model created by the HMM will be
After creating the HMM model, you can search a database with it (e.g. PFAM)

HMM libraries

PFAM http://pfam.sanger.ac.uk

The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

Pfam-A entries are high quality, manually curated families.

Pfam-B entries are generated automatically.


GTG

Graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments
Eliminates noise
Enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues

GTG steps

1. Generate alignment trace graph
• Nodes = residues
• Edges = aligned in PSI-Blast library
• Unweighted

2. Edge weighting
• Using consistency

3. Clustering
• Driven by consistency
• Single site occupancy rule

4. Post-processing
• Generate non-redundant set of inter-cluster edges
• Identify sub-trees with conserved residues

Alignment trace graph

[Figure: residues of Protein 1 to Protein 5 drawn as vertices, with edges between aligned residues]

- Graph representation of input pairwise alignment data
- Vertices = residues
- Edges = aligned in a pairwise alignment from input library

Consistency = neighbour overlap

For an edge between residues i and j: Weight = intersection / union of the neighbour sets of i and j
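The intersection-over-union weight can be sketched as a Jaccard index on the neighbourhoods of the two residues joined by an edge. Whether GTG counts a vertex as part of its own neighbourhood is not stated on the slide; the sketch below assumes closed neighbourhoods (each vertex included) so that two residues with identical alignment partners get weight 1. The vertex labels are invented.

```python
from collections import defaultdict


def build_graph(edges):
    """Adjacency sets for an undirected alignment trace graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def consistency_weight(graph, i, j):
    """Jaccard overlap of the closed neighbourhoods of residues i and j."""
    n_i = graph[i] | {i}  # assumption: a vertex belongs to its own neighbourhood
    n_j = graph[j] | {j}
    return len(n_i & n_j) / len(n_i | n_j)


# Vertices are (protein, position) pairs; the edges stand for aligned residue
# pairs taken from a library of pairwise alignments (toy data).
toy_edges = [
    (("P1", 10), ("P2", 12)),
    (("P1", 10), ("P3", 9)),
    (("P2", 12), ("P3", 9)),
    (("P3", 9), ("P4", 30)),
]
toy_graph = build_graph(toy_edges)
w_consistent = consistency_weight(toy_graph, ("P1", 10), ("P2", 12))
w_weak = consistency_weight(toy_graph, ("P3", 9), ("P4", 30))
```

The edge inside the mutually aligned triangle gets the maximum weight, while the edge to the residue that is aligned with only one partner is down-weighted.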

GTG – global trace graph

Input: PSI-Blast all versus all alignments in NRDB40

Output: superalignment of all proteins

Applications:
- Pairwise alignment of query and target sequences
- Transitive sequence database searching (fast)
- Tracking conserved residues (feature space)

Edge weight = consistency (fraction of common neighbours)
Cluster ≈ hypothetical column of multiple alignment (single site occupancy)

[Figure labels: Cluster 1, Cluster 2]



Alignment trace graph

‘Motif tracking’

[Figure: graph with residue labels K, A, H and G and edges marked "consistency"]

Each vertex is labelled with source protein and position in sequence.
Motifs are subtrees enriched in one particular amino acid type.

Remote homolog detection based on GTG alignment score

Lindahl benchmark, superfamily level

[Bar chart: % correct (top-1 and top-5, axis 0 to 60) for *SPARKS-0, GTG-DEEP, *FOLDpro, *SP3, GTG-LOCAL, *Prospect II, *Fugue, Blastlink, SAM-T98, PSI-Blast, Ssearch and HMMer]

GTG clustering is informative; it detects as many remote homologs as threading methods

GTG summary

Super-families form elongated clusters in “protein space”
Profile models fluctuations around an equilibrium point
Consistency ~ path model

Exploits multiple profile models
Discriminative in database searching

Global trace graph data structure
Feature space for pattern discovery

http://ekhidna.biocenter.helsinki.fi/gtg/start


Relationships between families

Pfam clans
A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM

Superfamily http://supfam.cs.bris.ac.uk/SUPERFAMILY/hmm.html
The sequence search method uses a library (covering all proteins of known structure) consisting of 1776 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models.

Pfam-squared
Based on GTG comparisons of representative sequences from each PFAM-A family against all PFAM-A families.
Rules of thumb: motif score > 1000 means probably related, motif score > 500 means possibly related, score < 500 means dubious
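The rules of thumb amount to a simple threshold lookup, sketched below. The function name is hypothetical, and the slide does not say how a score of exactly 500 should be read; this sketch treats it as dubious.

```python
def interpret_motif_score(score):
    """Map a Pfam-squared motif score to the slide's rule-of-thumb label."""
    if score > 1000:
        return "probably related"
    if score > 500:
        return "possibly related"
    # The slide leaves a score of exactly 500 undefined; grouping it with
    # "dubious" is an assumption of this sketch.
    return "dubious"


labels = [interpret_motif_score(s) for s in (1500, 750, 500, 120)]
```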


http://scop.mrc-lmb.cam.ac.uk/scop

http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/representation.html

Benchmarking a motif/profile

You have a description of a protein family, and you do a database search…

Are all hits truly members of your protein family?

Benchmarking:

[Diagram labels: Dataset, unknown, family member, not a family member, Result]

TP: true positive
TN: true negative
FP: false positive
FN: false negative

Benchmarking a motif/profile

Precision / Selectivity
Precision = TP / (TP + FP)

Sensitivity / Recall
Sensitivity = TP / (TP + FN)

Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
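Given the hits returned by a motif or profile search and a curated list of true family members, both measures follow directly from the definitions. The sketch below assumes sequences are identified by plain string IDs; the IDs are invented.

```python
def precision_recall(predicted_hits, true_members):
    """Return (precision, recall) for a set of hits against known family members."""
    predicted, truth = set(predicted_hits), set(true_members)
    tp = len(predicted & truth)   # hits that really belong to the family
    fp = len(predicted - truth)   # hits that do not belong to the family
    fn = len(truth - predicted)   # family members the search missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall


# Toy benchmark: four hits, three true family members, two in common.
hits = {"seqA", "seqB", "seqC", "seqD"}
family = {"seqA", "seqB", "seqE"}
precision, recall = precision_recall(hits, family)
```

Reporting every sequence in the database as a hit drives recall to 1 and precision towards 0, while reporting a single certain hit does the opposite, which is why the two numbers are only meaningful together.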



Motifs and Profiles The modular architecture of proteins

Domain Properties and Classification

The Modular Architecture of Proteins

BLAST search of a multi-domain protein

Phosphoglycerate kinase Triosephosphate isomerase

What are domains?

Functional - from experiments:

example: Decay Accelerating Factor (DAF) or CD55

Has six domains (units):
4x Sushi domain (complement regulation)
1x ST-rich ‘stalk’
1x GPI anchor (membrane attachment)

PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

There is only so much we can conclude…

Classifying domains
- to aid structure prediction
- predict structural domains and molecular function of the domain

Classifying complete sequences
- predicting molecular function of proteins, large scale annotation

Majority of proteins are multi-domain proteins

What are domains?

Mobile module

[Figure labels: Protein 1, Protein 2, Protein 3, Protein 4]

Domains are...

Parts of protein sequences that can evolve, function, and exist independently of the rest of the protein chain

Each domain forms a compact three-dimensional structure and often can be independently stable and folded

Domains are...

...evolutionary building blocks:
Families of evolutionarily-related sequence segments
Domain assignment often coupled with classification

To be precise, we say: “protein family”, we mean: “protein domain family”

Example: global alignment

Phthalate dioxygenase reductase (PDR_BURCE)

Toluene - 4 -monooxygenase electron transfer component (TMOF_PSEME)

Global alignment fails! Only aligns largest domain.

Sometimes domain structures are quite complex

PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”

http://pfam.sanger.ac.uk/protein/PGBM_HUMAN
http://au.expasy.org/cgi-bin/prosite/ScanView.cgi?scanfile=530255511812.scan.gz
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html

45 domains of 7 different types, according to PROSITE




Properties of domains

Most domains are approximately 75 to 200 residues long

Properties of domains

Very short domains, less than 40 residues, are often stabilised by metal ions or disulfide bonds.

Larger domains, greater than 300 residues, are likely to consist of multiple hydrophobic cores

So, you have a sequence...

...look it up in existing database
INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/
PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST
GTG: http://ekhidna.biocenter.helsinki.fi/gtg/start

...search against existing family descriptions
PFAM: http://pfam.sanger.ac.uk/
SUPERFAMILY: http://supfam.org/SUPERFAMILY/



