Classifying the protein universe
Synapse-Associated Protein 97
Wu et al, 2002. EMBO J 19:5740-5751
Domain Analysis and Protein Families
- Introduction
- What are protein families?
- Motifs and Profiles
- The modular architecture of proteins
- Domain Properties and Classification
Protein Families
Protein families are defined by homology
In a family, everyone is related to everyone
Everybody in a family shares a common ancestor
Protein family 1 Protein family 2
Homology versus Similarity
Homologous proteins share common ancestry and (usually) have similar 3D structures
[Figure: 1chg and 1sgt]
Superfamily: Trypsin-like Serine Proteases
1chg and 1sgt: 31% identity, 43% similarity
We can infer homology from similarity!
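To make the two percentages concrete, here is a minimal sketch of how identity and similarity can be computed from a pairwise alignment. The residue groups in `SIMILAR_GROUPS` are a hypothetical simplification for illustration; real tools derive similarity from a substitution matrix, and the choice of denominator (here: gap-free columns) differs between programs, so the numbers will not reproduce the slide's values exactly.

```python
# Hypothetical groups of chemically similar residues (an assumption for
# illustration; real tools score similarity with a substitution matrix).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def percent_identity_similarity(a, b):
    """Return (% identity, % similarity) over the gap-free aligned columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = columns = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns that contain a gap
        columns += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in g and y in g for g in SIMILAR_GROUPS):
            similar += 1
    if columns == 0:
        return 0.0, 0.0
    return 100.0 * identical / columns, 100.0 * similar / columns
```

Similarity is always at least as large as identity, because every identical column is also counted as similar.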
Homology versus Similarity
But homologous proteins may not share sequence similarity
[Figure: 1chg and 1sgc]
Superfamily: Trypsin-like Serine Proteases
1chg and 1sgc: 15% identity, 25% similarity
We cannot infer similarity from homology
Homology versus Similarity
Similar sequences may not have structural similarity
[Figure: 1chg and 2baa]
1chg and 2baa: 30% similarity, 140/245 aa
We cannot assume homology from similarity!
Homology versus Similarity
Summary
Sequences can be similar without being homologous
Sequences can be homologous without being similar
Evolution / Homology
BLAST / Similarity
Families ??
Technique to identify protein family
Search for profiles/motifs of biological significance that categorize a protein into a family
Pattern (motif) - a deterministic syntax that describes multiple combinations of possible residues within a protein string
Profile - a probabilistic generalization that assigns, to every position of a segment, the probability that each of the 20 amino acids will occur there
Intermediate sequence search - link many profile searches
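The idea of linking searches through intermediate sequences can be sketched as a breadth-first walk over search results. This is only an illustration under stated assumptions: `search` is a hypothetical callable standing in for one profile or BLAST search, and `max_rounds` is an invented parameter that limits how many intermediates may be chained.

```python
from collections import deque


def transitive_hits(query, search, max_rounds=2):
    """Collect sequences reachable from `query` by chaining searches.

    `search` is a hypothetical callable that returns the ids a single
    search reports as significant for a given sequence id.
    Returns {id: number of search steps from the query}.
    """
    found = {query: 0}
    queue = deque([query])
    while queue:
        current = queue.popleft()
        if found[current] >= max_rounds:
            continue  # do not search onward from this intermediate
        for hit in search(current):
            if hit not in found:
                found[hit] = found[current] + 1
                queue.append(hit)
    del found[query]
    return found


# Toy hit lists: A finds B, B finds C, C finds D.
toy_hits = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
reachable = transitive_hits("A", lambda seq_id: toy_hits.get(seq_id, []))
```

With the default of two rounds, `C` is reached through the intermediate `B` even though the query never hits it directly, while `D` stays out of reach. Each extra round can also pull in false positives, which is why the number of rounds is limited here.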
Automated Motif Discovery
Given a set of sequences:
GIBBS Sampler http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
MEME - motif-based sequence analysis tools http://meme.sdsc.edu/meme/
PRATT - tool to discover patterns that are conserved in a set of protein sequences http://kr.expasy.org/tools/pratt/ http://www.ebi.ac.uk/pratt (advanced tool)
TEIRESIAS http://cbcsrv.watson.ibm.com/Tspd.html (combinatorial output)
Motif Description of a Protein Family
Regular expressions:
........C.............S...L..I..DRY..I.......................W...
Alternative residues at the L, D, Y and second I positions: I, E, W, V
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10,12} W /
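The slide's pattern can be tried out directly as a Python regular expression. This is a minimal sketch, assuming that `x{n}` means n arbitrary residues and that the final spacer means 10 to 12 arbitrary residues. The function name `find_motif` and the test sequence are made up for illustration; the sequence is synthetic, not a real protein.

```python
import re

# Slide notation -> Python regex: x{n} becomes .{n}; the 10-to-12 residue
# spacer is written .{10,12}.
MOTIF = re.compile(r"C.{13}S.{3}[LI].{2}I.{2}[DE]R[YW].{2}[IV].{10,12}W")


def find_motif(sequence):
    """Return (start, end) of the first motif match, or None if absent."""
    m = MOTIF.search(sequence.upper())
    return (m.start(), m.end()) if m else None


# Synthetic sequence constructed to satisfy the pattern (not a real protein).
example = ("MK" + "C" + "A" * 13 + "S" + "A" * 3 + "L" + "AA" + "I" + "AA"
           + "DRY" + "AA" + "V" + "A" * 11 + "W" + "GG")

print(find_motif(example))                         # a (start, end) tuple
print(find_motif(example.replace("DRY", "DKY")))   # None: R is required
```

Because a pattern is deterministic, a single substitution at a fixed position (R to K above) removes the match completely. This is the main practical difference from a profile, which would only lower the score.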
Automated Profile Generation
Any multiple alignment is a profile!
PSI-BLAST Algorithm:
1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …
Very sensitive method for database search
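The statement that any multiple alignment is a profile can be illustrated by turning alignment columns into a position-specific scoring matrix. This is a simplified sketch, not PSI-BLAST's actual procedure: it assumes a uniform background frequency of 1/20 and a fixed pseudocount, whereas PSI-BLAST additionally weights sequences and uses data-dependent pseudocounts. The names `build_pssm` and `score` and the toy alignment are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes equal-length rows, a uniform background (1/20) and a fixed
    pseudocount per residue; gap characters are ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the per-column scores of a segment as long as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


toy_alignment = ["DRYAI", "ERYAV", "DRWSI", "DRY-I"]
profile = build_pssm(toy_alignment)
```

In this toy profile a fully conserved column (the R at position 2) gives a positive score for R and a negative score for every other residue, so `score(profile, "DRYAI")` is higher than the score of an unrelated segment such as `"KKKKK"`.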
PSI-BLAST
Position Specific Iterative Blast
PSI-Blast profile models only positions in the query sequence
[Figure: Query, Profile1, Profile2, ... after n iterations; each round applies a threshold for inclusion in the profile]
HMMs
Hidden Markov Models are statistical methods that consider all possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
Using HMMs
You can use an HMM to create a model profile/PSSM (Position Specific Scoring Matrix)
To create one, you need a multiple alignment
The more sequences in the multiple alignment, the better the model created by the HMM will be
After creating the HMM model, you can search a database with it (e.g. PFAM)
HMM libraries
PFAM http://pfam.sanger.ac.uk
The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
Pfam-A entries are high quality, manually curated families.
Pfam-B entries are generated automatically.
GTG
Graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments
Eliminates noise
Enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues
GTG steps
1. Generate alignment trace graph
   • Nodes = residues
   • Edges = aligned in PSI-Blast library
   • Unweighted
2. Edge weighting
   • Using consistency
3. Clustering
   • Driven by consistency
   • Single site occupancy rule
4. Post-processing
   • Generate non-redundant set of inter-cluster edges
   • Identify sub-trees with conserved residues
Alignment trace graph
[Figure: Protein 1 to Protein 5 drawn as rows of residues]
- Graph representation of input pairwise alignment data
- Vertices = residues
- Edges = aligned in a pairwise alignment from input library
Consistency = neighbour overlap
Weight of the edge between residues i and j = intersection / union of their neighbour sets
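The intersection-over-union weight can be sketched on a toy trace graph. This is an illustration rather than the GTG implementation: the vertex labels of the form `"P1:10"` (protein and position) are hypothetical, and counting each residue as its own neighbour is an assumption made here so that an isolated aligned pair gets weight 1.0 instead of 0.

```python
def consistency_weights(edges):
    """Weight each edge (i, j) by |N(i) & N(j)| / |N(i) | N(j)|.

    Assumption: a residue counts as its own neighbour, so an isolated
    aligned pair gets weight 1.0 rather than 0.
    """
    neighbours = {}
    for i, j in edges:
        neighbours.setdefault(i, {i}).add(j)
        neighbours.setdefault(j, {j}).add(i)
    return {
        (i, j): len(neighbours[i] & neighbours[j])
        / len(neighbours[i] | neighbours[j])
        for i, j in edges
    }


# Hypothetical residues labelled "protein:position".
toy_edges = [
    ("P1:10", "P2:12"),
    ("P2:12", "P3:9"),
    ("P1:10", "P3:9"),
    ("P3:9", "P4:30"),
]
weights = consistency_weights(toy_edges)
```

The three mutually aligned residues of P1, P2 and P3 support each other and receive high weights, while the edge to P4, which no other alignment confirms, receives the lowest weight.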
GTG – global trace graph
Input: PSI-Blast all versus all alignments in NRDB40
Output: superalignment of all proteins
Applications
- Pairwise alignment of query and target sequences
- Transitive sequence database searching (fast)
- Tracking conserved residues (feature space)
Edge weight = consistency (fraction of common neighbours)
Cluster ≈ hypothetical column of multiple alignment (single site occupancy)
[Figure: Protein 1 to Protein 5 with aligned residues grouped into Cluster 1 and Cluster 2]
Alignment trace graph
‘Motif tracking’
[Figure: trace graph with vertices labelled by residue type (K, A, H, G) and edges labelled by consistency]
Each vertex is labelled with source protein and position in sequence.
Motifs are subtrees enriched in one particular amino acid type.
Remote homolog detection based on GTG alignment score
Lindahl benchmark, superfamily level
[Figure: chart of % correct (top-1 and top-5, axis 0 to 60) for *SPARKS-0, GTG-DEEP, *FOLDpro, *SP3, GTG-LOCAL, *Prospect II, *Fugue, Blastlink, SAM-T98, PSI-Blast, Ssearch, HMMer]
GTG clustering is informative: it detects as many remote homologs as threading methods
GTG summary
Super-families form elongated clusters in “protein space”
Profile models fluctuations around an equilibrium point
Consistency ~ path model
Exploits multiple profile models
Discriminative in database searching
Global trace graph data structure
Feature space for pattern discovery
http://ekhidna.biocenter.helsinki.fi/gtg/start
Relationships between families
Pfam clans
- A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM
Superfamily http://supfam.cs.bris.ac.uk/SUPERFAMILY/hmm.html
- The sequence search method uses a library (covering all proteins of known structure) consisting of 1776 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models.
Pfam-squared
- Based on GTG comparisons of representative sequences from each PFAM-A family against all PFAM-A families.
- Rules of thumb: motif score >1000 means probably related, motif score >500 means possibly related, score <500 means dubious
Benchmarking a motif/profile
You have a description of a protein family, and you do a database search…
Are all hits truly members of your protein family?
Benchmarking:
[Figure: dataset (unknown) divided into family member / not a family member, overlaid with the search result]
TP: true positive
TN: true negative
FP: false positive
FN: false negative
Benchmarking a motif/profile
Precision / Selectivity
Precision = TP / (TP + FP)
Sensitivity / Recall
Sensitivity = TP / (TP + FN)
Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
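As a worked example of the two formulas, the sketch below computes precision and recall from a set of search hits and a set of known family members. The function name and the sequence ids are invented for illustration, and returning 0.0 when a denominator would be zero is a convention chosen here.

```python
def precision_recall(hits, family_members):
    """Return (precision, recall) for search hits against known members."""
    hits, family_members = set(hits), set(family_members)
    tp = len(hits & family_members)   # hits that are true family members
    fp = len(hits - family_members)   # hits that are not family members
    fn = len(family_members - hits)   # family members the search missed
    precision = tp / (tp + fp) if hits else 0.0
    recall = tp / (tp + fn) if family_members else 0.0
    return precision, recall


# Hypothetical ids: 4 hits, 3 of them correct, out of 6 true members.
hits = {"seq_a", "seq_b", "seq_c", "seq_x"}
members = {"seq_a", "seq_b", "seq_c", "seq_d", "seq_e", "seq_f"}
print(precision_recall(hits, members))
```

Here TP = 3, FP = 1 and FN = 3, so precision is 3/4 and recall is 3/6. Reporting only the single best hit would push precision towards 1 and recall towards 0, which is the first "easy but useless" case.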
The Modular Architecture of Proteins
BLAST search of a multi-domain protein
Phosphoglycerate kinase Triosephosphate isomerase
What are domains?
Functional - from experiments:
example: Decay Accelerating Factor (DAF) or CD55
Has six domains (units):
- 4x Sushi domain (complement regulation)
- 1x ST-rich ‘stalk’
- 1x GPI anchor (membrane attachment)
PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696
There is only so much we can conclude…
Classifying domains
- to aid structure prediction
- predict structural domains and molecular function of the domain
Classifying complete sequences
- predicting molecular function of proteins, large scale annotation
Majority of proteins are multi-domain proteins
What are domains?
Mobile module
[Figure: a mobile module shown in Protein 1, Protein 2, Protein 3 and Protein 4]
Domains are...
Parts of protein sequences that can evolve, function, and exist independently of the rest of the protein chain
Each domain forms a compact three-dimensional structure and often can be independently stable and folded
Domains are...
...evolutionary building blocks:
Families of evolutionarily-related sequence segments
Domain assignment often coupled with classification
To be precise, we say: “protein family”; we mean: “protein domain family”
Example: global alignment
Phthalate dioxygenase reductase (PDR_BURCE)
Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)
Global alignment fails! Only aligns largest domain.
Sometimes domain structures are quite complex
PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”
http://pfam.sanger.ac.uk/protein/PGBM_HUMAN
http://au.expasy.org/cgi-bin/prosite/ScanView.cgi?scanfile=530255511812.scan.gz
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
45 domains of 7 different types, according to PROSITE
Properties of domains
Most domains are approximately 75 – 200 residues long
Properties of domains
Very short domains, less than 40 residues, are often stabilised by metal ions or disulfide bonds.
Larger domains, greater than 300 residues, are likely to consist of multiple hydrophobic cores.
So, you have a sequence...
...look it up in an existing database
INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/
PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST
GTG: http://ekhidna.biocenter.helsinki.fi/gtg/start
...search against existing family descriptions
PFAM: http://pfam.sanger.ac.uk/
SUPERFAMILY: http://supfam.org/SUPERFAMILY/