Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751

Upload: claude-smith

Post on 17-Dec-2015


Page 1: Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751

Classifying the protein universe

Synapse-Associated Protein 97

Wu et al, 2002. EMBO J 19:5740-5751

Page 2

Domain Analysis and Protein Families

Introduction
What are protein families?
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification

Page 3

Protein Families

Protein families are defined by homology
In a family, everyone is related to everyone
Everybody in a family shares a common ancestor

[Figure: Protein family 1 and Protein family 2]

Page 4

Homology versus Similarity

Homologous proteins share common ancestry and (usually) have similar 3D structures

[Figure: structures of 1chg and 1sgt]

Superfamily: Trypsin-like Serine Proteases

1chg and 1sgt: 31% identity, 43% similarity

We can infer homology from similarity!
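Identity and similarity percentages like those quoted above are obtained by counting columns of a pairwise alignment. The following is a minimal sketch, assuming the two sequences are already aligned with "-" for gaps, that percentages are taken over gap-free columns, and that "similar" means the two residues fall in the same residue group. The groups and function names are illustrative assumptions, not the scoring scheme behind the numbers on the slide.

```python
# Illustrative residue groups (an assumption, not a standard substitution matrix).
SIMILAR_GROUPS = ["AVLIM", "FYW", "KRH", "DE", "STNQ"]
GROUP_OF = {aa: idx for idx, group in enumerate(SIMILAR_GROUPS) for aa in group}


def aligned_pairs(a, b):
    """Return residue pairs from the gap-free columns of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]


def percent_identity(a, b):
    """Percentage of gap-free columns holding identical residues."""
    pairs = aligned_pairs(a, b)
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def percent_similarity(a, b):
    """Percentage of gap-free columns holding identical or same-group residues."""
    pairs = aligned_pairs(a, b)
    if not pairs:
        return 0.0
    similar = sum(
        x == y or (x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y))
        for x, y in pairs
    )
    return 100.0 * similar / len(pairs)
```

Similarity is always at least as high as identity under this definition, because every identical column also counts as similar.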

Page 5

Homology versus Similarity

But homologous proteins may not share sequence similarity

[Figure: structures of 1chg and 1sgc]

Superfamily: Trypsin-like Serine Proteases

1chg and 1sgc: 15% identity, 25% similarity

We cannot infer similarity from homology

Page 6

Homology versus Similarity

Similar sequences may not have structural similarity

[Figure: structures of 1chg and 2baa]

1chg and 2baa: 30% similarity, 140/245 aa

We cannot assume homology from similarity!

Page 7

Homology versus Similarity

Summary
Sequences can be similar without being homologous
Sequences can be homologous without being similar

[Diagram: Evolution / Homology, BLAST / Similarity, Families ??]

Page 8

Domain Analysis and Protein Families

Introduction
What are protein families?
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification

Page 9

Techniques to identify a protein family

Search for profiles/motifs of biological significance that categorize a protein into a family

Pattern (motif): a deterministic syntax that describes multiple combinations of possible residues within a protein string

Profile: a probabilistic generalization that assigns, to every position of a segment, the probability that each of the 20 amino acids will occur there

Intermediate sequence search: links many profile searches
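To make the profile definition concrete, here is a minimal sketch that turns a gapless multiple alignment into per-position residue probabilities. The function name `build_profile`, the uniform pseudocount, and the toy alignment are assumptions made for illustration; real profile tools use more elaborate sequence weighting and priors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: probability} dict per alignment column.

    Assumes all sequences have equal length and contain no gaps.
    The pseudocount keeps unseen residues from getting probability zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = len(alignment) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment: column 2 is always R, columns 1 and 3 vary.
profile = build_profile(["DRY", "DRW", "ERY"])
```

Unlike a pattern, which either matches or does not, the profile gives every residue a graded probability at every position, so a sequence with an unusual residue can still score well overall.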

Page 10

Automated Motif Discovery

Given a set of sequences:

GIBBS Sampler: http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein

MEME, motif-based sequence analysis tools: http://meme.sdsc.edu/meme/

PRATT, a tool to discover patterns that are conserved in a set of protein sequences: http://kr.expasy.org/tools/pratt/ and http://www.ebi.ac.uk/pratt (advanced tool)

TEIRESIAS (combinatorial output): http://cbcsrv.watson.ibm.com/Tspd.html

Page 11

Motif Description of a Protein Family

Regular expressions:

........C.............S...L..I..DRY..I.......................W...
                          I     E W  V

/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
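The pattern above maps directly onto a standard regular expression. This is a minimal sketch using Python's `re` module, assuming that `x` stands for any residue and that `x{10} – x{12}` means a run of 10 to 12 arbitrary residues. The function name `find_motif` is hypothetical.

```python
import re

# x -> "." (any residue); x{10} – x{12} -> ".{10,12}" (assumed interpretation).
MOTIF = re.compile(r"C.{13}S.{3}[LI].{2}I.{2}[DE]R[YW].{2}[IV].{10,12}W")


def find_motif(sequence):
    """Return (start, matched_text) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]
```

Because `finditer` reports only non-overlapping matches, overlapping occurrences of the motif would need a lookahead-based search instead.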

Page 12

Automated Profile Generation

Any multiple alignment is a profile!

PSI-BLAST Algorithm:

1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …

Very sensitive method for database search
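The iteration can be illustrated with a toy version of the loop. This is only a sketch of the idea, not PSI-BLAST itself: it assumes gapless sequences of equal length, a uniform background distribution, a simple log-odds score, and a fixed inclusion threshold, and every name in it is hypothetical. Real PSI-BLAST uses gapped local alignments and E-value thresholds.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_pssm(members, pseudocount=1.0):
    """Log-odds score for each residue at each position of the member set."""
    pssm = []
    for col in range(len(members[0])):
        counts = Counter(seq[col] for seq in members)
        total = len(members) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum of position-specific scores for a sequence of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


def iterative_search(query, database, threshold, max_iterations=5):
    """Search, rebuild the profile from the hits, and repeat until stable."""
    members = [query]
    for _ in range(max_iterations):
        pssm = build_pssm(members)
        hits = [s for s in database if score(pssm, s) >= threshold]
        new_members = [query] + [s for s in hits if s != query]
        if set(new_members) == set(members):
            break  # converged: no change in the included set
        members = new_members
    return members


database = ["DRYAI", "ERYAI", "ERWGI", "KKKKK"]
first_pass = iterative_search("DRYAV", database, threshold=3.0, max_iterations=1)
converged = iterative_search("DRYAV", database, threshold=3.0)
```

In this toy database a single pass recovers only the closest neighbour, while repeated rounds pull in progressively more distant sequences as the profile broadens. The same broadening is why the inclusion threshold matters: one false inclusion can pull the profile away from the family.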

Page 13

PSI-BLAST

Position-Specific Iterative BLAST

The PSI-BLAST profile models only positions in the query sequence

[Figure: Query, Profile1, Profile2, ... after n iterations, with a threshold for inclusion in the profile]

Page 14

HMMs

Hidden Markov Models are statistical methods that consider all possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)

Page 15

Using HMMs

You can use an HMM to create a model profile/PSSM (Position-Specific Scoring Matrix)
To create one, you need a multiple alignment
The more sequences in the multiple alignment, the better the model created by the HMM will be
After creating the HMM, you can search a database with it (e.g. PFAM)

Page 16

HMM libraries

PFAM: http://pfam.sanger.ac.uk

The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

Pfam-A entries are high quality, manually curated families.

Pfam-B entries are generated automatically.

Page 17

GTG

Graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments
Eliminates noise
Enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues

Page 18

GTG steps

1. Generate alignment trace graph
• Nodes = residues
• Edges = aligned in PSI-Blast library
• Unweighted

2. Edge weighting
• Using consistency

3. Clustering
• Driven by consistency
• Single site occupancy rule

4. Post-processing
• Generate non-redundant set of inter-cluster edges
• Identify sub-trees with conserved residues

Page 19

Alignment trace graph

[Figure: residues of Protein 1 to Protein 5 drawn as rows of vertices]

- Graph representation of input pairwise alignment data
- Vertices = residues
- Edges = aligned in a pairwise alignment from input library

Page 20

Consistency = neighbour overlap

[Figure: vertices i and j and their neighbours]

Weight = intersection / union
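The intersection-over-union weight can be sketched as follows. This is an illustration under stated assumptions rather than the GTG implementation: residues are modelled as hypothetical (protein, position) tuples, the graph is a plain adjacency dictionary, and each vertex is counted as its own neighbour so that two residues aligned only to each other get weight 1.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from pairs of aligned residues."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def consistency_weight(graph, i, j):
    """Weight of edge (i, j) = |N(i) & N(j)| / |N(i) | N(j)|.

    Assumption: neighbourhoods are closed, i.e. they include the vertex itself.
    """
    neighbours_i = graph[i] | {i}
    neighbours_j = graph[j] | {j}
    return len(neighbours_i & neighbours_j) / len(neighbours_i | neighbours_j)


# Toy trace graph: a, b and c are mutually aligned; d is aligned only to c.
a, b, c, d = ("P1", 10), ("P2", 12), ("P3", 9), ("P4", 40)
graph = build_graph([(a, b), (a, c), (b, c), (c, d)])
```

In the toy graph the mutually consistent triangle gets high weights, while the edge to the outlier `d` is down-weighted because `d` shares few neighbours with `c`.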

Page 21

GTG – global trace graph

Input: PSI-Blast all versus all alignments in NRDB40

Output: superalignment of all proteins

Applications
Pairwise alignment of query and target sequences
Transitive sequence database searching (fast)
Tracking conserved residues (feature space)

Page 22

Edge weight = consistency (fraction of common neighbours)
Cluster ≈ hypothetical column of multiple alignment (single site occupancy)

[Figure: alignment trace graph of Protein 1 to Protein 5, with residues grouped into Cluster 1 and Cluster 2]

Page 23

‘Motif tracking’

[Figure: tree of vertices labelled K, A, H and G, with edges labelled "consistency"]

Each vertex is labelled with source protein and position in sequence.
Motifs are subtrees enriched in one particular amino acid type.

Page 24

Remote homolog detection based on GTG alignment score

Lindahl benchmark, superfamily level

[Bar chart: % correct (top-1 and top-5, axis 0 to 60) for *SPARKS-0, GTG-DEEP, *FOLDpro, *SP3, GTG-LOCAL, *Prospect II, *Fugue, Blastlink, SAM-T98, PSI-Blast, Ssearch and HMMer]

GTG clustering is informative; it detects as many remote homologs as threading methods

Page 25

GTG summary

Super-families form elongated clusters in “protein space”
Profile models fluctuations around an equilibrium point
Consistency ~ path model

Exploits multiple profile models
Discriminative in database searching

Global trace graph data structure
Feature space for pattern discovery

http://ekhidna.biocenter.helsinki.fi/gtg/start

Page 26

Relationships between families

Pfam clans
A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM

Superfamily
http://supfam.cs.bris.ac.uk/SUPERFAMILY/hmm.html
The sequence search method uses a library (covering all proteins of known structure) consisting of 1776 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models.

Pfam-squared
Based on GTG comparisons of representative sequences from each PFAM-A family against all PFAM-A families.
Rules of thumb: motif score >1000 means probably related, motif score >500 means possibly related, score <500 means dubious
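The rules of thumb can be written as a small lookup. This is a sketch only: the function name is hypothetical, and a score of exactly 500, which the rules leave undefined, is treated here as dubious by assumption.

```python
def classify_motif_score(score):
    """Map a Pfam-squared motif score to the rule-of-thumb label."""
    if score > 1000:
        return "probably related"
    if score > 500:
        return "possibly related"
    # Assumption: the boundary value 500 falls in the lowest category.
    return "dubious"
```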

Page 27

Benchmarking a motif/profile

You have a description of a protein family, and you do a database search…

Are all hits truly members of your protein family?

Benchmarking:

[Diagram: Dataset (unknown, family member, not a family member) compared against Result]

TP: true positive
TN: true negative
FP: false positive
FN: false negative

Page 28

Benchmarking a motif/profile

Precision / Selectivity
Precision = TP / (TP + FP)

Sensitivity / Recall
Sensitivity = TP / (TP + FN)

Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
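A worked example of the two formulas, as a minimal sketch. It assumes the search result and the true family membership are available as sets of sequence identifiers; the identifiers and function names are made up for illustration.

```python
def confusion_counts(hits, family, dataset):
    """Return (TP, FP, FN, TN) for a search result against known membership."""
    hits, family, dataset = set(hits), set(family), set(dataset)
    tp = len(hits & family)            # reported and truly in the family
    fp = len(hits - family)            # reported but not in the family
    fn = len(family - hits)            # in the family but missed
    tn = len(dataset - hits - family)  # correctly left out
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


dataset = {f"p{i}" for i in range(1, 11)}  # ten hypothetical sequences
family = {"p1", "p2", "p3", "p4"}          # true family members
hits = {"p1", "p2", "p5"}                  # what the motif/profile reported

tp, fp, fn, tn = confusion_counts(hits, family, dataset)
```

Here two of the three hits are correct (precision 2/3), but only two of the four family members were found (recall 1/2), which shows why both numbers need to be reported together.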

Page 29

Domain Analysis and Protein Families

Introduction
What are protein families?
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification

Page 30

The Modular Architecture of Proteins

BLAST search of a multi-domain protein

[Figure labels: Phosphoglycerate kinase, Triosephosphate isomerase]

Page 31

What are domains?

Functional - from experiments:

example: Decay Accelerating Factor (DAF) or CD55

Has six domains (units):
4x Sushi domain (complement regulation)
1x ST-rich ‘stalk’
1x GPI anchor (membrane attachment)

PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

Page 32

There is only so much we can conclude…

Classifying domains to aid structure prediction
predict structural domains and molecular function of the domain

Classifying complete sequences
predicting molecular function of proteins, large scale annotation

The majority of proteins are multi-domain proteins

Page 33

What are domains?

Mobile module

[Figure: Protein 1, Protein 2, Protein 3, Protein 4]

Page 34

Domains are...

Parts of protein sequences that can evolve, function, and exist independently of the rest of the protein chain

Each domain forms a compact three-dimensional structure and often can be independently stable and folded

Page 35

Domains are...

...evolutionary building blocks:
Families of evolutionarily-related sequence segments
Domain assignment often coupled with classification

To be precise,
we say: “protein family”
we mean: “protein domain family”

Page 36

Example: global alignment

Phthalate dioxygenase reductase (PDR_BURCE)

Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)

Global alignment fails!
Only aligns largest domain.
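Why an end-to-end alignment struggles when a shared domain sits at different positions in two proteins can be seen on a toy example. This sketch compares a Needleman-Wunsch style global score with a Smith-Waterman style local score; both techniques are brought in here for illustration and are not named on the slide. The linear gap penalty, the scoring values, and the toy sequences are all assumptions.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best end-to-end alignment score (Needleman-Wunsch, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best-scoring local segment pair (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# Toy proteins sharing one "domain" at different positions.
shared_domain = "ACDEFGHIK"
protein_a = "WWWW" + shared_domain
protein_b = shared_domain + "PPPP"

global_score = global_alignment_score(protein_a, protein_b)
local_score = local_alignment_score(protein_a, protein_b)
```

The local score recovers the full shared domain, while the global score is dragged down by the penalties for forcing the unrelated flanking regions into the alignment. This is one reason domain-level comparison is preferred for multi-domain proteins.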

Page 37

Sometimes domain structures are quite complex

PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”

http://pfam.sanger.ac.uk/protein/PGBM_HUMAN
http://au.expasy.org/cgi-bin/prosite/ScanView.cgi?scanfile=530255511812.scan.gz
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html

45 domains of 7 different types, according to PROSITE

Page 38

Properties of domains

Most domains are approximately 75 – 200 residues long

Page 39

Properties of domains

Very short domains, less than 40 residues, are often stabilised by metal ions or disulfide bonds.

Larger domains, greater than 300 residues, are likely to consist of multiple hydrophobic cores.

Page 40

So, you have a sequence...

...look it up in an existing database
INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/
PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST
GTG: http://ekhidna.biocenter.helsinki.fi/gtg/start

...search against existing family descriptions
PFAM: http://pfam.sanger.ac.uk/
SUPERFAMILY: http://supfam.org/SUPERFAMILY/