Identification of Protein Domains
Eden Dror, Menachem Schechter. Computational Biology Seminar 2004.
![Page 1: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/1.jpg)
Identification of Protein Domains
Eden Dror, Menachem Schechter
Computational Biology Seminar 2004
![Page 2: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/2.jpg)
Overview
• Introduction to protein domains.
  – Classification of homologs.
• Representing a domain.
  – PSSM
  – HMM
• Internet resources
  – Pfam
  – SMART
  – PROSITE
  – InterPro
• Research example.
![Page 3: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/3.jpg)
Protein domains
• A discrete portion of a protein assumed to fold independently, and possessing its own function.
• Mobile domain (“module”): a domain that can be found associated with different domain combinations in different proteins.
![Page 4: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/4.jpg)
Protein domains
• The assumption: The domain is the fundamental unit of protein structure and function.
• Protein family – all proteins containing a specific domain.
![Page 5: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/5.jpg)
What can we learn from them?
• The common ancestry and homology relationships of a set of proteins.
• Homology lets us infer properties of a protein, such as function & localization.
• Therefore, domains can be used to assign a new protein to a family, and from that to infer its function.
![Page 6: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/6.jpg)
Classification of homologs
• Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes.
• Homologous genes can arise in two major ways:
  – Gene duplication (in the same species).
  – Speciation (splitting of one species into two).
![Page 7: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/7.jpg)
Classification of homologs
![Page 8: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/8.jpg)
Classification of homologs
• Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species.
• Paralogs – Two genes that derive from a single gene that was duplicated within a genome.
![Page 9: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/9.jpg)
Classification of homologs
[Figure: gene tree with gene pairs labeled as paralogs (para) and orthologs (ortho).]
![Page 10: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/10.jpg)
Classification of homologs
• Inparalogs - paralogs that evolved by gene duplication after the speciation event.
• Outparalogs - paralogs that evolved by gene duplication before the speciation event.
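The definitions of orthologs, inparalogs and outparalogs can be sketched as a small decision rule. This is only an illustration: the function name `classify_homologs`, its arguments, and the convention that ages are measured as time before present (larger means older) are assumptions, not part of the slides.

```python
def classify_homologs(divergence_event, divergence_age, speciation_age):
    """Classify a pair of homologous genes.

    divergence_event: "speciation" or "duplication", the event at which
                      the two genes diverged from their common ancestor.
    divergence_age:   age of that event (time before present).
    speciation_age:   age of the speciation that separated the two
                      species being compared (same units).
    """
    if divergence_event == "speciation":
        return "orthologs"
    if divergence_event != "duplication":
        raise ValueError("divergence_event must be 'speciation' or 'duplication'")
    # A duplication younger than the speciation happened after it.
    if divergence_age < speciation_age:
        return "inparalogs"
    return "outparalogs"


# Hypothetical ages, chosen only to exercise each branch.
print(classify_homologs("speciation", 100, 100))
print(classify_homologs("duplication", 40, 100))
print(classify_homologs("duplication", 300, 100))
```

Note that the in/out distinction is always relative to a particular speciation event, so the same pair of paralogs can be inparalogs in one species comparison and outparalogs in another.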
![Page 11: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/11.jpg)
Classification of homologs
[Figure: gene tree with out-paralogs (out-para) and in-paralogs (in-para) labeled, as classified when comparing human with worm.]
![Page 12: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/12.jpg)
What can we learn from them?
• Orthologous proteins are evolutionary, and typically functional, counterparts in different species.
• Paralogous proteins are important for detecting lineage-specific adaptations.
• Both of them can reveal information on a specific species or a set of species.
![Page 13: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/13.jpg)
Protein domains – summary
• By identifying domains we can:
– Infer the function & localization of a protein.
– Learn about a specific species.
– Learn about a set of species as a group.
![Page 14: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/14.jpg)
Domain representation
• Different methods to represent (model) domains:
  • Patterns (regular expressions).
  • PSSM (Position specific score matrix).
  • HMM (Hidden Markov model).
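To make the first representation concrete, here is a minimal sketch that turns a pattern written in PROSITE-style syntax into a Python regular expression and searches a sequence with it. The helper name `prosite_to_regex`, the example pattern (written to resemble a C2H2 zinc-finger signature; the exact database entry should be checked) and the toy sequence are all assumptions made for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string.

    Handles: '-' separators, x (any residue), [..] allowed residues,
    {..} forbidden residues, (n) / (n,m) repeats, '<' and '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x -> any residue
        element = element.replace("<", "^").replace(">", "$")    # sequence ends
        parts.append(element)
    return "".join(parts)


# Illustrative pattern and toy sequence (both hypothetical).
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)
sequence = "GG" + "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"

match = re.search(regex, sequence)
if match:
    print(regex, match.start(), match.end())
```

A pattern either matches or it does not; it cannot express that one residue is merely more likely than another at a position, which is what the PSSM and HMM representations add.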
![Page 15: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/15.jpg)
PSSM
• Position specific score matrix
• A matrix giving the score for observing each amino acid at each position of the modeled sequence region.
• Based on the independent probabilities P(a|i) of observing amino acid a in position i.
![Page 16: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/16.jpg)
PSSM: Example
![Page 17: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/17.jpg)
PSSM: Identifying a domain
• Given a sequence and a PSSM:
  • Run over all positions.
  • Score each sub-sequence according to the matrix.
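The two steps above (build a matrix from position-wise probabilities, then slide it along a sequence) can be sketched as follows. This is a simplified illustration, not the procedure of any particular tool: the toy alignment, the uniform background, the pseudocount of 1 and the function names are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build log-odds scores from an ungapped alignment.

    P(a|i) is estimated per column with pseudocounts and compared to a
    uniform background (an assumption made to keep the sketch short).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for a in AMINO_ACIDS:
            p = (column.count(a) + pseudocount) / total
            scores[a] = math.log2(p / background)
        pssm.append(scores)
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence that is as wide as the matrix."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pssm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pssm), key=lambda hit: hit[1])


# Hypothetical ungapped alignment of a four-residue motif.
pssm = build_pssm(["ACDK", "ACDR", "ACEK"])
print(best_hit("GGGACDKGGG", pssm))
```

In practice a score threshold decides which windows count as occurrences of the domain. The sketch also assumes the sequence contains only the 20 standard residue letters; any other character would raise a `KeyError`.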
![Page 18: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/18.jpg)
HMM: Hidden Markov Model
• Markov model: a way of describing a process that goes through a series of states.
• Each state has a probability of transitioning to the other states.
• x_i is a random variable representing the state at step i.
[Diagram: chain of states x1 → x2 → x3 → x4.]
![Page 19: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/19.jpg)
HMM: Markov Model
• Example:
  • States are {0,1}
x1 =0 x2 =0 x3 =0 x4 =0
x1 =1 x2 =0 x3 =0 x4 =1
x1 =0 x2 =1 x3 =1 x4 =1
![Page 20: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/20.jpg)
HMM: Markov Model
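• Transition matrix:
$$A = (a_{ij}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{pmatrix}, \qquad a_{ij} = P(x_{k+1} = j \mid x_k = i)$$
[Diagram: chain of states x1 → x2 → x3 → x4.]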
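As a small worked example, the probability of a whole state sequence under a Markov model is the probability of the first state times the product of the transition probabilities along the chain. The sketch below uses the two matrix rows shown on this slide (0.6/0.4 and 0.2/0.8; which row belongs to which state is an assumption) and a uniform initial distribution, which the slide does not give and is assumed here.

```python
# A[i][j] = P(x_{k+1} = j | x_k = i); row order is an assumption.
A = [[0.6, 0.4],
     [0.2, 0.8]]

# Assumed initial distribution P(x_1); not stated on the slide.
INITIAL = [0.5, 0.5]


def chain_probability(states, transitions, initial):
    """P(x_1, ..., x_n) = P(x_1) * product of P(x_{k+1} | x_k)."""
    p = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= transitions[prev][nxt]
    return p


# The three example state sequences from the earlier slide.
for states in ([0, 0, 0, 0], [1, 0, 0, 1], [0, 1, 1, 1]):
    print(states, chain_probability(states, A, INITIAL))
```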
![Page 21: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/21.jpg)
HMM: Markov Model
• State transition example:
  • States are the nucleotides A, T, G, C.
![Page 22: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/22.jpg)
HMM: Hidden Markov Model
• Hidden Markov model:
  • Each state x emits an output y, at a specific probability.
  • We only know the output (observations).
  • Thus, the states are hidden.
[Diagram: hidden states x1 → x2 → x3 → x4, each state x_k emitting an output y_k.]
![Page 23: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/23.jpg)
HMM: Hidden Markov Model
• Example: states are {0,1}, output {0,1}
y1 =1 y2 =1 y3 =0 y4 =0
x1 =0 x2 =1 x3 =1 x4 =1
y1 =1 y2 =0 y3 =1 y4 =0
x1 =1 x2 =0 x3 =0 x4 =1
![Page 24: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/24.jpg)
HMM: Hidden Markov Model
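• Emission matrix (rows index the state x, columns the output y):
$$B = (b_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.85 & 0.15 \end{pmatrix}, \qquad b_{ij} = P(y_k = j \mid x_k = i)$$
[Diagram: hidden states x1 → x2 → x3 → x4, each state x_k emitting an output y_k.]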
![Page 25: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/25.jpg)
HMM: What can we do with it?
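• Given (A, B):
  • Probability of given states and outputs:
$$P(x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n) = P(x_1)\,P(y_1 \mid x_1)\,P(x_2 \mid x_1)\,P(y_2 \mid x_2) \cdots P(x_n \mid x_{n-1})\,P(y_n \mid x_n)$$
  • Probability of a given output sequence:
$$P(y_1 y_2 \cdots y_n) = \sum_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n)$$
  • Most likely sequence of states that generated a given output sequence:
$$\max_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n \mid y_1 y_2 \cdots y_n)$$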
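The three quantities above can be computed as sketched below: the joint probability directly from its factorization, the output probability with the forward recursion (which avoids summing over all state sequences explicitly), and the most likely state sequence with the Viterbi recursion. The matrix values are those of the two-state example (row order assumed), and the uniform initial distribution `PI` is an assumption; the slides name neither algorithm nor give these function names.

```python
A = [[0.6, 0.4], [0.2, 0.8]]      # A[i][j] = P(x_{k+1}=j | x_k=i)
B = [[0.1, 0.9], [0.85, 0.15]]    # B[i][j] = P(y_k=j | x_k=i)
PI = [0.5, 0.5]                   # assumed initial distribution P(x_1)


def joint(states, outputs):
    """P(x_1..x_n, y_1..y_n) from the factorization."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]] * B[states[k]][outputs[k]]
    return p


def forward(outputs):
    """P(y_1..y_n): sum over all state paths, computed recursively."""
    n = len(PI)
    f = [PI[i] * B[i][outputs[0]] for i in range(n)]
    for y in outputs[1:]:
        f = [sum(f[i] * A[i][j] for i in range(n)) * B[j][y] for j in range(n)]
    return sum(f)


def viterbi(outputs):
    """Most likely state path and its joint probability with the outputs."""
    n = len(PI)
    v = [PI[i] * B[i][outputs[0]] for i in range(n)]
    back = []
    for y in outputs[1:]:
        # Best predecessor for each state j at this step.
        ptr = [max(range(n), key=lambda i: v[i] * A[i][j]) for j in range(n)]
        v = [v[ptr[j]] * A[ptr[j]][j] * B[j][y] for j in range(n)]
        back.append(ptr)
    state = max(range(n), key=lambda i: v[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1], max(v)


outputs = [1, 1, 0, 0]
print(forward(outputs))
print(viterbi(outputs))
```

Maximizing the joint probability over state paths gives the same path as maximizing the conditional probability, because the two differ only by the constant factor P(y_1 ... y_n).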
![Page 26: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/26.jpg)
HMM: What can we do with it?
• Learning:
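• Given output sequences (and, when available, the corresponding state sequences), calculate the most probable (A, B).
• Easy when the states are known: count the observed transitions and emissions.
• Otherwise: use a training algorithm (for example, Baum-Welch).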
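For the easy case, where the state sequences are known, the most probable (A, B) are simply normalized counts of the observed transitions and emissions. The sketch below uses the two state/output examples from the earlier slide as training data; the function name `estimate_parameters` and the optional pseudocount are assumptions.

```python
def estimate_parameters(state_seqs, output_seqs, n_states, n_outputs,
                        pseudocount=0.0):
    """Estimate (A, B) by counting, given known state sequences.

    pseudocount > 0 avoids zero probabilities (and empty rows) when
    some transition or emission never appears in the training data.
    """
    a = [[pseudocount] * n_states for _ in range(n_states)]
    b = [[pseudocount] * n_outputs for _ in range(n_states)]
    for states, outputs in zip(state_seqs, output_seqs):
        for prev, nxt in zip(states, states[1:]):
            a[prev][nxt] += 1          # count transition prev -> nxt
        for s, y in zip(states, outputs):
            b[s][y] += 1               # count emission of y from state s

    def normalize(rows):
        return [[c / sum(row) for c in row] for row in rows]

    return normalize(a), normalize(b)


# State and output sequences taken from the two-state example slide.
state_seqs = [[0, 1, 1, 1], [1, 0, 0, 1]]
output_seqs = [[1, 1, 0, 0], [1, 0, 1, 0]]
A_hat, B_hat = estimate_parameters(state_seqs, output_seqs, 2, 2)
print(A_hat)
print(B_hat)
```

With so little data the estimates are rough; they illustrate the counting step only and are not expected to reproduce the matrices shown on the slides.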
![Page 27: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/27.jpg)
HMM: Profile HMM
• Use HMM to represent sequence families.
• A particular type of HMM suited to modeling multiple alignments.
• (Assume we have a multiple alignment).
![Page 28: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/28.jpg)
HMM: Trivial profile HMM
• We begin with ungapped regions.
• Each position corresponds to a state.
• Transitions are of probability 1.
![Page 29: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/29.jpg)
HMM: Trivial profile HMM
• Let e_i(a) be the independent probability of observing amino acid a in position i.
• The probability of a new sequence x, according to the model M:
$$P(x \mid M) = \prod_{i=1}^{N} e_i(x_i)$$
![Page 30: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/30.jpg)
HMM: Trivial profile HMM
• We can score the sequence x:
$$S = \sum_{i=1}^{N} \log \frac{e_i(x_i)}{q_{x_i}}$$
• Where q_{x_i} is the probability of residue x_i under a random model.
![Page 31: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/31.jpg)
HMM: Trivial profile HMM
• Consider the values
$$\log \frac{e_i(x_i)}{q_{x_i}}$$
• They behave like elements in a score matrix.
• The trivial profile HMM is equivalent to a PSSM.
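The equivalence can be checked numerically. The sketch below computes the log-odds score of a sequence in two ways: once from the model probability and the random-model probability, and once by summing precomputed per-position entries as a PSSM would. The reduced four-letter alphabet, the emission values and the uniform background are hypothetical, chosen only to keep the example small.

```python
import math

ALPHABET = "ACDE"                      # reduced toy alphabet (assumption)
q = {a: 0.25 for a in ALPHABET}        # random (background) model

# e[i][a]: probability of residue a at position i (hypothetical values).
e = [
    {"A": 0.7, "C": 0.1, "D": 0.1, "E": 0.1},
    {"A": 0.1, "C": 0.7, "D": 0.1, "E": 0.1},
    {"A": 0.1, "C": 0.1, "D": 0.4, "E": 0.4},
]


def model_probability(x):
    """P(x | M) = product over positions of e_i(x_i)."""
    return math.prod(e[i][a] for i, a in enumerate(x))


def random_probability(x):
    """Probability of x under the random model."""
    return math.prod(q[a] for a in x)


def log_odds_score(x):
    """S = log( P(x | M) / P(x | random) )."""
    return math.log(model_probability(x) / random_probability(x))


# The same information stored as a score matrix, one row per position.
score_matrix = [{a: math.log(e_i[a] / q[a]) for a in ALPHABET} for e_i in e]


def pssm_score(x):
    """Sum of matrix entries, exactly how a PSSM scores a window."""
    return sum(score_matrix[i][a] for i, a in enumerate(x))


print(log_odds_score("ACD"), pssm_score("ACD"))
```

Both functions return the same value up to floating-point rounding, since the logarithm of a product of ratios is the sum of the logarithms of the ratios.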
![Page 32: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/32.jpg)
HMM: profile HMM
• Let’s untrivialize by allowing for gaps: insertions and deletions.
• Start off with the PSSM HMM.
![Page 33: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/33.jpg)
HMM: profile HMM
• Handling insertions:
• Introduce new states I_j that model insertions after match position j.
• These states emit according to the random (background) distribution.
![Page 34: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/34.jpg)
HMM: profile HMM
• The score of a gap of length k:
$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$$
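A short sketch of this formula shows that the gap score has a fixed part (entering and leaving the insert state) plus a part that grows linearly with the gap length, which is the behaviour of an affine gap penalty. The transition probabilities used here are hypothetical.

```python
import math


def insertion_score(k, a_match_insert, a_insert_insert, a_insert_match):
    """Score of an insertion of length k >= 1 after match position j.

    a_match_insert:  transition M_j -> I_j      (gap open)
    a_insert_insert: transition I_j -> I_j      (gap extend)
    a_insert_match:  transition I_j -> M_{j+1}  (gap close)
    """
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return (math.log(a_match_insert)
            + (k - 1) * math.log(a_insert_insert)
            + math.log(a_insert_match))


# Hypothetical transition probabilities.
for k in (1, 2, 5):
    print(k, insertion_score(k, 0.05, 0.4, 0.6))
```

Each extra inserted residue changes the score by log a_{I_j I_j}, regardless of how long the gap already is.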
![Page 35: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/35.jpg)
HMM: profile HMM
• Handling deletions:
• Introduce silent states Dj.
• These states do not emit.
![Page 36: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/36.jpg)
HMM: profile HMM
• The complete profile HMM:
![Page 37: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/37.jpg)
Internet resources
• Databases of protein families.
• Family information and identification.
• Considerations:
  – Type of representation (pattern, PSSM, HMM).
  – Choice of seed multiple alignment proteins.
  – Quality control.
  – Database features (links, annotations, views).
  – Database specificity (organism, functions).
![Page 38: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/38.jpg)
Pfam: Home
![Page 39: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/39.jpg)
Pfam
• Protein families database of alignments and HMMs
• Uses profile-HMMs to represent families.
• For each family in Pfam you can:
  – Look at multiple alignments
  – View protein domain architectures
  – Examine species distribution
  – Follow links to other databases
  – View known protein structures
![Page 40: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/40.jpg)
Pfam: Databases
2 databases:
• Pfam-A – curated multiple alignments.
  – Grows slowly.
  – Quality controlled by experts.
• Pfam-B – automatic clustering (ProDom derived).
  – Complements Pfam-A.
  – New sequences instantly incorporated.
  – Unchecked: false positives, etc.
![Page 41: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/41.jpg)
Pfam: Features
• Search by: Sequence, keyword, domain, taxonomy.
• Browsing by family or genome.
• Evolutionary tree
![Page 42: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/42.jpg)
Pfam: Construction
• Source of seed alignments:
  – Pfam-B families.
  – Published articles.
  – 'Domain hunting' studies.
  – Occasionally, entries from other databases (e.g. MEROPS for peptidases).
![Page 43: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/43.jpg)
Pfam: Domain information
![Page 44: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/44.jpg)
Pfam: Domain organization
![Page 45: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/45.jpg)
Pfam: Multiple alignment
![Page 46: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/46.jpg)
Pfam: HMM logo
![Page 47: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/47.jpg)
Pfam: Species distribution
![Page 48: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/48.jpg)
Pfam: Genome comparison
![Page 49: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/49.jpg)
PROSITE
• Database of protein families.
• Matching according to simple patterns or PSSM profiles.
• Browsing all proteins of a specific family.
• Latest release knows 1696 protein families.
![Page 50: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/50.jpg)
PROSITE: Features
• Comprehensive domain documentation.
• All profile matches checked by experts.
• Specificity/sensitivity:
  • Specificity: true-pos / all-pos, where all-pos = true-pos + false-pos (this ratio is also known as precision).
  • Sensitivity: true-pos / (true-pos + false-neg)
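The two ratios can be computed as sketched below, following the definitions on this slide (specificity taken over all reported matches). The counts are hypothetical and do not come from any PROSITE entry.

```python
def specificity(true_pos, false_pos):
    """Fraction of reported matches that are real family members."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of real family members that the model reports."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts for one pattern or profile.
true_pos, false_pos, false_neg = 90, 10, 30
print(f"specificity = {specificity(true_pos, false_pos):.2f}")
print(f"sensitivity = {sensitivity(true_pos, false_neg):.2f}")
```

A strict pattern tends to raise specificity at the cost of sensitivity, and a permissive one does the opposite, so both numbers are needed to judge a model.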
![Page 51: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/51.jpg)
PROSITE: Example
• Specificity of Zinc finger C2H2 type domain
![Page 52: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/52.jpg)
SMART
![Page 53: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/53.jpg)
SMART
• Simple Modular Architecture Research Tool
• Identification and annotation of genetically mobile domains and the analysis of domain architectures.
• SMART consists of a library of HMMs.
• Knows 665 HMMs to date.
![Page 54: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/54.jpg)
SMART: Features
• Finding proteins containing specific domains, i.e. of the same family
• Function prediction
• Sub-cellular localization
• Binding partners
• Architecture
• Alternative splicing information
• Orthology information
![Page 55: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/55.jpg)
SMART: Domain selection example
Tyrosine kinase (TyrKc) AND Transmembrane region (TRANS)
![Page 56: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/56.jpg)
InterPro
• InterPro combines 9 other databases, such as SMART, Pfam, ProDom and more.
• Queries can use many different methods (as the other databases use different methods).
• However, thresholds are predefined and cannot be changed for those methods.
![Page 57: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/57.jpg)
InterPro
• Provides more results, but can sometimes be redundant.
• Coverage statistics:
  • 93% of Swiss-Prot v42.5 – 128540 out of 138922 proteins
  • 81% of TrEMBL v25.5 – 819966 out of 1013263 proteins
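The quoted percentages follow directly from the protein counts, as this quick check shows. The dictionary layout and function name are illustrative; the numbers are the ones on the slide.

```python
# (proteins with at least one InterPro match, total proteins)
coverage = {
    "Swiss-Prot v42.5": (128540, 138922),
    "TrEMBL v25.5": (819966, 1013263),
}


def percent_covered(matched, total):
    """Coverage as a percentage of all proteins in the database."""
    return 100.0 * matched / total


for name, (matched, total) in coverage.items():
    print(f"{name}: {percent_covered(matched, total):.1f}%")
```

Rounded to whole percentages, these give the 93% and 81% quoted above.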
![Page 58: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/58.jpg)
InterPro: Features
• Searching by Protein/DNA sequences
• Finding domains & homologs
• List of InterPro entries of type:
  – Family
  – Domain
  – Repeat
  – PTM (post-translational modification)
  – Binding site
  – Active site
  – Keyword
![Page 59: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/59.jpg)
InterPro: Example
• Kringle domain
![Page 60: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/60.jpg)
Research Example: Introduction
• Goal: The systematic identification of novel protein domain families.
• Using computational methods.
![Page 61: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/61.jpg)
Research Example: Method
Derive set of 107 nuclear domains
Extract proteins
Extract unannotated regions
Cluster sequences
Take longest member
PSI-BLAST
Investigate homologous regions
Manual confirmation
![Page 62: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/62.jpg)
Research Example: Results
• 28 New Domains identified:
• 15 domains in diverse contexts, in different species.
• 3 species-specific domains.
• 7 domains with weak similarity to previously described domains.
• 3 extension domains.
![Page 63: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/63.jpg)
Predictions of Function
• On the basis of reports in literature and/or occurrence with other identified domains, functional features can be predicted for our novel domain families.
• Examples:
  – Chromatin binding
  – Protein interaction
  – Predicted sub-cellular localization
![Page 64: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/64.jpg)
Predictions of Function: Chromatin-Binding example
• The novel domain CSZ is contained in protein SPT6, which regulates transcription via chromatin structure modification.
• SPT6 has a histone-binding capability, experimentally confirmed.
• Other domains (S1, SH2) in SPT6 are unlikely to bind histones or chromatin.
• Conclusion: CSZ has a predicted histone binding function.
![Page 65: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/65.jpg)
Predictions of Function: Localization example
• Some of the novel domains are only found within proteins from the initial set of nuclear domains.
• This predicts that these domains have a nuclear function.
• The other domains are likely to have roles in both nucleus and cytoplasm.
![Page 66: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/66.jpg)
Conclusion
• Domains are the functional units of proteins.
• Identifying a domain within a new protein may teach us much about it.
• There are several types of models to represent domains.
• These models can also be used to identify the domain they represent.
• Many Internet databases are available to catalogue and identify families.
• Protocol to identify new domains using old ones.
![Page 67: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/67.jpg)
Resources
• Pfam: http://www.sanger.ac.uk/Software/Pfam/
• SMART: http://smart.embl-heidelberg.de/
• PROSITE: http://www.expasy.org/prosite/
• InterPro: http://www.ebi.ac.uk/interpro/
![Page 68: Identification of Protein Domains](https://reader035.vdocuments.us/reader035/viewer/2022081505/56815d4f550346895dcb59d4/html5/thumbnails/68.jpg)
The End