Homology Modeling
Roberto Lins, EPFL - summer semester 2005
Disclaimer: course material is mainly taken from: P.E. Bourne & H. Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton, Bioinformatics: Genes, Proteins and Computers; A.M. Lesk, Introduction to Bioinformatics; A.D. Baxevanis & B.F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins; several online materials (George Washington University, University of Houston, Tel-Aviv University) and resources (RCSB, NCBI, SWISS-PROT), as well as personal research data.
TERTIARY STRUCTURE (fold)
Genome
Expressome
Proteome
Metabolome
Functional Genomics
[Figure: diagram with "algorithm" and "database" labels]
Limitations of Experimental Methods
Annotated proteins in the databank: ~100,000
Proteins with known structure: ~5,000!
Total number including ORFs: ~700,000
An ORF, or Open Reading Frame, is a region of the genome that codes for a protein.
ORFs have been identified by whole-genome sequencing efforts. ORFs with no known function are termed orphans.
Dataset for analysis
Structural Biology Consortia: Brute Force Approach Towards Structure Elucidation
Employment of an army of PhDs & postdocs
Aim to solve about 400 structures a year
Large-scale expression & crystallization attempts
+ Enhances the statistical base for inferring sequence-structure relationships
- Basic strategies remain the same
- No (known) new tricks
* "Unrelenting" ones will be ignored
Can we predict structure from sequence?
GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGATCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAGTTAACCTAA...
Structure is much more conserved than sequence during evolution
Comparative Modeling (Homology Modeling)
Basis
The higher the similarity, the higher the confidence in the modeled structure
Limited applicability
A large number of proteins and ORFs have no similarity to proteins with known structure
What's homology modeling?
Predicts the three-dimensional structure of a given protein sequence (target) based on an alignment to one or more known protein structures (templates).
If similarity between the target sequence and the template sequence is detected, structural similarity can be assumed.
In general, at least about 30% sequence identity is required to generate a useful model.
It can be used to understand function, activity, specificity, etc.
It is of interest to drug companies wishing to do structure-aided drug design.
A keystone of structural proteomics
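The sequence identity figure quoted above can be computed directly from a pairwise alignment. The sketch below is only illustrative: it assumes identity is counted over columns where both sequences have a residue (other conventions divide by the shorter sequence length or the full alignment length), and the two aligned sequences are made up.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: the denominator is the number of aligned (non-gap) columns.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip gap columns
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


# Hypothetical aligned sequences, for illustration only.
target = "MKT-AYIAKQR"
template = "MKSLAYLAK-R"
print(f"{percent_identity(target, template):.1f}% identity")
```

Whatever convention is chosen, it should be stated alongside the number, since the same alignment can give noticeably different percentages.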
Homology modeling - applications
Structure-based assessment of target druggability
Structure-guided design of mutagenesis experiments
Tool compound design for probing biological function
Homology model based ligand design
Design of in vitro test assays
Structure-based prediction of drug metabolism and toxicity
Accuracy and application of protein structure
Does sequence similarity imply structural similarity?
[Figure: RMSD of backbone atoms (Å), from 0.0 to 2.5, plotted against % identical residues in core, from 100 down to 0. Labeled regions: safe zone (thanks to evolution!) and twilight zone. Chothia & Lesk, 1986]
$$\mathrm{RMSD} = \sqrt{\frac{1}{N_{atoms}} \sum_{i=1}^{N_{atoms}} d_i^{2}}$$
N_atoms = total number of atoms; d_i = distance between the coordinates of atom i at t0 and tn, when the structures are superimposed.
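A minimal Python sketch of the RMSD definition above. It assumes the two coordinate sets are already superimposed and list the same atoms in the same order; the superposition step itself is not shown, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superimposed coordinate sets.

    coords_a, coords_b: sequences of (x, y, z) tuples, same atoms in the same order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # d_i squared for atom i
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates: every atom is displaced by 1 Å along z.
structure_t0 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
structure_tn = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
print(rmsd(structure_t0, structure_tn))
```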
My target sequence has over 30% sequence identity with a known protein structure, so I want to generate a 3D model.
What do I have to do?
Structure prediction by homology modeling
Homology modeling makes two fundamental assumptions:
– The structure of a protein is determined by its primary amino acid sequence (Anfinsen).
– During evolution, the structure of a protein has changed much more slowly than its sequence.
• Similar sequences adopt practically identical structures, and distantly related sequences fold into similar structures.
In summary: homology modeling steps
1) Template recognition & initial alignment
2) Alignment correction
3) Backbone generation
4) Loop modeling
5) Side-chain modeling
6) Model optimization
7) Model validation
Template recognition & initial alignment
Select the best template from a library of known protein structures derived from the PDB
Templates can be found by using the target sequence as a query in a FASTA or BLAST search
Gaining confidence in template searching
Once a suitable template is found, a literature search on the relevant fold can determine what biological role it plays
Does this match the biological/biochemical function that you expect?
Further considerations:
- Ligand(s) present?
- Resolution of the template
- Family of proteins
- Multiple templates?
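[Figure: gene tree with a duplication event and a speciation event, leading to species 1 and species 2]
Paralogues (arising from duplication): function may be related or very different!
Orthologues (arising from speciation): function more likely to be conserved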
Proteins are homologous if they are related by divergence from a common ancestor
In summary: there are two types of homologs
- Orthologs: proteins that carry out the same function in different species
- Paralogs: proteins that perform different, but related, functions within one organism
Alignment of the target onto the template
Correct alignment is necessary to create the most probable 3D structure of the target
If the sequences are aligned incorrectly, this will lead to false positive or false negative results
Important to consider:
- algorithms
- scoring alignments
- gap penalties
Identify SCRs (Structurally Conserved Regions) and SVRs (Structurally Variable Regions)
The (true) alignment indicates the evolutionary process giving rise to the different sequences, starting from the same ancestor sequence and then changing through mutations (insertions, deletions, and substitutions)
Alignment Outcome
Alignment vs. databases
Task: given a query sequence and millions of databaserecords, find the optimal alignment between thequery and a record
AGTCTCCAGTTATGCCA…
Tool: given two sequences, there exists an algorithm to find the best alignment.
Naïve solution: apply the algorithm to each of the records, one by one.
Problem: an exact algorithm is just too slow to run millions of times (even a linear-time algorithm will run slowly on a huge database).
Solution:
- run in parallel (expensive)
- use a fast (heuristic) method to discard irrelevant records, and then apply the exact algorithm to the remaining few
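The "fast filter, then exact algorithm" idea above can be sketched as follows. This is not how BLAST or FASTA are actually implemented: the shared k-mer filter, the scoring values (match=2, mismatch=-1, gap=-2) and the three database records are assumptions chosen for illustration. The exact step is a plain Smith-Waterman local alignment score.

```python
def kmers(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Exact best local alignment score (linear gap penalty), O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database, k=3, min_shared=2):
    """Discard records sharing few k-mers with the query, then score the rest exactly."""
    query_words = kmers(query, k)
    hits = []
    for name, seq in database.items():
        if len(query_words & kmers(seq, k)) < min_shared:
            continue  # heuristic filter: skip the expensive exact step
        hits.append((smith_waterman_score(query, seq), name))
    return sorted(hits, reverse=True)


# Hypothetical database records, for illustration only.
database = {
    "rec1": "AGTCTCCAGTTATGCCA",
    "rec2": "TTTTTTTTTTTTTTTT",
    "rec3": "GGAGTCTCCAGAAAC",
}
query = "GTCTCCAGTT"
results = search(query, database)
print(results)
```

The filter can miss true hits that share no exact word with the query, which is the price paid for speed.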
Sequence alignment algorithms
Used to calculate a similarity score to infer sequence homology between two sequences
Examples: the two most used in homology modeling are:
BLAST: the general strategy is to optimise the maximal segment pair (MSP) score; BLAST computes similarity, not alignment (Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215:403-410).
FastA (local alignment): searches for both full and partial sequence matches, i.e., local similarity is obtained; more sensitive than BLAST, but slower; many gaps may represent a problem (Pearson, W. R., Lipman, D. J., P.N.A.S. (1988) 85:2444-2448).
Sequence alignment outputs
[Figure: example FastA and BLAST outputs]
Alignment corrections
Alignments are scored (substitution score) in order to define the similarity between two amino acid residues in the sequences.
A substitution score is calculated for each aligned pair of letters.
Substitution matrices:
- are meant to reflect the true probabilities of mutations occurring over a period of evolution
- PAM family: based on global alignments of closely related proteins. Mutation probability matrix.
- BLOSUM family: based on observed alignments; not extrapolated from comparisons of closely related proteins.
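Both matrix families store log-odds scores: how much more often two residues are seen aligned than expected by chance. A hedged sketch of that calculation follows; the frequencies are made-up numbers, not values from any published matrix, and the half-bit scaling (scale=2.0) is an assumption borrowed from the usual BLOSUM62 convention.

```python
import math


def log_odds_score(p_ij, q_i, q_j, scale=2.0):
    """Rounded log-odds substitution score.

    p_ij: observed frequency of residues i and j being aligned
    q_i, q_j: background frequencies of residues i and j
    scale: units per bit (2.0 gives half-bit units)
    """
    return round(scale * math.log2(p_ij / (q_i * q_j)))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.02, 0.05, 0.05))     # pair seen more often than chance: positive
print(log_odds_score(0.00125, 0.05, 0.05))  # pair seen less often than chance: negative
```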
Gap Penalties
A gap is one or more empty spaces in one sequence aligned with letters in the other sequence.
These empty spaces may or may not be treated as penalties:
- a higher penalty score is assigned to the first missing amino acid than to the subsequent ones; this accounts for the fact that a single mutational event can insert or delete many residues at a time
Insertion/deletion of structural domains can 'easily' be done at loop sites
[Figure: structure with N- and C-termini labeled]
The overall alignment score is the sum of similarity and gap scores: the higher the overall alignment score, the better the alignment (more conserved)
Corrections by hand may still be needed!
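To make the scoring scheme concrete, here is a small sketch that scores an existing gapped alignment as the sum of substitution scores and gap penalties, charging more for opening a gap than for extending it. The numeric values (match=2, mismatch=-1, gap_open=-5, gap_extend=-1) and the simple match/mismatch scheme standing in for a substitution matrix are assumptions for illustration.

```python
def alignment_score(aligned_a, aligned_b, match=2, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Sum of similarity scores and affine gap penalties for a given alignment.

    The first position of each gap costs gap_open; each further position
    of the same gap costs gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # which sequence ("a" or "b") had a gap in the previous column
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            score += gap_extend if current == prev_gap else gap_open
            prev_gap = current
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


# One long gap is penalized less than several short ones.
print(alignment_score("ACGT---ACGT", "ACGTTTTACGA"))
print(alignment_score("A-C-G", "ATCTG"))
```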
Multiple Sequence Alignments
Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes:
- to characterize protein families and identify shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search reveals homologies to several sequences);
- to determine the consensus sequence of several aligned sequences;
- to help predict the secondary and tertiary structures of new sequences;
- as a preliminary step in molecular evolution analysis, using phylogenetic methods to construct phylogenetic trees.
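As an illustration of the consensus-sequence use case, the sketch below takes an already computed multiple alignment and picks the most frequent symbol in each column. The three aligned sequences are made up, and real tools typically apply thresholds or ambiguity codes rather than a simple majority vote.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Majority-vote consensus of equal-length aligned sequences.

    Ties are resolved in favour of the symbol seen first in the column.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned_seqs):
        symbol, _count = Counter(column).most_common(1)[0]
        result.append(symbol)
    return "".join(result)


# Hypothetical alignment, for illustration only.
msa = [
    "MKTAY-AK",
    "MKSAYIAK",
    "MRTAYIAK",
]
print(consensus(msa))
```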
Backbone generation
Uses known structurally conserved regions to generate coordinates for the unknown
For SCRs - copy coordinates from known structures
For variable regions (VR) - copy from the known structure if the residue types are similar; otherwise, use databases of loop fragment sequences.
Template-based fragment assembly
a) Find structurally conserved regions
b) Build model core
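Copying coordinates for the conserved regions amounts to walking along the target-template alignment. A rough sketch, assuming one coordinate per template residue (e.g. Cα only) and a made-up alignment; target residues with no template equivalent are left as None, to be built later by loop modeling.

```python
def copy_backbone(aligned_target, aligned_template, template_coords):
    """Transfer template coordinates to aligned target residues.

    template_coords: one (x, y, z) tuple per template residue, in sequence order.
    Returns a list of (target_residue, coords_or_None), one entry per target residue.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    model = []
    t_index = 0  # position in template_coords
    for tgt, tpl in zip(aligned_target, aligned_template):
        coord = None
        if tpl != "-":
            coord = template_coords[t_index]
            t_index += 1
        if tgt != "-":
            # Insertions in the target (template gap) get no coordinates yet.
            model.append((tgt, coord))
    return model


# Hypothetical alignment and C-alpha coordinates, for illustration only.
aligned_target = "AC-DEF"
aligned_template = "ACG--F"
template_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = copy_backbone(aligned_target, aligned_template, template_ca)
for residue, coord in model:
    print(residue, coord)
```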
Loop modeling
1. Database search for segments from known protein structures fitting fixed end-points
2. Molecular mechanics/molecular dynamics
3. Combination of 1+2
Ab initio rebuilding (e.g., Monte Carlo, MD, etc.) to build missing loops
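The database-search approach (option 1) can be sketched as a filter on end-to-end distance: a candidate fragment must have the right number of residues and span roughly the same distance as the two fixed anchor points. Everything below is an assumption for illustration: the fragment library, the coordinates, the 1.0 Å tolerance, and the convention that each fragment includes both anchor residues. Real methods also compare anchor orientation and check for clashes.

```python
import math


def find_loop_candidates(anchor_start, anchor_end, loop_length, fragments,
                         tolerance=1.0):
    """Rank library fragments whose end-to-end distance matches the anchors.

    fragments: dict of name -> list of (x, y, z), including both anchor residues.
    Returns (deviation, name) pairs sorted by increasing deviation.
    """
    target_span = math.dist(anchor_start, anchor_end)
    candidates = []
    for name, coords in fragments.items():
        if len(coords) != loop_length + 2:
            continue  # wrong number of residues for this loop
        deviation = abs(math.dist(coords[0], coords[-1]) - target_span)
        if deviation <= tolerance:
            candidates.append((deviation, name))
    return sorted(candidates)


# Hypothetical fragment library, for illustration only.
fragments = {
    "fragA": [(0, 0, 0), (3, 2, 0), (6, 2, 0), (9, 0, 0)],  # spans 9 Å
    "fragB": [(0, 0, 0), (2, 3, 0), (4, 3, 0), (5, 0, 0)],  # spans 5 Å
    "fragC": [(0, 0, 0), (3, 1, 0), (6, 0, 0)],             # too short
}
candidates = find_loop_candidates((10, 0, 0), (10, 5.5, 0), 2, fragments)
print(candidates)
```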
Side chain modeling
1. Use of rotamer libraries (backbone dependent)
2. Molecular mechanics optimization
- Dead-end elimination (exact pruning criterion; may not converge to a single solution)
- Monte Carlo (heuristic)
- Branch & Bound (exact)
3. Mean-field methods
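Side-chain placement can be viewed as choosing one rotamer per residue so that the total energy (self terms plus pairwise terms) is minimal. The sketch below simply enumerates every combination, which is exact but only feasible for toy problems; the methods listed above exist precisely to avoid this enumeration. The energy values are invented for illustration.

```python
from itertools import product


def best_rotamers(self_energy, pair_energy):
    """Exhaustively find the lowest-energy rotamer combination.

    self_energy[i][r]: energy of rotamer r at position i
    pair_energy[(i, r, j, s)]: interaction energy for i < j (missing pairs count as 0)
    """
    best_combo, best_energy = None, float("inf")
    for combo in product(*(range(len(e)) for e in self_energy)):
        energy = sum(self_energy[i][r] for i, r in enumerate(combo))
        for i in range(len(combo)):
            for j in range(i + 1, len(combo)):
                energy += pair_energy.get((i, combo[i], j, combo[j]), 0.0)
        if energy < best_energy:
            best_combo, best_energy = combo, energy
    return best_combo, best_energy


# Hypothetical energies: the individually best rotamers (1 and 1) clash.
self_energy = [[1.0, 0.0], [0.5, 0.2]]
pair_energy = {(0, 1, 1, 1): 5.0}
combo, energy = best_rotamers(self_energy, pair_energy)
print(combo, energy)
```

The example shows why picking the best rotamer at each position independently is not enough: the pairwise term changes the optimum.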
Model optimization
Molecular mechanics methods
Model validation/evaluation
Model should be evaluated for:
- correctness of the overall fold/structure
- errors over localized regions
- stereochemical parameters: bond lengths, angles, etc.
Some software for model verification:
- Procheck http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
- WHAT IF http://swift.cmbi.kun.nl/whatif
- PROSA II http://www.came.sbg.ac.at/Services/prosa.html
- Profile 3D & Verify 3D http://shannon.mbi.ucla.edu/DOE/Services
The Ramachandran plot
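A Ramachandran plot shows the backbone dihedral angles phi (C of the previous residue, N, Cα, C) against psi (N, Cα, C, N of the next residue) for every residue. The sketch below computes a dihedral angle from four points; the coordinates are simple test points rather than real atoms, and reading atoms from a structure file is left out.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points.

    For phi pass (C[i-1], N[i], CA[i], C[i]); for psi pass (N[i], CA[i], C[i], N[i+1]).
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Simple test points around the z axis (not real atomic coordinates).
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # cis arrangement
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # quarter turn
```

Computing phi and psi for each residue of a model and comparing them with the allowed regions of the plot is what tools such as Procheck automate.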
Profile 3D & Verify 3D:
- verify newly solved structures or homology models
- find structures/folds compatible with a given sequence
- find sequences compatible with a known structure/fold from a database of sequences