homology modeling - biojuncture · homology modeling - applications structure-based assessment of...

Homology Modeling

Roberto Lins, EPFL - summer semester 2005

Disclaimer: course material is mainly taken from: P.E. Bourne & H. Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton, Bioinformatics: genes, proteins & computers; A.M. Lesk, Introduction to Bioinformatics; A.D. Baxevanis & B.F. Ouellette, Bioinformatics: a practical guide to the analysis of genes and proteins; several online materials (George Washington University, University of Houston, Tel-Aviv University) and resources (RCSB, NCBI, SWISS-PROT) as well as personal research data.

TERTIARY STRUCTURE (fold)

Functional Genomics

[Figure: Genome, Expressome, Proteome, Metabolome; the levels are annotated with "algorithm" and "database" labels]

Limitations of Experimental Methods

Annotated proteins in the databank: ~100,000

Proteins with known structure: ~5,000!

Total number including ORFs: ~700,000

An ORF, or Open Reading Frame, is a region of a genome that codes for a protein

ORFs have been identified by whole-genome sequencing efforts

ORFs with no known function are termed orphans

Dataset for analysis

Structural Biology Consortia: Brute Force Approach Towards Structure Elucidation

Employment of an army of Ph.D. students & postdocs

Aim to solve about 400 structures a year

Large-scale expression & crystallization attempts

+ Enhances the statistical base for inferring sequence-structure relationships

– Basic strategies remain the same

– No (known) new tricks

* “Unrelenting” ones will be ignored

Can we predict structure from sequence?

GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGATCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAGTTAACCTAA...

Structure is much more conserved than sequence during evolution

Comparative Modeling (Homology Modeling)

Basis

The higher the similarity, the higher the confidence in the modeled structure

Limited applicability

A large number of proteins and ORFs have no similarity to proteins with known structure

What’s homology modeling?

It predicts the three-dimensional structure of a given protein sequence (target) based on an alignment to one or more known protein structures (templates).

If similarity between the target sequence and the template sequence is detected, structural similarity can be assumed.

In general, at least 30% sequence identity is required to generate a useful model.

It can be used to understand function, activity, specificity, etc.

It is of interest to drug companies wishing to do structure-aided drug design

A keystone of structural proteomics

Homology modeling - applications

Structure-based assessment of target druggability

Structure-guided design of mutagenesis experiments

Tool compound design for probing biological function

Homology model based ligand design

Design of in vitro test assays

Structure-based prediction of drug metabolism and toxicity

Accuracy and application of protein structure

Does sequence similarity imply structure similarity?

[Figure: RMSD of backbone atoms (Å), axis from 0.0 to 2.5, plotted against % identical residues in core, axis from 100 to 0. Regions labeled "Safe zone (thanks to evolution!)" and "Twilight zone". Chothia & Lesk, 1986]

$$\mathrm{RMSD} = \sqrt{\frac{1}{N_{atoms}} \sum_{i=1}^{N_{atoms}} d_i^{2}}$$

$N_{atoms}$ = total number of atoms; $d_i$ = distance between the coordinates of atom $i$ at $t_0$ and $t_n$, when the structures are superimposed.
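The RMSD definition above can be illustrated with a minimal Python sketch. It assumes the two structures are already superimposed and list the same atoms in the same order; the function name `rmsd` and the toy coordinates are illustrative assumptions, not part of the course material.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two already-superimposed coordinate sets.

    Both arguments are lists of (x, y, z) tuples with matching atom order.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # d_i squared: squared distance between equivalent atoms
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))

# Toy coordinates (hypothetical, in Angstrom) for three equivalent atoms
template_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model_atoms = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 2.0)]
print(round(rmsd(template_atoms, model_atoms), 3))
```

The superposition step itself (finding the rotation and translation that minimize the RMSD) is not shown here.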

My target sequence has over 30% sequence identity with a known protein structure, so I want to generate a 3D model.

What do I have to do?

Structure prediction by homology modeling

Homology modeling makes two fundamental assumptions

– The structure of a protein is determined by its primary amino acid sequence (Anfinsen).

– During evolution, the structure of a protein has changed much more slowly than its sequence.

• Similar sequences adopt practically identical structures, and distantly related sequences fold into similar structures.

In summary: homology modeling steps

1) Template recognition & initial alignment

2) Alignment correction

3) Backbone generation

4) Loop modeling

5) Side-chain modeling

6) Model optimization

7) Model validation

Template recognition & initial alignment

Select the best template from a library of known protein structures derived from the PDB

Templates can be found by using the target sequence as a query in a FASTA or BLAST search
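As a rough illustration of ranking candidate templates, the sketch below computes percent sequence identity from pre-aligned target/template pairs and sorts the hits. It assumes the alignments were already produced by a search tool; the template names `tmplA` and `tmplB`, the sequences, and the choice of denominator (columns without gaps) are illustrative assumptions, since several definitions of percent identity are in use.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # gapped columns are not counted in this definition
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0

def rank_templates(hits):
    """Sort hits (template name -> aligned pair) by decreasing identity."""
    scored = [(percent_identity(t, s), name) for name, (t, s) in hits.items()]
    return sorted(scored, reverse=True)

# Hypothetical search hits: template name -> (aligned target, aligned template)
hits = {
    "tmplA": ("MKT-AYIAK", "MKTQAYLAK"),
    "tmplB": ("MKTAYIAK", "MRSAYVGK"),
}
for identity, name in rank_templates(hits):
    print(name, identity)
```

Identity alone is only one criterion; template resolution, bound ligands and biological function also matter when choosing between candidates.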

Gaining confidence in template searching

Once a suitable template is found, a literature search on the relevant fold can determine what biological role it plays

Does this match the biological/biochemical function that you expect?

Further considerations:

Ligand(s) present?

Resolution of the template

Family of proteins

Multiple templates?

[Figure: tree showing a duplication event and a speciation event leading to species 1 and species 2]

Paralogues: function may be related or very different!

Orthologues: function more likely to be conserved

Proteins are homologous if they are related by divergence from a common ancestor

In summary: there are two types of homologs

- Orthologs: proteins that carry out the same function in different species

- Paralogs: proteins that perform different, but related functions within one organism

Alignment of the target onto the template

Correct alignment is necessary to create the most probable 3D structure of the target

If the sequences are aligned incorrectly, the result will be false positives or false negatives

Important to consider:

- algorithms

- scoring alignments

- gap penalties

Identify SCRs (Structurally Conserved Regions) and SVRs (Structurally Variable Regions)

The (true) alignment indicates the evolutionary process giving rise to the different sequences, starting from the same ancestor sequence and then changing through mutations (insertions, deletions, and substitutions)

Alignment Outcome

Alignment vs. databases

Task: given a query sequence and millions of database records, find the optimal alignment between the query and a record

AGTCTCCAGTTATGCCA…

Tool: given two sequences, there exists an algorithm to find the best alignment.

Naïve solution: apply the algorithm to each of the records, one by one.

Problem: an exact algorithm is just too slow to run millions of times (even a linear-time algorithm will run slowly on a huge database).

Solution:

- run in parallel (expensive)

- use a fast (heuristic) method to discard irrelevant records and then apply the exact algorithm to the remaining few
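The "filter first, align exactly afterwards" idea can be sketched as follows. This is a toy illustration under stated assumptions: the heuristic is a simple shared-word (k-mer) test, the exact step is a Needleman-Wunsch global alignment score with a linear gap penalty, and the record names, sequences and scoring values are all hypothetical. Real tools such as BLAST and FASTA use far more refined heuristics and statistics.

```python
def kmers(seq, k):
    """Set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shares_word(query, record, k=3):
    """Fast heuristic filter: keep a record only if it shares a k-mer with the query."""
    return not kmers(query, k).isdisjoint(kmers(record, k))

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Exact optimal global alignment score (Needleman-Wunsch, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def search(query, database, k=3):
    """Discard records failing the heuristic, then score the rest exactly."""
    candidates = {n: s for n, s in database.items() if shares_word(query, s, k)}
    scored = [(global_alignment_score(query, s), n) for n, s in candidates.items()]
    return sorted(scored, reverse=True)

query = "AGTCTCCAGT"
database = {"rec1": "AGTCTCCAGT", "rec2": "AGTCACCAGT", "rec3": "GGGGGGGGGG"}
print(search(query, database))
```

The price of the speed-up is sensitivity: a true but distant relative that shares no exact word with the query is discarded before the exact algorithm ever sees it.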

Sequence alignment algorithms

Used to calculate a similarity score to infer sequence homology between two sequences

Examples: the two most used in homology modeling are:

BLAST: the general strategy is to optimise the maximal segment pair (MSP) score - BLAST computes similarity, not alignment (Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215:403-410)

FastA (local alignment): searches for both full and partial sequence matches, i.e., local similarity is obtained; more sensitive than BLAST, but slower; many gaps may represent a problem (Pearson, W. R., Lipman, D. J., P.N.A.S. (1988) 85:2444-2448).

Sequence alignment outputs

[Figure: example FastA and BLAST outputs]

Alignment corrections

Alignments are scored (substitution score) in order to define the similarity between two amino acid residues in the sequences

A substitution score is calculated for each aligned pair of letters.

Substitution matrices:

- are meant to reflect the true probabilities of mutations occurring over a period of evolution

- PAM family: based on global alignments of closely related proteins. Mutation probability matrix.

- BLOSUM family: based on observed alignments; not extrapolated from comparisons of closely related sequences.

Gap Penalties

A gap is one or more empty spaces in one sequence aligned with letters in the other sequence

These empty spaces may or may not be treated as penalties:

- a higher penalty is assigned to the first missing amino acid than to the subsequent ones; this reflects the fact that a single mutational event can insert or delete many residues at a time

Insertion/deletion of structural domains can ‘easily’ be done at loop sites

[Figure: protein chain with N and C termini labeled]

The overall alignment score is the sum of similarity and gap scores:

the higher the overall alignment score, the better the alignment (more conserved)

Corrections by hand may still be needed!
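To make the scoring concrete, here is a minimal sketch that sums substitution scores and gap penalties over an existing alignment. It assumes a toy substitution score (+2 for identity, -1 otherwise) instead of a real PAM or BLOSUM matrix, and hypothetical penalties of -5 for the first gap position and -1 for each subsequent one; all names and values are illustrative.

```python
def substitution_score(a, b):
    """Toy substitution score; a real application would look up PAM or BLOSUM."""
    return 2 if a == b else -1

def alignment_score(row_a, row_b, gap_open=-5, gap_extend=-1, gap="-"):
    """Overall score = sum of substitution scores + gap penalties.

    The first position of a gap costs gap_open, each following position
    of the same gap costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_in_a = gap_in_b = False  # is a gap currently open in row a / row b?
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap:
            total += gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif b == gap:
            total += gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            total += substitution_score(a, b)
            gap_in_a = gap_in_b = False
    return total

one_long_gap = alignment_score("ACDEFGHIK", "ACD---HIK")
three_short_gaps = alignment_score("ACDEFGHIK", "A-D-F-HIK")
print(one_long_gap, three_short_gaps)
```

With these values a single three-residue gap scores better than three separate one-residue gaps, even though both alignments contain six identical pairs and three gap positions.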

Multiple Sequence Alignments

Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes:

- to characterize protein families and identify shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homologies to several sequences);

- to determine the consensus sequence of several aligned sequences;

- to help predict the secondary and tertiary structures of new sequences;

- as a preliminary step in molecular evolution analysis, using phylogenetic methods to construct phylogenetic trees.
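One of the uses listed above, determining a consensus sequence, can be sketched in a few lines. This is a simple majority-vote version under stated assumptions: the alignment is given as equal-length strings, gaps are ignored when counting, and ties go to the residue seen first in the column; the example sequences are hypothetical.

```python
from collections import Counter

def consensus(alignment, gap="-"):
    """Majority-vote consensus of a multiple sequence alignment."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must contain rows of equal length")
    result = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:
            result.append(gap)  # column made only of gaps
        else:
            result.append(counts.most_common(1)[0][0])
    return "".join(result)

msa = ["MKTAY", "MKSAY", "MRTAY", "MKT-Y"]
print(consensus(msa))
```

Columns where the vote is unanimous point to strongly conserved positions, which is the kind of information used to locate structurally conserved regions.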

Backbone generation

Uses known structurally conserved regions to generate coordinates for the unknown

For SCRs - copy coordinates from known structures

For variable regions (VR) - copy from the known structure if the residue types are similar; otherwise, use databases of fragmented loop sequences.

Template-based fragment assembly

a) Find structurally conserved regions

b) Build model core
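The coordinate-copying step can be sketched as below. This is a simplified illustration, assuming one coordinate per residue (for example the Cα atom) rather than the full backbone, and a target/template alignment given as two gapped strings; the function name and the example data are hypothetical.

```python
def copy_backbone(aligned_target, aligned_template, template_coords, gap="-"):
    """Transfer template coordinates to aligned target residues.

    Returns a list of (target residue, coordinate or None). None marks target
    residues with no template equivalent (insertions left for loop modeling).
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    if sum(1 for r in aligned_template if r != gap) != len(template_coords):
        raise ValueError("one coordinate is needed per template residue")
    model = []
    template_index = 0  # position in template_coords
    for target_res, template_res in zip(aligned_target, aligned_template):
        coord = None
        if template_res != gap:
            coord = template_coords[template_index]
            template_index += 1
        if target_res != gap:
            model.append((target_res, coord))
        # template residues aligned to a target gap are simply skipped
    return model

template_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
                   (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
model = copy_backbone("MK-TAY", "MKQT-Y", template_coords)
print(model)
```

In this example the target residue A has no template equivalent, so its coordinate is left unset and would be built in the loop modeling step.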

Loop modeling

1. Database search for segments from known protein structures fitting fixed end-points

2. Molecular mechanics/molecular dynamics

3. Combination of 1+2

Ab initio rebuilding (e.g., Monte Carlo, MD, etc.) to build missing loops
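The database-search approach can be illustrated with a deliberately reduced sketch: fragments are kept when they have the right length and when the distance between their end residues matches the distance between the two fixed anchor points within a tolerance. The fragment library, its names and the 1.0 Å tolerance are hypothetical; real methods compare the full anchor geometry after superposition and also check for clashes.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def find_loop_candidates(anchor_start, anchor_end, loop_length, fragments,
                         tolerance=1.0):
    """Return (deviation, name) for fragments that fit the fixed end-points.

    Each fragment lists one coordinate per residue and includes the two
    anchor residues, so it must contain loop_length + 2 coordinates.
    """
    target_span = distance(anchor_start, anchor_end)
    hits = []
    for name, coords in fragments.items():
        if len(coords) != loop_length + 2:
            continue  # wrong number of residues for this loop
        deviation = abs(distance(coords[0], coords[-1]) - target_span)
        if deviation <= tolerance:
            hits.append((deviation, name))
    return sorted(hits)

# Hypothetical fragment library (coordinates in Angstrom)
fragments = {
    "fragA": [(0.0, 0.0, 0.0), (2.0, 3.0, 0.0), (4.0, 3.0, 0.0), (6.5, 0.0, 0.0)],
    "fragB": [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    "fragC": [(0.0, 0.0, 0.0), (3.0, 2.0, 0.0), (6.0, 0.0, 0.0)],
}
candidates = find_loop_candidates((0.0, 0.0, 0.0), (6.0, 0.0, 0.0), 2, fragments)
print(candidates)
```

Candidates found this way would normally be refined with molecular mechanics or molecular dynamics, which corresponds to option 3 in the list above.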

Side chain modeling

1. Use of rotamer libraries (backbone dependent)

2. Molecular mechanics optimization

- Dead-end elimination (heuristic)

- Monte Carlo (heuristic)

- Branch & Bound (exact)

3. Mean-field methods

Model optimization

Molecular mechanics methods

Model validation/evaluation

Model should be evaluated for:

- correctness of the overall fold/structure

- errors over localized regions

- stereochemical parameters: bond lengths, angles, etc.

Some software for model verification:

- Procheck http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

- WHAT IF http://swift.cmbi.kun.nl/whatif

- PROSA II http://www.came.sbg.ac.at/Services/prosa.html

- Profile 3D & Verify 3D http://shannon.mbi.ucla.edu/DOE/Services


The Ramachandran plot
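A Ramachandran plot shows the backbone dihedral angles phi and psi of each residue, where phi is defined by the atoms C(i-1), N(i), Cα(i), C(i) and psi by N(i), Cα(i), C(i), N(i+1). The sketch below computes a dihedral angle from four points; the function names and test points are illustrative assumptions, and reading atoms from a coordinate file is not shown.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180 to 180) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))  # unit vector along the central bond
    # Components of the outer bonds perpendicular to the central bond
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy points: cis (0 degrees), trans (180 degrees) and a quarter turn
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Applying `dihedral` to the phi and psi atom quadruplets of every residue yields the points of the plot; residues falling outside the commonly populated regions are candidates for local errors in the model.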


Profile 3D & Verify 3D:

- verify newly solved structures or homology models

- find structures/folds compatible with a given sequence

- find sequences compatible with a known structure/fold from a database of sequences
