Molecular Biology PhD Programme 2011

Molecular Biology Course 2011

Macromolecular Structure Determination

Part IV: Validation

Tim Grüne

University of Göttingen

Dept. of Structural Chemistry

http://shelx.uni-ac.gwdg.de

[email protected]

Tim Grüne Macromolecular Structure Determination 1/39


Why do we Need Validation?

“The mistake so clearly illustrates [...] that those lovely colored ribbons festooning the covers and pages of journals are just models, not data [...]”

C. Miller, Science 2007, 315, p. 459, about the “Great Pentaretraction”

“This showed that the structures of MsbA and EmrE were incorrect [...]. In this case, unfortunately it appears that the incorrect structures have had serious adverse effects on the development of the field and possibly also on the distribution of grant money.”

A. M. Davis, S. A. St-Gallay, G. J. Kleywegt, Drug Discovery Today (2008), Vol. 13, pp. 831–841


Creativity in Crystallography

[Figure: image panels with labels Marker, S/N, P, F/T, TE, V, w0, w50, “post”, “ind”, “fractions 1−6” and “fractions 23−30”, at 20°C and 37°C]


Crystallography is Seductive

• Crystallography has always been computer based.
• Crystallography programs are well advanced and easy to use.
• Atoms have no colour, so there is no restriction to the crystallographer’s creativity and fantasy.
• It is easy to play around and “reassemble” various structures.
• Pictures easily stay in mind, and sometimes it is overlooked that the picture displays pure imagination.
• It is becoming more and more difficult to publish a structure itself: there has to be “a story”.


Caveat: Modelling Models

The structure of the TATA-box binding protein (TBP or TFIIDτ) was published in 1992 (Nikolov et al., Nature 360, pp. 40–46). The shape of the molecule suggested that the TATA box sits straight in the groove of the protein.

The structure of the complex, published a year later by Kim et al. (Nature 365, pp. 520–527), revealed that the DNA was actually heavily bent.


Low Resolution Maps are Difficult to Interpret

1989: This (Cα-only) model was published in 1989 (PDB entry 1phy).

1995: The correct version was published six years later (PDB entry 2phy).

Kleywegt, Acta Cryst. (2000), D56


Validation is still a Current Issue

The examples shown on the previous slides are relatively old, and such big mistakes are relatively unlikely to occur nowadays:

• Validation tools have advanced.
• Data (measured amplitudes) must now be deposited together with the coordinates.

Nevertheless, validating a structure from the PDB is still very important:

• Programs may contain bugs (cf. “The Great Pentaretraction” from 2007).
• Ligand design (one of the major uses of the PDB) depends on local interactions between ligand and protein, and at this atomic level a model is still the interpretation of the crystallographer and not necessarily unique.


Validation: Who and When

Crystallographers must validate to ensure they deposit a correct structure.

Users (including non-crystallographers) should check the quality of the deposited model before drawing conclusions.

Understanding the quality of a structure becomes particularly important for non-crystallographers involved in ligand design.


Misconceptions in Crystallography

The main misconceptions about crystal structures:

1. Correct structure.
   • Correct amino acid sequence
   • Complete model (including ligands & waters)
   • Correct and accurate coordinates

2. in vivo significance. Crystallisation conditions can be quite unnatural, e.g. crystallisation kits vary the pH between 4 and 9. While this probably does not affect the overall structure, it might well affect the interaction between ligand and protein (protonation state, ...).


Precision vs. Accuracy

At 3 Å resolution, the atom positions are known within about 0.5 Å.

If this were a structure determined at 3 Å, the atoms would be somewhere within each sphere.

The PDB-file has a precision (number of digits of the coordinates) of 0.0005 Å.

ATOM ...0.334 23.441 7.767

However, this is only because we cannot tell the program, “the atom is approximately there”. The computer program works with one number, not a range.

The accuracy (reliability of the numbers) of the PDB-file is resolution-dependent and much worse.


Precision vs. Accuracy

Accurate, not precise
• mean value of all measurements is close to the real value
• individual measurements vary a lot (large “standard deviation”)

Precise, not accurate
• measurements are close to each other (small “standard deviation”)
• mean value is far from the real value
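The distinction can be illustrated numerically. The following Python sketch uses two made-up series of measurements of a quantity whose true value is assumed to be 10.0; the numbers and the names `TRUE_VALUE` and `summarise` are purely illustrative and not taken from any experiment.

```python
from statistics import mean, stdev

TRUE_VALUE = 10.0  # assumed "real" value, for illustration only

# Hypothetical measurement series (made-up numbers)
accurate_not_precise = [8.0, 12.0, 9.0, 11.0, 10.0]
precise_not_accurate = [12.9, 13.0, 13.1, 13.0, 13.0]


def summarise(values):
    """Return (offset of the mean from the true value, standard deviation)."""
    return mean(values) - TRUE_VALUE, stdev(values)


for name, series in [("accurate, not precise", accurate_not_precise),
                     ("precise, not accurate", precise_not_accurate)]:
    offset, spread = summarise(series)
    print(f"{name}: mean offset = {offset:+.2f}, standard deviation = {spread:.2f}")
```

The first series has a mean on target but a large spread; the second has a small spread around the wrong value, which is the situation of a PDB file with three decimals at low resolution.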


Means of Validation

There are two types of validation:

1. Conformance of the model with what we expect, e.g. bond angle deviation, bond distance deviation, R, Rfree; for reflections: completeness, I/σI, Rint

2. Validation of the correctness of the model by information/knowledge that was not used during the construction process of the model.


The R-value

The term “R-value” is the “Mr. Smith” of statistics: the letter “R” is used with a large number of different meanings, and one usually has to know from the context which one is meant. Even within crystallography there are several “R-values”.

The general meaning of an R-value is, however, always the same: it describes the discrepancy between measured data and calculated or predicted data, i.e. it tests our hypothesis (the model) against the experiment, based on the theory.


Data-Conformance: The Rwork

The R-value for refinement is sometimes called Rwork. It is calculated by all refinement programs∗ (phenix, refmac5, shelxl, ...) as

$$R_{\mathrm{work}} = \frac{\sum_{hkl} w \,\bigl|\, |F(hkl)| - |F_{\mathrm{calc}}(hkl)| \,\bigr|}{\sum_{hkl} w \, |F(hkl)|}$$

At “normal” resolution ranges (1.8–3 Å, say) the Rwork should be around 10% of the resolution, e.g. a 2.3 Å data set should have a final Rwork around 0.23 = 23%. Irrespective of the resolution, an Rwork-value worse than 30% should raise suspicion.

For more precise estimates, see e.g. Tickle et al., Acta Cryst. 1998, D54, p. 547

NB: For an engineer, such an R-value (for any type of measurement) would be horrendously high. But that is the fate of protein crystallography: the data are very poor and we have to make the best of it.

∗The programs differ, though, in how they calculate Fcalc
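As a minimal sketch of the R-value calculation, assuming the usual form with the absolute difference of the amplitudes, unit weights by default, and an Fcalc that has already been put on the same scale as the measured amplitudes (real refinement programs handle scaling and weighting in their own ways), the sum can be written in Python as follows. The function name `r_factor` and the amplitudes are made up for illustration.

```python
def r_factor(f_obs, f_calc, weights=None):
    """R = sum(w * | |F_obs| - |F_calc| |) / sum(w * |F_obs|)."""
    if weights is None:
        weights = [1.0] * len(f_obs)  # assumption: unit weights
    if not (len(f_obs) == len(f_calc) == len(weights)):
        raise ValueError("all inputs must have the same length")
    numerator = sum(w * abs(abs(fo) - abs(fc))
                    for w, fo, fc in zip(weights, f_obs, f_calc))
    denominator = sum(w * abs(fo) for w, fo in zip(weights, f_obs))
    return numerator / denominator


# Made-up amplitudes for three reflections (illustration only)
f_obs = [10.0, 20.0, 30.0]
f_calc = [9.0, 22.0, 30.0]
print(f"R = {r_factor(f_obs, f_calc):.3f}")
```

The same function gives Rfree when it is fed only the reflections of the test set.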


Limits of Rwork

The rationale behind Rwork seems reasonable: we want to create a model that comes as close as possible to the data, so we want to reduce the difference, i.e. the Rwork.

However, it is possible to reduce the Rwork arbitrarily, e.g. by filling up the difference density with water molecules. This is called overfitting and probably happens to some extent in all structures which are not at atomic resolution (1.2 Å or better).

It is also possible to fit a protein completely the wrong way round and still arrive at the same Rwork (Kleywegt, Jones, Structure 1995, p. 535).


Fooling Rwork

A drastic example of how to create a model that fits the data well:

• Take a 1.2 Å data set.
• Fill the unit cell with a grid of atoms → no chemical meaning at all.
• Refine the atoms without constraints or restraints (with shelxl).
• The result still makes no chemical sense at all.


Fooling Rwork

Resulting Rwork at different resolutions:

Resolution   Rwork
3.0 Å        7.2%
2.5 Å        17.8%
2.0 Å        27.7%
1.5 Å        33.7%
1.2 Å        38.6%

Observations:
• The Rwork drops as the resolution gets worse: the lower the resolution, the easier it is to fool the Rwork.
• At 3 Å resolution the Rwork is suspiciously low.
• Even at 1.2 Å resolution, the Rwork is not outrageously high, but could simply indicate an incomplete model.


The Guard: Rfree

A very good way of validating a structure is to give the data to two (or more) crystallographers and have them build a model independently.

When the resulting structures agree, the structure is probably correct (at least some of the subjectivity of the structure would be removed).

This approach is rather impracticable. One calculates the Rfree instead.

The concept of Rfree has been known in statistics for some time and was introduced to crystallography in 1992 by Axel Brünger (note: the structure 1phy in the above example was published before then).


The Guard: Rfree

Rwork suffers from model bias.

Therefore, a test set of 500–1000 randomly selected reflections is generated before refinement and model building. This test set is put aside and never used for model building or refinement.

Rfree is calculated similarly to Rwork, but because the test set is not used in refinement, it is not as biased.
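The selection of the test set can be sketched as follows. This is only an illustration under simple assumptions: reflections are identified by their list index, the selection is purely random, and the function name `split_test_set` is hypothetical. Data-processing and refinement programs provide their own tools for flagging the test set.

```python
import random


def split_test_set(n_reflections, n_test=1000, seed=0):
    """Randomly flag n_test reflection indices as the test (free) set.

    Returns (work_indices, test_indices); the two sets are disjoint.
    """
    if n_test >= n_reflections:
        raise ValueError("test set must be smaller than the data set")
    rng = random.Random(seed)  # fixed seed: the selection must never change
    test = set(rng.sample(range(n_reflections), n_test))
    work = [i for i in range(n_reflections) if i not in test]
    return work, sorted(test)


work, test = split_test_set(n_reflections=20000, n_test=1000)
print(len(work), len(test))
```

The important design point is that the split is made once, before any model building, and is then kept fixed: only the work set enters refinement, and only the test set enters Rfree.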


The Guard: Rfree

The value of Rfree should be roughly “3-5%” worse than Rwork (again, see Tickle et al., Acta Cryst. 1998, D54, p. 547).

Example: One of the structures of the “Great Pentaretraction” reported an Rwork = 38% and Rfree = 45%. Even though this is only a 4.5 Å structure, this is quite high. The correctly refined structure reported Rwork = 28% and Rfree = 31% (P. D. Jeffrey, Acta Cryst. 2009, D65).


Rfree of our Thought Experiment

Resolution   Rwork   Rfree
3.0 Å        7.2%    55.3%
2.5 Å        17.8%   52.6%
2.0 Å        27.7%   56.6%
1.5 Å        33.7%   54.5%
1.2 Å        38.6%   54.4%

The Rfree is equally poor at all resolutions (an R-value of above 50% generally indicates a random model).
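The rules of thumb from these slides can be applied to the table mechanically. The sketch below takes the values from the table above and checks them against three criteria mentioned in the lecture (Rfree roughly 3-5% worse than Rwork, Rfree above 50%, Rwork worse than 30%). Turning these guidelines into hard cut-offs is a simplification made for this example, and the function name `assess` is hypothetical.

```python
# (resolution in Å, Rwork in %, Rfree in %) as listed in the table above
results = [
    (3.0, 7.2, 55.3),
    (2.5, 17.8, 52.6),
    (2.0, 27.7, 56.6),
    (1.5, 33.7, 54.5),
    (1.2, 38.6, 54.4),
]


def assess(r_work, r_free):
    """Apply the rules of thumb from the slides; return a list of warnings."""
    warnings = []
    gap = r_free - r_work
    if gap > 5.0:
        warnings.append(f"Rfree - Rwork = {gap:.1f}% (expected roughly 3-5%)")
    if r_free > 50.0:
        warnings.append("Rfree above 50%: comparable to a random model")
    if r_work > 30.0:
        warnings.append("Rwork worse than 30%")
    return warnings


for resolution, r_work, r_free in results:
    print(f"{resolution:.1f} Å:", "; ".join(assess(r_work, r_free)))
```

Every row of the thought experiment is flagged, including the 3.0 Å case whose Rwork alone looks excellent.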


Global and Local Validation

With Rwork and Rfree we have one number each that describes the quality of the whole model. They are therefore called global quality indicators or global figures of merit.

More detailed insight into the quality of a model is provided by local quality indicators like

1. Real space correlation coefficient
2. Ramachandran plot
3. Kleywegt plot


Real Space Correlation Coefficient

The Real Space Correlation Coefficient (RSCC) compares the model to the electron density on a per-residue basis (instead of calculated data to measured data as the R-values do).

[Figure: RSCC (black) and B-factor (blue) for the BotLC/B protease in complex with synaptobrevin-II (2.0 Å).] For the ligand, the B-factor is very high and the RSCC very low. Maybe the authors were too optimistic when fitting the ligand.
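The RSCC is commonly defined as the linear correlation coefficient between the observed map and the map calculated from the model, evaluated over the grid points around one residue. As a minimal sketch, assuming the density values at those grid points have already been extracted into two plain lists (the function name `correlation` and all numbers are made up for illustration):

```python
from math import sqrt


def correlation(rho_obs, rho_calc):
    """Linear (Pearson) correlation between two equally long lists of density values."""
    n = len(rho_obs)
    if n != len(rho_calc) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    return cov / sqrt(var_o * var_c)


# Made-up density values at the grid points around one residue
rho_calc = [0.1, 0.8, 1.5, 0.9, 0.2]       # density calculated from the model
well_fitted = [0.2, 0.9, 1.4, 1.0, 0.1]    # observed density that matches the model
poorly_fitted = [0.6, 0.2, 0.3, 0.7, 0.5]  # observed density that does not

print(f"well fitted:   {correlation(well_fitted, rho_calc):.2f}")
print(f"poorly fitted: {correlation(poorly_fitted, rho_calc):.2f}")
```

A value close to 1 means the model explains the local density; a low value, as for the ligand in the example, means it does not.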


Dihedral Angles: the Ramachandran Plot

The Ramachandran plot is probably the most famous validation tool. It is based on the two dihedral angles φ and ψ of the peptide backbone.

φ is the angle between the two planes defined by $C_{i-1} - N_i - C^{\alpha}_i$ and $N_i - C^{\alpha}_i - C_i$. ψ is the angle between the two planes defined by $N_i - C^{\alpha}_i - C_i$ and $C^{\alpha}_i - C_i - N_{i+1}$.
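A dihedral angle of this kind can be computed from the coordinates of the four atoms involved. The following is a minimal sketch using a standard atan2-based formula on plain coordinate tuples; the helper names are chosen for this example, and the test coordinates are made up rather than taken from a real structure.

```python
from math import atan2, degrees, sqrt


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees (-180..180) defined by four points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return degrees(atan2(y, x))


# Made-up points: the first three are fixed, the fourth is rotated about the z axis
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(dihedral(p0, p1, p2, (1.0, 0.0, 1.0)))   # cis arrangement
print(dihedral(p0, p1, p2, (-1.0, 0.0, 1.0)))  # trans arrangement
```

With backbone coordinates, φ of residue i would be `dihedral(C_prev, N, CA, C)` and ψ would be `dihedral(N, CA, C, N_next)`, where the argument names stand for the atom positions listed in the definitions above.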


The Ramachandran Plot

The Ramachandran plot shows the φ vs. ψ angles for a structure and the most probable regions derived from the 500 best determined protein structures.

[Figure: interactive Ramachandran window of the model building program Coot, with the β-strand and α-helix regions labelled.] Everything outside the shaded region is an outlier and deserves a closer look. Outliers can be justified in well-ordered regions if, e.g., the residue makes a special contact to another residue.


The Kleywegt Plot

The Kleywegt plot is derived from the Ramachandran plot in the presence of homo-oligomers in the crystal.

Proteins crystallise from solution. Therefore, all molecules in the crystal are similar, and their Ramachandran plots should be similar.

The Kleywegt plot compares corresponding dihedral angles in the different molecules and plots large deviations.

Kleywegt, Acta Cryst. (2000), D56
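The idea behind the comparison can be sketched in a few lines: for every residue present in two copies of the molecule, take the difference in φ and ψ (respecting the 360° periodicity) and report residues where the two copies disagree strongly. The data layout, the function names and the threshold of 30° are assumptions made for this illustration, not a published cut-off or the way any particular program implements the plot.

```python
def angle_difference(a, b):
    """Smallest signed difference a - b between two angles in degrees (-180..180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def ncs_deviations(chain_a, chain_b, threshold=30.0):
    """Return residues whose (phi, psi) differ by more than `threshold` degrees.

    chain_a, chain_b: dicts mapping residue number -> (phi, psi) in degrees.
    The threshold is an arbitrary illustrative value.
    """
    flagged = []
    for resno in sorted(chain_a.keys() & chain_b.keys()):
        d_phi = angle_difference(chain_a[resno][0], chain_b[resno][0])
        d_psi = angle_difference(chain_a[resno][1], chain_b[resno][1])
        distance = (d_phi ** 2 + d_psi ** 2) ** 0.5
        if distance > threshold:
            flagged.append((resno, round(distance, 1)))
    return flagged


# Made-up dihedral angles for two copies of the same chain
chain_a = {10: (-60.0, -45.0), 11: (-120.0, 130.0), 12: (175.0, 170.0)}
chain_b = {10: (-63.0, -42.0), 11: (-70.0, 150.0), 12: (-178.0, 172.0)}
print(ncs_deviations(chain_a, chain_b))
```

Residue 12 shows why the periodic difference matters: 175° and −178° are only 7° apart, so it is not flagged.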


Validation Programs

There are quite a few programs for the validation of macromolecular structures, e.g.

• WhatCheck (swift.cmbi.ru.nl/gv/whatcheck)
• SFcheck (part of CCP4)
• Uppsala EDS (eds.bms.uu.se/eds)
• MolProbity (molprobity.biochem.duke.edu)

The PDB web service www.pdbe.org has a direct link to the EDS server, which offers several visualisation tools for validation.


Molprobity

The Molprobity server provides a convenient way to check the geometry of a macromolecular structure. It works for both proteins and nucleic acids.

Molprobity checks for

• (too) close contacts of atoms, e.g. between ligand and protein
• possible flips for Asn, Gln, His
• Ramachandran plot
• ... and a few more things


Molprobity - Flips

At anything less than atomic resolution one cannot distinguish between N, C and O. For His, Gln and Asn, this means we cannot tell their orientation except by investigating the chemical environment (network of hydrogen bonding).

[Figure: side chains shown in their two alternative (flipped) orientations, ←→]


Molprobity Example Output - Flips


Molprobity Example Output - Rotamers, Clashes, . . .


Molprobity Example Output - Score


Some Advice

When you download a structure from the PDB,

1. run it through Molprobity and check the summary statistics

2. have a look at the plots from the EDS (via www.pdbe.org)


References

1. A. M. Davis, S. A. St-Gallay, G. J. Kleywegt, Limitations and lessons in the use of X-ray structural information in drug design, Drug Discovery Today (2008), Vol. 13, pp. 831–841

2. A. M. Davis, S. J. Teague, G. J. Kleywegt, Application and Limitations of X-ray Crystallographic Data in Structure-Based Ligand and Drug Design, Angewandte Chemie (2003), 42, pp. 2718–2736
