
Data quality and model parameterisation

Martyn Winn

CCP4, Daresbury Laboratory, U.K.

Prague, April 2009

Model Parameters

E.g. asymmetric unit contains n copies of a protein of N atoms

Coordinates

3 x N x n xyz co-ordinates

or ... 6 x M x n if each protein modelled as M rigid bodies

or ... ~ 0.5 x N x n torsion angles

Displacement parameters

1 x N x n B factors

or ... 6 x N x n anisotropic U factors

or ... 20 x M x n if each protein has M TLS groups
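The counts above can be tabulated with a few lines of Python. This is only a sketch of the arithmetic on the slide: the function name and the example values (N = 1000 atoms, n = 2 copies, M = 3 groups) are hypothetical, and the torsion count uses the slide's rough factor of 0.5.

```python
def count_parameters(N, n, M=None):
    """Parameter counts for n copies of a protein of N atoms.

    M is the number of rigid bodies / TLS groups per protein (optional).
    """
    counts = {
        "xyz coordinates": 3 * N * n,
        "torsion angles (approx.)": int(0.5 * N * n),
        "isotropic B factors": 1 * N * n,
        "anisotropic U factors": 6 * N * n,
    }
    if M is not None:
        counts["rigid-body parameters"] = 6 * M * n
        counts["TLS parameters"] = 20 * M * n
    return counts


# Hypothetical example: 2 copies of a 1000-atom protein, 3 groups each.
counts = count_parameters(N=1000, n=2, M=3)
for name, value in counts.items():
    print(f"{name:>26}: {value}")
```

The point of the comparison is the spread: TLS or rigid-body descriptions need orders of magnitude fewer parameters than individual atomic ones.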

Model Parameters (2)

Occupancies

Usually fixed at 1.0 for protein

... except for alternative conformations (usually sum to 1.0)

Water/ligand occupancies

Scaling parameters etc.

k_overall, B_overall, k_Babinet, B_Babinet, k_solvent, B_solvent

twin fraction

Ultra-high resolution

Multipolar expansion coefficients

Interatomic scatterers

Reflection Data

Number of independent reflections, dependent on:

– spacegroup
– resolution
– completeness

For each reflection, one has at least F/sigF.

Might also have reliable experimental phases φ or F(+)/F(-)

Data / parameter ratio

Refinement means minimise -log(likelihood):

Nonlinear function of model parameters.

Global minimum and many local minima.

Need good data/parameter ratio.

Strong dependence on resolution.

No strong dependence on protein size.
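A rough count shows why resolution matters and protein size does not. The sketch below assumes the standard lattice-point estimate (reciprocal lattice points inside a sphere of radius 1/d_min, halved for Friedel pairs, divided by the number of rotational symmetry operators) and an illustrative, assumed volume per atom of 30 Å³ in the asymmetric unit. The function names and that constant are hypothetical.

```python
import math


def approx_unique_reflections(cell_volume, d_min, n_sym_ops=1):
    """Approximate number of unique reflections to resolution d_min (Angstrom)."""
    return (2.0 * math.pi / 3.0) * cell_volume / (n_sym_ops * d_min ** 3)


def approx_data_parameter_ratio(cell_volume, d_min, n_sym_ops=1,
                                volume_per_atom=30.0, params_per_atom=4):
    """Reflections per parameter, assuming x, y, z, B for every atom.

    volume_per_atom is an assumed, illustrative constant (protein plus solvent).
    """
    n_atoms = cell_volume / (n_sym_ops * volume_per_atom)
    n_refl = approx_unique_reflections(cell_volume, d_min, n_sym_ops)
    return n_refl / (params_per_atom * n_atoms)


# The cell volume cancels, so the ratio depends only on resolution (as 1/d^3).
small = approx_data_parameter_ratio(cell_volume=1.0e5, d_min=2.0)
large = approx_data_parameter_ratio(cell_volume=1.0e6, d_min=2.0)
high_res = approx_data_parameter_ratio(cell_volume=1.0e5, d_min=1.0)
print(f"ratio at 2.0 A, small cell: {small:.2f}")
print(f"ratio at 2.0 A, large cell: {large:.2f}")
print(f"ratio at 1.0 A, small cell: {high_res:.2f}")
```

Both the number of reflections and the number of atoms scale with cell volume, so the volume cancels; halving d_min multiplies the ratio by eight.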

Generally not enough data ....

Reduce number of parameters - constraints

Add data - restraints

Restraints

Expected geometry of the protein treated as additional data

– bond lengths
– bond angles
– torsions / dihedral (but not φ,ψ)
– chirality (e.g. chiral volume)
– planarity
– non-bonded (VdW, H-bonds, etc.)
– B factors (between bonded atoms)
– U factor restraints (similarity, sphericity, rigid bond)
– NCS (position or conformation)

Data / parameter ratio

Estimate as: (no. reflections + no. restraints) / no. parameters

Not really true ... assumes all data independent

– bond length, bond angle and planar restraints in a ring system
– bond length restraint vs. high resolution diffraction data

Restraints may be more necessary in poorly determined parts of the structure.

Restraints have associated weights:

– Overall w.r.t. reflection data
– Individual weights e.g. WB

calmodulin at 1.8 Å (1clm)

1132 protein atoms, 4 Ca atoms, 71 waters
4828 x, y, z, B factors

No. of unique reflections 10610 (deposited 1993, no test set!)

data/parameter = 2.2

Bond restraints: 1144
Angle restraints: 1536
Torsion restraints: 429
Chiral restraints: 170
Planar restraints: 874
Non-bonded restraints: 1391
B factor restraints: 2680
(no NCS)

total restraints = 8224
data/parameter = 3.9
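The 1clm figures can be checked directly. The sketch below simply repeats the slide's arithmetic (four parameters per atom, restraints counted as extra observations); the variable names are illustrative.

```python
# Atom counts for 1clm as listed above.
n_atoms = 1132 + 4 + 71          # protein + Ca + waters
n_parameters = 4 * n_atoms       # x, y, z, B for every atom

n_reflections = 10610            # unique reflections, no test set

restraints = {
    "bond": 1144,
    "angle": 1536,
    "torsion": 429,
    "chiral": 170,
    "planar": 874,
    "non-bonded": 1391,
    "B factor": 2680,
}
n_restraints = sum(restraints.values())

ratio_data_only = n_reflections / n_parameters
ratio_with_restraints = (n_reflections + n_restraints) / n_parameters

print(f"parameters:                    {n_parameters}")
print(f"restraints:                    {n_restraints}")
print(f"data/parameter:                {ratio_data_only:.1f}")
print(f"(data + restraints)/parameter: {ratio_with_restraints:.1f}")
```

Counting restraints as observations nearly doubles the nominal ratio, which is why the independence caveat matters.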

calmodulin at 1.0 Å (1exr)

1467 protein atoms (inc. alt. conf.), 5 Ca atoms, 178 waters
4950 x, y, z
+ 9900 anisotropic U factors
+ 316 occupancy parameters
total parameter count = 15166

No. of unique reflections 77150
No. in test set 7782 (10%)
Data for refinement 69368

No. of restraints (PDB header) 22732



GCPII at 1.75 Å (3d7g)

5724 protein atoms (inc. alt. conf.), 211 ligand atoms, 617 waters
26046 x, y, z, B factors
+ 162 anisotropic U factors (S, Zn, Ca, Cl only)
+ 225 occupancy parameters
total parameter count = 26433

No. of unique reflections 105077
No. in test set 1550 (1.5%)
Data for refinement 103527

No. of restraints (PDB header) 44652



Thioredoxin reductase at 3.0 Å (1h6v)

22514 protein atoms, 552 ligand atoms, 9 waters
92300 x, y, z, residual B factors
6 TLS groups
120 TLS parameters

No. of unique reflections 69328
No. in test set 3441 (5%)
Data for refinement 65887

No. of restraints 209378 (inc. 44484 NCS restraints)
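The same estimate can be applied to the three later examples using the numbers listed on their slides. This is a sketch only: the dictionary layout is hypothetical, and the 1h6v parameter total (92300 + 120 TLS parameters) is my own sum, since the slide does not state one.

```python
# Counts copied from the slides: parameters, working reflections, restraints.
structures = {
    "1exr (1.0 A)": {"parameters": 15166, "reflections": 69368, "restraints": 22732},
    "3d7g (1.75 A)": {"parameters": 26433, "reflections": 103527, "restraints": 44652},
    # Assumption: total = atomic parameters + TLS parameters.
    "1h6v (3.0 A)": {"parameters": 92300 + 120, "reflections": 65887, "restraints": 209378},
}


def ratios(entry):
    """Return (data/parameter, (data + restraints)/parameter)."""
    data_only = entry["reflections"] / entry["parameters"]
    with_restraints = (entry["reflections"] + entry["restraints"]) / entry["parameters"]
    return data_only, with_restraints


results = {name: ratios(entry) for name, entry in structures.items()}
for name, (data_only, with_restraints) in results.items():
    print(f"{name:>14}: data/param = {data_only:.2f}, "
          f"with restraints = {with_restraints:.2f}")
```

On these numbers the 3.0 Å structure has fewer working reflections than parameters until restraints (including NCS) are counted, which illustrates the strong dependence on resolution.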



Avoiding overfitting: Rfree

What's wrong?:
• Can add any old parameters to improve R-factor, when low data/parameter ratio
• May not be physically correct – "overfitting"

Solution:
• Calculate R-factor on a set of reflections not used in refinement = "Rfree"
• If changes to model improve Rfree as well as R, then they are good.
• Note: Rfree is global number - useful for refinement strategies, not useful for assessing changes to a few atoms
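As a reminder of what is being compared, here is a minimal sketch of the conventional R-factor, R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|, evaluated separately over working and free reflections. The amplitudes are made-up toy values, Fcalc is assumed to be already on the scale of Fobs, and the function names are hypothetical.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor over one set of reflections."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def r_work_and_free(f_obs, f_calc, is_free):
    """Split reflections by free flag and return (R, Rfree)."""
    work = [(o, c) for o, c, free in zip(f_obs, f_calc, is_free) if not free]
    free = [(o, c) for o, c, free in zip(f_obs, f_calc, is_free) if free]
    r_work = r_factor([o for o, _ in work], [c for _, c in work])
    r_free = r_factor([o for o, _ in free], [c for _, c in free])
    return r_work, r_free


# Toy data (hypothetical amplitudes); the last two reflections are "free".
f_obs = [100.0, 80.0, 60.0, 40.0, 20.0]
f_calc = [90.0, 84.0, 63.0, 30.0, 22.0]
is_free = [False, False, False, True, True]

r_work, r_free = r_work_and_free(f_obs, f_calc, is_free)
print(f"R = {r_work:.3f}, Rfree = {r_free:.3f}")
```

A model change that lowers R but not Rfree has only improved the fit to reflections the refinement could see.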

Choosing your free reflections

• Usually a randomly chosen subset.
• Typically 5-10% (CCP4 default is 5%)
• If you have enough reflections, impose maximum number (2000 in phenix.refine)
• Free set also used in maximum likelihood to estimate σA parameters
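The selection rule in the bullets above (a random fraction, capped at a maximum count) can be sketched as follows. This is not the code used by CCP4 or phenix.refine; the function name, the seed handling and the defaults are illustrative, with the 5% and 2000 values taken from the slide.

```python
import random


def select_free_set(n_reflections, fraction=0.05, max_free=2000, seed=0):
    """Return a list of booleans, True where a reflection is in the free set."""
    n_free = min(int(fraction * n_reflections), max_free)
    rng = random.Random(seed)
    free_indices = set(rng.sample(range(n_reflections), n_free))
    return [i in free_indices for i in range(n_reflections)]


# 105077 reflections: 5% would be over 5000, so the cap of 2000 applies.
flags_large = select_free_set(105077)
# 10610 reflections: 5% is below the cap.
flags_small = select_free_set(10610)
print(sum(flags_large), sum(flags_small))
```

Fixing the seed keeps the selection reproducible, which matters because the same free set should be kept for the lifetime of a project.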

Rfree and NCS

• NCS operators map different regions of reciprocal asymmetric unit onto each other. Reflections in these regions are correlated.

[Figure: reciprocal space with working reflections and free reflections marked; gaps = free set]

Rfree and NCS

• Solution: choose free set from thin shells in reciprocal space

Pros:
NCS operators link regions of the same resolution, which should both be in a shell or both outside it

Cons:
Large number of shells → thin shells → most free reflections close to an edge and correlated to non-free reflections
Small number of shells → significant gaps in resolution range, poor determination of σA

SFTOOLS: RFREE 0.05 SHELL 0.001

3rd argument = width of shells in Å⁻¹

Also DATAMAN.
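The idea of a thin-shell free set can be sketched in a few lines. This is not the SFTOOLS or DATAMAN algorithm: it simply bins reflections by 1/d into shells of fixed width and marks whole shells as free with probability equal to the target fraction. The function name and the synthetic 1/d values are hypothetical.

```python
import random


def thin_shell_free_flags(inv_d, fraction=0.05, width=0.001, seed=0):
    """Flag reflections as free shell by shell.

    inv_d: list of 1/d values (in inverse Angstrom), one per reflection.
    Every reflection in the same resolution shell gets the same flag.
    """
    rng = random.Random(seed)
    s_min = min(inv_d)
    n_shells = int((max(inv_d) - s_min) / width) + 1
    shell_is_free = [rng.random() < fraction for _ in range(n_shells)]
    return [shell_is_free[int((s - s_min) / width)] for s in inv_d]


# Synthetic stand-in for a dataset: 5000 reflections between 20 A and 1.8 A.
data_rng = random.Random(1)
inv_d = [data_rng.uniform(1 / 20.0, 1 / 1.8) for _ in range(5000)]
flags = thin_shell_free_flags(inv_d, fraction=0.05, width=0.001)
print(f"{sum(flags)} of {len(flags)} reflections flagged free")
```

Because whole shells are selected, the fraction of free reflections is only approximately the requested value; wider shells reduce edge effects but leave larger gaps in resolution.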

[Figure: free-set shells in reciprocal space for two datasets]

1xmp (1.8 Å): width 0.01, 3 shells; width 0.0013, 20 shells (default)

XXX (3.8 Å): width 0.005, 3 shells; width 0.0005, 20 shells (default)

Rfree and NCS

• Can increase size of free set to mitigate edge effects
• Or use NCS-related free set islands
• Reflections also correlated to immediate neighbours in reciprocal space - can exclude these from working and free sets
  Fabiola, Korostelev & Chapman, Acta Cryst D62, 227 (2006)
• Rapidly run out of working reflections!

Be aware that correlations can artificially reduce your Rfree

Rfree and twinning

Twinning operator might relate e.g. reflection (1,2,3) to (2,1,-3)

These two reflections should both be in the working set or the free set.

1. Select free set in thin shells (as NCS)
2. Select free reflections in higher lattice symmetry
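The requirement that twin-related reflections share a flag can be sketched by assigning flags per twin pair rather than per reflection. The twin operator below is the slide's example, (h, k, l) to (k, h, -l); the function names are hypothetical, and reduction of indices to the asymmetric unit is deliberately left out, so the sketch assumes both twin mates appear in the list as given.

```python
import random


def twin_mate(hkl):
    """Example twin operator from the slide: (h, k, l) -> (k, h, -l)."""
    h, k, l = hkl
    return (k, h, -l)


def twin_consistent_free_flags(reflections, fraction=0.05, seed=0):
    """Assign one free flag per twin-related pair of reflections."""
    rng = random.Random(seed)
    pair_flag = {}
    flags = {}
    for hkl in reflections:
        # Canonical key for the pair; the operator is its own inverse.
        key = min(hkl, twin_mate(hkl))
        if key not in pair_flag:
            pair_flag[key] = rng.random() < fraction
        flags[hkl] = pair_flag[key]
    return flags


# Toy reflection list (hypothetical): all indices from -3 to 3.
reflections = [(h, k, l)
               for h in range(-3, 4)
               for k in range(-3, 4)
               for l in range(-3, 4)]
flags = twin_consistent_free_flags(reflections)
print(flags[(1, 2, 3)] == flags[(2, 1, -3)])
```

Selecting free reflections in the higher lattice symmetry achieves the same effect, since the twin operator is then a symmetry operator of the selection.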

Transferring free R sets

Use the same free set for:

– additional datasets for same protein
– datasets from isomorphous proteins (derivatives, complexes, etc.)
  (how isomorphous is not clear, but play safe ...)

Otherwise initial R & Rfree will be similar and low for the second structure - the starting model has already been refined against most of your free reflections

Further refinement may lead to divergence of R & Rfree, masking the bias. Harder to detect over-fitting. Although may eventually reset Rfree.

How:

Use "CAD" / "Merge MTZ files (CAD)" in CCP4.
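Conceptually, the transfer amounts to copying flags by Miller index and only assigning fresh flags to reflections the reference set never contained. The sketch below illustrates that logic in plain Python; it is not how CAD works internally, the function name is hypothetical, and it assumes both datasets use the same indexing convention and asymmetric unit.

```python
import random


def transfer_free_flags(reference_flags, new_reflections, fraction=0.05, seed=0):
    """Copy free flags from a reference dataset onto a new reflection list.

    reference_flags: dict mapping (h, k, l) -> bool (True = free).
    Reflections absent from the reference get a new random flag.
    """
    rng = random.Random(seed)
    transferred = {}
    for hkl in new_reflections:
        if hkl in reference_flags:
            transferred[hkl] = reference_flags[hkl]
        else:
            transferred[hkl] = rng.random() < fraction
    return transferred


# Toy example (hypothetical indices): the third reflection is new.
reference = {(1, 0, 0): True, (0, 1, 0): False}
new_data = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
new_flags = transfer_free_flags(reference, new_data)
print(new_flags)
```

Reflections shared with the reference keep their status, so a model refined against the first dataset has never seen the free reflections of the second.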

Useful resources

http://ccp4wiki.org/ - CCP4 Wiki

http://strucbio.biologie.uni-konstanz.de/ccp4wiki/ - CCP4 community wiki

Proceedings of Study Weekend 2004 (Acta Cryst D, Dec 2004)
