lab exercises 2017 - aarhus universitet

Applied Structural Biology

Lab manual 2017


Overview

Day 1 Crystallization
Setting up hanging and sitting drops
Sparse matrix, Footprint and grid screens

Day 2 Optimization and crystal handling
Scoring of crystallization experiments from day 1
Design of optimizing screens
Heavy atom soaking of proteinase K crystals
Cryo-protection and -cooling of proteinase K crystals

Day 3 Data collection and processing
Data collection, basic principles
Data integration using iMosflm
Data reduction and scaling using Scala

Day 4 Phasing
Introduction to the CCP4 program suite: Intro to Coot
SAD/MR

Day 5 Refinement and modelling
Model building in Coot
Refinement
Validation of the model

Day 6 Structure analysis and docking
Docking in EM/SAXS maps using the Situs package
Electrostatics, structural homology and conservation


Introduction

Crystallization

A macromolecule such as a protein, RNA or DNA can be crystallized like most other compounds. The starting point of a crystallization experiment is a concentrated, very pure solution of the macromolecule; a typical example would be a protein solution in the range 5-15 mg/ml. Crystallization of a protein and the formation of amorphous precipitate are two competing processes. Both can be induced by changing the conditions so that it becomes energetically unfavourable for the protein to stay in solution. By far the most commonly used way of doing so is to change the composition of the solution containing the protein, in practice often by adding a precipitant. The precipitant competes with the protein for the water in the solution. The less water surrounding the protein, the more favourable it is for the protein to enter either the crystalline or the amorphous phase. Of the two phases, the amorphous one is most often kinetically favoured, whereas the crystalline one is the thermodynamically most stable.

Figure: Amorphous precipitate <=> Soluble protein <=> Crystal. Rate constants: k1 (soluble protein to amorphous precipitate), k-1 (reverse), k2 (soluble protein to crystal), k-2 (reverse); k1 > k2.

When crystallization of a new protein is started, you can choose different approaches to obtain the first crystals. The simplest method is to test the basic parameters: pH and the concentration of the precipitant by applying a simple matrix screen. Below is shown an example of the principle of matrix screening.

pH     NaCl   NaCl   MPD   MPD   PEG 4K   PEG 4K
4.5    1      2      1     2     1        2
5.7    1      2      1     2     1        2
6.9    1      2      1     2     1        2
8.1    1      2      1     2     1        2

Example of a matrix screen using NaCl, PEG 4K, and MPD as precipitants. The numbers 1 and 2 refer to different concentrations of the precipitants.

By applying matrix screening, you can obtain a good impression of the solubility of the protein in question. Before constructing the matrix, the approximate precipitation point can be found by a precipitation test, which is a simple titration of the protein with a precipitant at a given pH.

Another method for screening conditions is the use of random experiments. The experiments are scattered in crystallization space, which is spanned by the variables whose influence on the crystallization of the protein you wish to test. This method differs from matrix screening in that it does not apply all possible combinations of the chosen variables. Instead, the variables are combined using a random number generator under the following restrictions:

1) All combinations of 2 variables will appear at least once
2) The experiments must cover the crystallization space as widely as possible.

This method is known as the incomplete factorial approach, or infac for short. The method may seem less simple, but it lets you test many variables with relatively few experiments and is therefore very efficient. A spreadsheet-based infac generator is available at our web site: http://www.bioxray.au.dk/mimer/Mimer/Features.html
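The difference between a complete matrix and an incomplete factorial design can be sketched in a few lines of Python. This is only an illustration, not the algorithm used by the spreadsheet generator mentioned above: it assumes a simple greedy strategy (draw random candidate experiments, keep the one that covers the most still-missing two-variable combinations) and enforces restriction 1 only; the spread demanded by restriction 2 is left to the random draws. The variable levels are those of the example table in this section; the function names are hypothetical.

```python
import itertools
import random

# Levels copied from the example table of variables in this section.
VARIABLES = {
    "pH": [4.5, 5.7, 6.9, 8.1],
    "precipitant": ["NaCl", "PEG", "MPD"],
    "precipitant_level": [1, 2, 3],
    "divalent_cation": ["0", "Ca2+", "Zn2+"],
    "protein_mg_per_ml": [40, 80],
}


def full_matrix(variables):
    """Every possible combination of levels (complete matrix screen)."""
    names = list(variables)
    return [dict(zip(names, combo))
            for combo in itertools.product(*variables.values())]


def pairs_of(experiment):
    """All two-variable level combinations realised by one experiment."""
    return set(itertools.combinations(list(experiment.items()), 2))


def incomplete_factorial(variables, seed=0, candidates=50):
    """Greedy random design in which every pair of levels occurs at least once."""
    rng = random.Random(seed)
    uncovered = set()
    for experiment in full_matrix(variables):
        uncovered |= pairs_of(experiment)
    design = []
    while uncovered:
        best, best_gain = None, 0
        for _ in range(candidates):
            trial = {name: rng.choice(levels)
                     for name, levels in variables.items()}
            gain = len(pairs_of(trial) & uncovered)
            if gain > best_gain:
                best, best_gain = trial, gain
        if best is not None:  # skip rounds where no candidate adds a new pair
            design.append(best)
            uncovered -= pairs_of(best)
    return design


if __name__ == "__main__":
    design = incomplete_factorial(VARIABLES)
    print(len(full_matrix(VARIABLES)), "experiments in the full matrix")
    print(len(design), "experiments in the incomplete factorial design")
```

The exact number of experiments in the reduced design depends on the random seed, but every pair of levels is guaranteed to appear at least once.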

pH     precipitant   precipitant level   divalent cations   [protein] mg/ml
4.5    NaCl          1                   0                  40
5.7    PEG           2                   Ca2+               80
6.9    MPD           3                   Zn2+
8.1

Example of the variables tested in an incomplete factorial approach experiment. Matrix screening these conditions would demand 4x3x3x3x2=216 experiments. Using infac, the number of experiments needed is much smaller.

A third and very different approach is to use one of the commercially available screens, e.g. those from Hampton Research (Hampton Screen I and II, each consisting of 50 different solutions), Emerald BioSystems (Wizard Screen 1 and 2) and Molecular Dimensions (Structure Screen 1 and 2). These solutions were selected because each of them had proven successful in earlier crystallization attempts. Applying them often gives a hint as to the direction in which to continue the experiments, and the screens are often used as a first ‘quick and dirty’ attempt when a new protein is tested.

The majority of crystallization experiments are performed by vapour diffusion. The experiment is started by mixing the protein with the reservoir solution, typically in a ratio of 1:1. The drop, typically 2-10 µl, then equilibrates by evaporation of water from the drop to the reservoir. This causes the precipitant concentration in the protein solution to rise slowly, hopefully approaching conditions favourable for crystal formation. The method demands only small volumes of protein.

Scoring the experiments: After setting up the experiments, the drop needs time to equilibrate. The time needed depends on the precipitant in use: approximately 24 hours when salts are used as precipitating agents, while it can be a week for trials with PEG as a precipitant. It further depends on the size of the drop, the temperature, the geometry of the setup etc. After equilibrium has been reached, the experiments are scored using a simple scale reflecting the amount of precipitate and/or crystals seen in the drops, as well as the type of precipitate/crystals. Based on these results, new screens are designed.
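The concentration changes during a vapour diffusion experiment can be estimated with a small calculation. The sketch below is an idealized model, not a description of a real drop: it assumes that the protein stock contains no precipitant, that only water leaves the drop, that the reservoir is so large that its concentration does not change, and that no protein is lost to precipitate or crystals. The function names are hypothetical.

```python
def drop_after_mixing(protein_mg_ml, reservoir_precipitant,
                      protein_ul=2.0, reservoir_ul=2.0):
    """Concentrations in the drop right after mixing protein and reservoir."""
    total_ul = protein_ul + reservoir_ul
    return {
        "volume_ul": total_ul,
        "protein_mg_ml": protein_mg_ml * protein_ul / total_ul,
        "precipitant": reservoir_precipitant * reservoir_ul / total_ul,
    }


def drop_at_equilibrium(drop, reservoir_precipitant):
    """Idealized end point: water evaporates until the precipitant
    concentration in the drop matches that of the reservoir."""
    factor = reservoir_precipitant / drop["precipitant"]
    return {
        "volume_ul": drop["volume_ul"] / factor,
        "protein_mg_ml": drop["protein_mg_ml"] * factor,
        "precipitant": reservoir_precipitant,
    }


if __name__ == "__main__":
    # Example: 40 mg/ml protein mixed 1:1 with a 1.6 M salt reservoir.
    start = drop_after_mixing(40.0, 1.6)
    end = drop_at_equilibrium(start, 1.6)
    print("after mixing:  ", start)
    print("at equilibrium:", end)
```

With a 1:1 ratio both protein and precipitant are diluted two-fold on mixing, and in this idealized model the drop shrinks to half its volume, bringing both back to their starting concentrations.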

Figure: Vapour diffusion by the sitting drop method and the hanging drop method (labels: protein solution, reservoir, H2O).

Seeding: The first crystals obtained may be of low quality because the crystallization conditions have not yet been optimized. There are many ways to obtain better crystals. The most obvious is to test more crystallization conditions in the vicinity of the experiments already performed. Alternatively, you can prepare experiments in which the seeds needed for crystal growth are supplied directly from already formed crystals, instead of waiting for spontaneous nucleation. This is called seeding. The simplest way is streak-seeding, where the first crystals are touched with a hair, and this hair is then pulled through the protein solution of the new experiments, thereby transferring a few seeds attached to the hair. Another method, microseeding, is performed by crushing a crystal in a small volume of liquid. This solution, containing small crystal fragments (the seeds), is then diluted in several steps, and a small volume from each solution of the dilution series is added to the protein solutions.

Heavy atom derivatives

For data collection and subsequent structure determination by e.g. the MIR or the MAD method, heavy atom derivatives of the crystals must be prepared. A heavy atom derivative is basically a crystal of the protein with heavy atoms specifically bound to the protein. Ideally, the binding of the heavy atoms must not disturb the crystal lattice. In order to manipulate crystals, it is most often necessary to define a protein-free solution in which the crystal is still stable, meaning that it neither cracks nor dissolves. A stable protein-free solution will typically contain the precipitant at a slightly higher concentration than the mother liquor (consider why) as well as all other components of the mother liquor, including buffer, cofactors etc. Moving the crystals into a new solution of a different composition than the mother liquor is called soaking.

Protein crystals are better described as an ordered suspension than as truly crystalline, since approximately 50% of a protein crystal is solvent. The solvent is present e.g. as large solvent channels crossing the entire crystal. These channels are formed partly due to the globular shape of many proteins.

The reagents soaked into a crystal enter it via the solvent channels. Small molecules (e.g. most heavy atom compounds) diffuse rapidly into the crystals, though not as rapidly as when diffusing freely. Apart from heavy atom soaking, soaking can be used to obtain crystals of the protein in question with cofactors, substrates or inhibitors, provided that the molecules diffuse freely into the crystals and find their specific binding sites without disturbing the crystal lattice. A final, very important soaking procedure is the soaking of crystals in cryo-solutions (see later).

In this laboratory course we will cover the practical aspects of determining a protein crystal structure by X-ray crystallography. More specifically, we will determine the structure of proteinase K, a 28 kDa enzyme, by performing all the steps of a typical crystallographic experiment: crystallization, screening for heavy atom derivatives, collecting and reducing data, space group determination, calculating SAD and MR phases, interpreting electron density, refining and validating structure models and docking into EM/SAXS envelopes.

Figure: Example of crystal packing. The solvent channels are easily seen. The unit cell is marked. The two molecules per asymmetric unit are coloured yellow and red, respectively. The space group is P3₁21 with 6 asymmetric units per unit cell and 63 % solvent. L. Jenner et al. (1998) "Crystal structure of the receptor-binding domain of α2-macroglobulin." Structure 6, 595-604.

The exercise is based on a course at UCLA and further information can be found at http://www.doe-mbi.ucla.edu/~sawaya/m230d/m230d.html

Proteinase K is a serine protease with low specificity. It has the great property of crystallizing under many different conditions and has been used as a model protein for e.g. low gravity crystallization experiments in outer space. There are several important questions the crystallographer should ask about a protein before beginning a crystallization experiment. What is the molecular weight of the protein? What buffer will the protein be stable in? Will the protein remain active if I boil it in concentrated HCl? In the early days of crystallography, these questions could be directed to the biochemist who had painstakingly characterized the protein in full detail for many years in advance of structural studies. Nowadays, with the advent of structural genomics projects, the primary sequence may be the only information a crystallographer has to go on. In this section, we explore how many of these important questions can be answered on the internet given only the primary sequence.

Go to http://www.uniprot.org/ and type PRTK_TRIAL in the query field and press enter to go to the Protein Knowledge Base (UniProtKB) page for Proteinase K. Scroll down to Sequences. In the window where ‘blast’ is written, open the pull-down menu, select ProtParam and click go. Select chain 106-384 (mature proteinase K protein). You can now see a number of properties of the primary sequence, such as the number of cysteines and methionines. Go back to the UniProtKB page and explore Proteinase K further.

Exercises to be completed before the lab exercise:

To prepare for the practical lab exercises, a number of theoretical exercises must be completed before coming to the lab course. Your answers must be approved by your instructors before you can complete the course.

1) Go to http://www.curriculearn.dk/ and register for a student account using your AU student number (årskortnummer).

2) Once registered and logged in, you should be able to see the exercises by clicking "Attend" next to the Applied Structural Biology 2017 course in the course overview.

3) Study the questions inside the course and submit your answers using the interface. You can submit multiple times and all your versions will be recorded.


Proteinase K crystal gallery


Practicals

Practicals 1-5 each start with five multiple choice questions (‘quickies’). You will find these questions on curriculearn.dk in separate folders. Please note that sometimes more than one answer is correct! You must submit your answers before coming to the corresponding lab exercise; at the following practical, the instructors will go through the questions and explain the correct answers.

Day 1 Crystallization I

Part one: The Footprint Screen

Objective: Testing the solubility of Proteinase K

Method: Sitting drop vapour diffusion, Footprint screen

Notes: In this screen (24 experiments), a small, middle and large sized PEG at four concentrations each and at acidic, neutral and basic pH are used for testing of the solubility of Proteinase K in PEGs. Likewise, three salts (AmS, Na/KHPO4 and NaCitrate) are used at four concentrations and acidic, neutral and basic pH to test salt solubility of Proteinase K. This information is valuable in order to set up additional screens to obtain crystals.

Materials: 1) Proteinase K from Tritirachium album (40 mg/mL in 25 mM Tris-HCl pH 7.0) purchased from Sigma, (cat no. P2308).

2) MD Footprint screen I/II (MDFP1/MDFP2)

3) Chryschem plates sitting drops, 24-wells (Hampton Research, cat. no. HR3-160)

4) Sealing tape (Hampton Research, cat. no. HR4-510)

Procedures:

1) Work in groups of two.

Tray 1: Each group sets up a Footprint screen (tray 1) according to the tables below. Groups with even numbers set up Footprint screen 1 and groups with odd numbers set up Footprint screen 2.

2) Pipet 400 µl of each solution into the appropriate well. Start with row A1-6. After pipetting, cover the row with scotch tape and continue with row B1-6. This is done to keep dust particles out of the drops.

3) Drops: Trays 1 and 2 (see below) are sitting drop experiments.

(a) With the P-10 or P-20 Pipetman, pipet 2.0 µL of the proteinase K solution onto the bottom of each concave sitting drop post in the ChrysChem plates. Use a steady hand to keep the drop in the form of a nice round bead. Remember, "bubbles means troubles." Pipet the first 6 drops, then go to (b).

How to avoid blowing bubbles in the drop: Normally, you expel liquids from the Pipetman by pushing the plunger to the second stop. The travel between the first and the second stop blows air through the tip, which ensures that the entire sample is expelled. Unfortunately, this also blows bubbles into the drop you are forming in the depressions. You can avoid this by pushing the plunger to the first stop only. If you do get a bubble in your drop, you may remove it by holding the pipet vertically and just touching the top of the drop.

(b) Pipet 2.0 µL of reservoir A1 onto drop A1. The components will mix by convection, so there is no need to mix with the Pipetman. Mixing just increases the likelihood of smearing the drop.

(c) Add reservoir A2 to drop A2, etc. After mixing the first 6 drops, cover the row with scotch tape.

(d) Repeat for rows B-D. Finally, carefully label your tray with a marker: the tray number, your names, the date, and the protein name.

Figure: Sitting drop experiment (labels: protein, H2O).

Part two: Sparse Matrix Screening

Objective: To obtain initial crystallization conditions that can be used for further optimization in later screens

Method: Sitting drop vapour diffusion. Screening (many different conditions)

Note: When presented with the challenge of crystallizing a protein for the first time, dozens of widely varying conditions must be screened before one finds crystals. This method, called sparse matrix screening, is usually performed using commercial screening kits, such as those sold by Hampton Research, Emerald Biostructures or Molecular Dimensions.

Materials: 1) Proteinase K from Tritirachium album (40 mg/mL in 25 mM Tris-HCl pH 7) purchased from Sigma, (cat. no. P2308)

2) Structure Screen 1 and 2 (Molecular Dimensions, MD1-01ROWORLD and MD1-02ROWORLD)

3) Chryschem plates sitting drops, 24-wells (Hampton Research, cat. no. HR3-160)

4) Sealing tape (Hampton Research, cat. no. HR4-510)

Procedures:

Tray 2: Each group sets up 24 sparse matrix crystallization trials (tray 2) from Structure Screen 1 or 2 (from Molecular Dimensions) according to your group's assignment (ask instructors).

1) Reservoir solutions: Pipet the indicated amount of reagent into each of the 24 wells. Use 400 µl of the sparse matrix solutions.

Part three: Grid screening

Objective: To obtain initial crystallization conditions with a specific precipitant (PEG or salt) that can be used for further optimization in later screens

Method: Sitting or hanging drop vapour diffusion. Screening a selected precipitant-pH ‘space’

Note: The most successful precipitants are AmS and PEGs. Often, if you can crystallize in a PEG, you can also crystallize in the medium sized PEG2000 MME (or another medium sized PEG). It is therefore a good idea to include AmS/pH and PEG2000 MME/pH grid screens in your first crystallization trials. A grid screen provides a systematic approach to screening the selected precipitant for the crystallization of biological macromolecules. A single precipitant is screened at four unique concentrations versus six precise levels of pH, e.g. between 4 and 9. The concentration range of precipitants and the buffer pH are based upon those that are most frequently reported to offer success, with ranges coarse enough for thorough sampling but fine enough for complete coverage. Grid screening techniques are highly effective for determining the preliminary crystallization conditions of macromolecules.

Materials: 1) Proteinase K from Tritirachium album (40 mg/mL in 25 mM Tris-HCl pH 7) purchased from Sigma, (cat. no. P2308)

2) 3.8M AmS, 50% PEG2000 MME, 0.5M or 1M stocks of Citric acid pH 4.0 and 5.0, MES pH 6.0, Hepes-NaOH pH 7.0, Tris-HCl pH 8.0 and Ches pH 9.0

3) MilliQ water

4) Chryschem plates sitting and hanging drops, 24-wells (Hampton Research, cat. no. HR3-160)

5) Sealing tape (Hampton Research, cat. no. HR4-510) and immersion oil

6) Gloves

Procedures:

1) Set up tray 3 as hanging drop experiments and tray 4 as sitting drop experiments (see below).

Tray 3: Set up a PEG/pH grid screen according to the table below:

PEG 2000 MME  1                 2                 3                 4                 5                 6
              0.1M Citric acid  0.1M Citric acid  0.1M MES          0.1M Hepes-NaOH   0.1M Tris-HCl     0.1M Ches
              pH 4.0            pH 5.0            pH 6.0            pH 7.0            pH 8.0            pH 9.0
A  5%
B  10%
C  20%
D  30%

Tray 4: Set up an AmS/pH grid screen according to the table below:

AmS           1                 2                 3                 4                 5                 6
              0.1M Citric acid  0.1M Citric acid  0.1M MES          0.1M Hepes-NaOH   0.1M Tris-HCl     0.1M Ches
              pH 4.0            pH 5.0            pH 6.0            pH 7.0            pH 8.0            pH 9.0
A  0.8M
B  1.6M
C  2.4M
D  3.0M
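The pipetting volumes for a grid screen follow from the stock concentrations listed under Materials. The sketch below lays out the 24 wells of the AmS/pH grid (rows A-D, columns 1-6) and computes the volumes of AmS stock, buffer stock and water for each well. The reservoir volume of 500 µl and the choice of the 1 M buffer stocks are assumptions made for illustration; adjust both to what is actually used. The function names are hypothetical.

```python
STOCK_AMS_M = 3.8            # AmS stock from the materials list
STOCK_BUFFER_M = 1.0         # assumption: 1 M buffer stocks (0.5 M also listed)
RESERVOIR_UL = 500.0         # assumption: volume mixed per well

ROWS = {"A": 0.8, "B": 1.6, "C": 2.4, "D": 3.0}   # AmS concentration in M
COLUMNS = {
    1: "Citric acid pH 4.0",
    2: "Citric acid pH 5.0",
    3: "MES pH 6.0",
    4: "Hepes-NaOH pH 7.0",
    5: "Tris-HCl pH 8.0",
    6: "Ches pH 9.0",
}


def well_recipe(ams_m, buffer_m=0.1, total_ul=RESERVOIR_UL):
    """Volumes (µl) of AmS stock, buffer stock and water for one well."""
    ams_ul = total_ul * ams_m / STOCK_AMS_M
    buffer_ul = total_ul * buffer_m / STOCK_BUFFER_M
    water_ul = total_ul - ams_ul - buffer_ul
    if water_ul < 0:
        raise ValueError("stocks are too dilute for this condition")
    return {"ams_ul": ams_ul, "buffer_ul": buffer_ul, "water_ul": water_ul}


def tray4_recipes():
    """Recipe for every well of the AmS/pH grid, keyed by well name (A1..D6)."""
    tray = {}
    for row, ams_m in ROWS.items():
        for col, buffer_name in COLUMNS.items():
            recipe = well_recipe(ams_m)
            recipe["buffer"] = buffer_name
            tray[f"{row}{col}"] = recipe
    return tray


if __name__ == "__main__":
    for well, r in tray4_recipes().items():
        print(f"{well}: {r['ams_ul']:6.1f} µl AmS, {r['buffer_ul']:5.1f} µl "
              f"{r['buffer']}, {r['water_ul']:6.1f} µl water")
```

The same function can be reused for the PEG/pH grid by replacing the stock concentration with 50 (%) and the row values with the PEG percentages.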

2) Setting up hanging drops (tray 4). Start by oiling the edges of the wells of the Chryschem plate (use gloves). Pipet the solutions into the wells after mixing them in Eppendorf tubes. Take a coverslide and pipet 2 µl protein solution onto its center. Pipet 2 µl reservoir solution A1 into the protein solution drop and place the coverslide on top of well A1, inverting it at the same time so that the drop faces the reservoir. Give the coverslide a gentle ‘flick’ with the opposite end of a pipette tip in order to seal the experiment. Make sure that the experiment is completely sealed by an oil ‘rim’ at the edge of the coverslide.

Figure: Hanging drop experiment (labels: protein solution, H2O).

Scoring sheets (one per tray). Fields: Date, Experimental details.

For the optimised screen, ammonium sulfate concentration varies horizontally across a row. pH varies vertically between rows. The volume of each reservoir is fixed at 500 µL. Be sure to mix each reservoir thoroughly when finished pipetting. You may do this by gently pipetting up and down with a P-1000 Pipetman set to 500 µL. Avoid sucking liquid into the shaft of the Pipetman.


Day 2 Crystallization II

Part four: Scoring crystallization experiments, optimized screens

Objective: Analyse the outcome of the first crystallization experiments (Footprint, sparse matrix and grid screens) in order to set up new rational screens, hopefully producing (better) crystals.

Method: Sitting drop vapour diffusion, microbatch, seeding

Procedures: Use the scoring sheets on the previous pages to evaluate your crystallization experiments.

Scoring the crystallization experiments

When scoring the crystallization trays it is advantageous to use a code, to avoid having to write entire novels about the individual trays. Use the following code together with the scoring sheets and the compositions of the screens when scoring:

-2 Heavily precipitated protein
-1 Light precipitate
0 Clear drop
1 Oil drops or phase separation
2 Crystalline material or micro crystals
3 Spherulites (quasi crystals), needles or thin plates
4 Single crystals, note the size and number

Get info from the other groups about the scores of the sparse matrix experiments in Structure Screen 1 and 2 that you did not do yourself. Based on the scores, identify the optimal crystallization conditions for future screens. Write down your suggested ‘super screen(s)’ and show them to the instructors for evaluation. If time allows, use the INFAC generator (see the introduction) to set up an optimization screen with an optimal sampling of your conditions in as few experiments as possible.

Note: An ideal grid screen contains clear drops at one ‘end’ and precipitate at the other ‘end’. If all drops are clear or all are precipitated, the solubility curve is not ‘sampled’.
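If the scores are typed into a computer, the scoring code can be used directly to rank conditions and to check whether a grid screen row samples the solubility curve as described in the note. The sketch below is one possible way to do this; the well names and scores in the example are hypothetical, and the threshold of 2 (crystalline material or better) is an assumption.

```python
SCORE_CODES = {
    -2: "Heavily precipitated protein",
    -1: "Light precipitate",
    0: "Clear drop",
    1: "Oil drops or phase separation",
    2: "Crystalline material or micro crystals",
    3: "Spherulites (quasi crystals), needles or thin plates",
    4: "Single crystals",
}


def best_conditions(scores, minimum=2):
    """Wells scoring at least `minimum`, highest score first."""
    for well, score in scores.items():
        if score not in SCORE_CODES:
            raise ValueError(f"unknown score {score!r} for well {well}")
    hits = [(well, score) for well, score in scores.items() if score >= minimum]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


def solubility_sampled(row_scores):
    """True if a row contains both clear drops and precipitate."""
    has_clear = any(score == 0 for score in row_scores)
    has_precipitate = any(score < 0 for score in row_scores)
    return has_clear and has_precipitate


if __name__ == "__main__":
    # Hypothetical scores for one row of a tray.
    row_a = {"A1": 0, "A2": -1, "A3": 2, "A4": 4, "A5": -2, "A6": 3}
    for well, score in best_conditions(row_a):
        print(well, score, SCORE_CODES[score])
    print("solubility curve sampled:", solubility_sampled(row_a.values()))
```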


Tray 5: Now a tray (tray 5 – sitting drop experiment) with optimized crystallization conditions is handed out by the instructors (ask instructors for screen info).

Crystal handling and data collection I

Part five: Preparation of heavy atom derivatives

Objective: Select a heavy atom and soak it into crystals for derivatization. Ask instructors.

Caution: Heavy atom compounds are toxic to humans. Do not let these solutions come in contact with your skin. Wear gloves. Place all pipet tips in specially provided containers for proper disposal.

Procedures: 1) Select the heavy atom to use, ask the instructors.

2) Prepare a 1:1 and 1:5 dilution of the selected heavy atom.

3) Select three drops with nice crystals from the trays of proteinase K crystals that you grew (including tray 5).

4) Add 1 µL of concentrated heavy atom to drop 1. Add 1 µL of 1:1 diluted heavy atom to drop 2. Add 1 µL of 1:5 diluted heavy atom to drop 3.

Part six: Stabilization and identification of a suitable cryoprotectant of proteinase K crystals

i) Stabilizing proteinase K crystals in protein free solution

Solutions:
3.8 M AmS
1 M Tris-HCl pH 7 or 8
50% Glycerol
70% PEG400

Choose undamaged crystals from a well. Good crystals are single, of suitable size (e.g. 100-500 micron on the longest edge), and have sharp edges and smooth, clean sides. Note the crystallization conditions, and then mix solutions that you think would stabilize the crystals (e.g. if the crystals are grown in 1.0 M AmS, mix a 1000 µl solution with 1.2 M AmS in the same buffer).

Then open the well with the crystals using a scalpel. Remove and save the reservoir with a P1000 pipette, and replace it with 500 µl of the new assumed stabilizing solution. Add 15 µl of the mother liquor (original reservoir solution) and 15 µl of the assumed stabilizing solution to the drop, while following the process in the light microscope. BE CAREFUL NOT TO DAMAGE THE CRYSTALS! Avoid air bubbles. The well is closed temporarily with scotch tape. Remember to close all wells as soon as you are not handling them. Leave the crystals for 2 min, and then remove 15 µl of the drop solution followed by addition of 15 µl stabilizing solution. Repeat the procedure 2 times, the third time removing almost all (but not all, don’t dry out the crystal) of the surrounding liquid before adding a final volume of 30 µl containing only the stabilizing solution. Now close the well with Crystal Clear tape. Observe after 30 min whether the crystals are still intact. Make at least 5 wells with stabilized crystals. When testing stabilizing solutions for new crystals, you might have to test more concentrations of the precipitant (e.g. in parallel) if the crystals are not stable and crack or dissolve in your first stabilizing solution.

ii) Cryo soaking

Crystals exposed to X-rays are slowly damaged by the high-energy radiation. To limit this damage, data collection is often carried out at very low temperature (100 K). For this, the crystals must be transferred to a solution that does not form ice when cooled to 100 K. Ice formed around the crystal would damage it, and ice also diffracts X-rays and thus interferes with the diffraction from the crystal. Typical cryoprotectants are glycerol, sucrose, other sugars or small PEGs. These compounds are mixed with all the other components of the protein free solution. In cases where the crystals have been obtained at high salt concentration, phase separation can be an obstacle. Extreme water potentials in high salt/cryoprotectant solutions can also ruin the crystals.

We now wish to identify solutions of AmS and glycerol, as well as of AmS and PEG400, that can be cooled without ice formation. Glycerol has the unwanted side effect of making the protein more soluble, and we must compensate for that by raising the concentration of the precipitant. PEG 400, in contrast, is itself a precipitant, and it can be necessary to lower the AmS concentration when raising the PEG 400 concentration. Make 1000 µl solutions of the following:

1. 50 mM Tris-HCl pH 7, 1.35 M AmS, 15% Glycerol
2. 50 mM Tris-HCl pH 7, 1.40 M AmS, 20% Glycerol
3. 50 mM Tris-HCl pH 7, 1.45 M AmS, 25% Glycerol
4. 50 mM Tris-HCl pH 7, 0.80 M AmS, 25% PEG 400
5. 50 mM Tris-HCl pH 7, 0.70 M AmS, 30% PEG 400
6. 50 mM Tris-HCl pH 7, 0.60 M AmS, 35% PEG 400
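The volumes needed for these six solutions can be worked out from the stock solutions listed for part six (3.8 M AmS, 1 M Tris-HCl, 50% glycerol, 70% PEG400) using C1 x V1 = C2 x V2 for each component, with water making up the rest. The sketch below does this calculation and raises an error if a recipe cannot be made from the stocks; it assumes that volumes are additive, which is only approximately true for viscous stocks such as glycerol and PEG. The names used in the code are hypothetical.

```python
# Stock concentrations from the list of solutions for part six
# (molar for salts and buffer, per cent for glycerol and PEG 400).
STOCKS = {
    "Tris-HCl pH 7": 1.0,
    "AmS": 3.8,
    "Glycerol": 50.0,
    "PEG 400": 70.0,
}

# Target concentrations of cryo-solutions 1-6, in the same units as the stocks.
SOLUTIONS = {
    1: {"Tris-HCl pH 7": 0.050, "AmS": 1.35, "Glycerol": 15.0},
    2: {"Tris-HCl pH 7": 0.050, "AmS": 1.40, "Glycerol": 20.0},
    3: {"Tris-HCl pH 7": 0.050, "AmS": 1.45, "Glycerol": 25.0},
    4: {"Tris-HCl pH 7": 0.050, "AmS": 0.80, "PEG 400": 25.0},
    5: {"Tris-HCl pH 7": 0.050, "AmS": 0.70, "PEG 400": 30.0},
    6: {"Tris-HCl pH 7": 0.050, "AmS": 0.60, "PEG 400": 35.0},
}


def recipe(targets, total_ul=1000.0):
    """Volume (µl) of each stock plus water for one solution (C1*V1 = C2*V2)."""
    volumes = {name: total_ul * target / STOCKS[name]
               for name, target in targets.items()}
    water_ul = total_ul - sum(volumes.values())
    if water_ul < 0:
        raise ValueError("stocks are too dilute to reach these concentrations")
    volumes["water"] = water_ul
    return volumes


if __name__ == "__main__":
    for number, targets in SOLUTIONS.items():
        parts = ", ".join(f"{v:.0f} µl {name}"
                          for name, v in recipe(targets).items())
        print(f"{number}. {parts}")
```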


Test the ability of the solutions to act as cryoprotectants by flash-cooling them in the cryostream. Use a cryoloop. Dip the loop into the solution and quickly transfer the loop to the goniometer head while shielding the cryostream with a credit card (or similar). Quickly remove the credit card to flash-cool the drop, which is now cooled by the stream of nitrogen. Watch the loop on the little monitor. A milky white colour indicates ice formation, whereas a clear film indicates successful cryocooling and a suitable cryoprotectant (given that the crystal tolerates the solution). If any doubt is left, we can expose the loop to X-rays, since ice gives very distinct diffraction. The best cryoprotectant identified will be used in the next exercise to cryo-protect the crystals before cooling them in liquid nitrogen.

# 1 2 3 4 5 6 7 8 9 10 11

Ice (+/-)

Crystal handling and data collection II

Introduction

Once crystals have been obtained, they must be mounted in either a capillary or a cryoloop for data collection. The aim is to hold the crystal in a fixed position in the beam while the data are collected. In a capillary the crystal is free of the mother liquor, but not dry, and data are collected at room temperature. In a cryoloop the crystal is surrounded by liquid and data can be collected at cryogenic temperatures.

Figures: A capillary mounted crystal on the goniometer head; a cryoloop mounted crystal on the goniometer head; capillary mounted crystal. Labels in the figures: c-axis, a/b-axis, adjustment of vertical arc, adjustment of horizontal arc, sleighs for horizontal and vertical translation, translation along the spindle axis (also on the goniostat), crystal, mother liquor, beeswax.

Part seven: Cryoprotection and -cooling of the proteinase K crystals

i) Cryo-protection

A suitable cryoprotectant was identified earlier. You must transfer a crystal to this solution. This is done gradually, as when transferring to a protein free solution. If e.g. cryosolution #10 is chosen, first remove most of the surrounding stabilizing solution, add solution #6, leave the crystal to equilibrate for 5 min, then repeat with solution #7 etc. Once the crystal is in the right cryoprotectant, it is ready to be caught with the loop and quickly (!) transferred to the goniometer head. We will try this during the data collection tutorial (day 3).

ii) Cooling the crystals

If you want to store your crystals cooled (e.g. when going to a synchrotron abroad), you cool them in liquid nitrogen. The instructors will show how to do this. Briefly, you fish a crystal with a nylon loop of a suitable size and cool it. If the loop is too big or too small, the crystal will not sit well in the loop or will be impossible to catch. You must always wear gloves and glasses when working with liquid nitrogen, since it is extremely cold.

1) Find the right loop size by looking through the microscope and comparing the crystal with different loop sizes.
2) Put on gloves and glasses.
3) Put the cap with pin and loop on a magnetic wand.
4) Place the vial in a vial clamp and lower it into liquid nitrogen.
5) Remove the tape from the well you wish to "fish" a crystal from.
6) Fish the cryoprotected crystal into the loop.
7) Transfer the loop with the caught crystal to liquid nitrogen quickly but safely.
8) Transfer the loop into the vial placed under liquid nitrogen. It is important to keep the crystal under liquid nitrogen from here on, otherwise the crystal will heat up and be damaged.
9) Transfer the vial with the crystal to a labelled straw for storage.


Day 3

Part nine: Data collection

The Rotation Method

Data are collected by rotating the crystal around an axis (the spindle axis). A ‘1° rotation image’ is thus an image recorded while the exposed crystal is rotated 1° around the spindle axis. The rotation method is used for data collection since it is very economical in terms of time. Below is a 0.5° rotation image of proteinase K. Note how the reflections form concentric, almost circular regions called lunes. These correspond to the intersections of a series of parallel reciprocal lattice planes with the Ewald sphere. The edges of a lune correspond to the start and stop of the rotation. Each lune arises from a single plane in the reciprocal lattice, thus all reflections in a lune have one index in common.

Figure: 0.5° rotation image of proteinase K. Crystal:detector distance 100 mm, λ = 1.54 Å. Notice the four ‘shadows’ near the center of the image. These are fiber diffraction from the nylon loop containing the crystal.


The presence of many zones indicate that there are many, but not all reflexes from several planes pictured in one image. To collect a full dataset (all unique reflexes), one must expose the crystal to X-rays while rotating the crystal, and collect successive images until all reflexes have been registered at least once. The spots on the edges of the lunes are only partially recorded, and will continue on the following image. Therefore one must add the intensities of the reflexes from several images to obtain the full intensity of a given reflex. When collecting the data, it is desirable to rotate as far as possible on one image, in order to collect as few partials as possible, as this will make the subsequent data handling easier. But if the rotation angle gets too large, one will observe overlaps between reflexes from different lunes. So it is important to find an optimal rotation angle with the maximum fully recorded reflections and the minimal overlap. The optimal rotation angle depends on the following:

1. The size of the unit cell. The larger the cell, the smaller the angle.

2. The size of the individual spots, also called the mosaic spread. The width of the spots is often between 0.1° and 0.5°, depending on the quality of the crystal and the optics of the X-ray apparatus.

3. The resolution. The higher the resolution, the smaller the rotation angle. This is because overlap becomes more pronounced further from the centre of the image.
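The three factors above are often combined into a rule of thumb for the largest rotation per image that avoids overlap: Δφ ≈ (180/π)·(d/a) − η, where d is the high-resolution limit, a is the cell edge lying along the X-ray beam and η is the mosaic spread. The sketch below only evaluates this estimate. The function name and the example values (1.5 Å resolution, 68 Å cell edge, 0.3° mosaic spread) are assumptions chosen for illustration and are not taken from this manual.

```python
import math


def max_rotation_per_image(d_min, cell_edge, mosaicity):
    """Rough upper limit (in degrees) on the rotation per image.

    d_min and cell_edge are in Angstrom, mosaicity in degrees.
    Implements the rule of thumb 180*d/(pi*a) - mosaicity.
    """
    return math.degrees(d_min / cell_edge) - mosaicity


# Hypothetical example values, for illustration only.
angle = max_rotation_per_image(d_min=1.5, cell_edge=68.0, mosaicity=0.3)
print(f"Maximum rotation per image: {angle:.2f} degrees")
```

A larger cell edge, a higher resolution (smaller d) or a larger mosaic spread all reduce the estimate, in line with the list above.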

For more, see Z. Dauter (1997) ‘Data Collection Strategy’. Methods in Enzymology, 276, 326-344.

Mounting of crystals taken directly from the tray

The cooling system is set to 100 K. First place a cryoloop on the goniometer head and roughly center the empty cryoloop. This ensures that the loop will be in the centre of the nitrogen stream once a crystal is in the loop. Now use the cryoloop to catch a crystal that has been transferred to a cryo buffer. Work fast: once the crystal is in the loop, transfer it to the goniometer head very quickly. Hold a credit card or similar in the nitrogen stream to shield the crystal while placing the cryoloop on the goniometer head, and then quickly remove the credit card (flash freezing). Center the crystal, and check whether the cooling has been successful. Is the loop milky white or completely clear? Make a test shot as above. Look for ice rings on the image.

Choose the best crystal for the actual data collection. First attach a crystal mounted in a cryoloop to a goniometer head, as described above. The goniometer head is attached to the goniostat, and the crystal must now be centered in the beam, using the cross on the monitor as guidance. The centering is done by adjusting the slides of the goniometer head. Check the slit size (the width of the slits through which the X-rays pass). The slit size should be smaller than the smallest dimension of the crystal. If you removed the beamstop for mounting, make sure you put it back correctly.


As a first try, use a rotation angle of 1°, an exposure time of 60 sec. and a crystal:detector distance of 60 mm. This is all controlled from the computer. If the quality of the crystal is low, so that the diffraction is not good enough (little or no diffraction, high mosaicity, diffraction from several crystals etc.), a new crystal should be tested. When a good crystal has been found we are ready to collect data. It is then possible to use a program called STRATEGY, which estimates what rotation angle of the crystal should be chosen as the starting point for the data collection in order to minimize the time needed to collect a full dataset.

Integration and Reduction of Data

Introduction

As previously mentioned, oscillation data are collected by exposing the crystal to X-radiation while oscillating it (e.g. the crystal is oscillated 0.5° while being exposed for 10 minutes). If you wish to collect 90° you will therefore end up with 180 images collected in 30 hours. These data need to be scaled and reduced (multiple occurrences of reflections with the same hkl indices are averaged) before you can start using the data for structure determination.

Depending on the crystal mosaicity, a reflection will be in diffracting position (its reciprocal lattice point intersecting the Ewald sphere) over a certain oscillation angle interval. Therefore the same reflection might occur on several consecutive images, depending on the oscillation angle used for the data collection. To get the full and correct intensity of the reflection, integration over a number of images might therefore be necessary. Some reflections might overlap, making it impossible to measure the correct intensity, and these reflections have to be rejected from the data.

The intensity might vary with the crystal orientation, depending on the amount of “crystal matter” actually being exposed (e.g. the crystal might be long and thin, so that certain orientations will make it possible to expose all of the crystal, whereas other orientations will not). Therefore scaling of the data is necessary so that multiple occurrences of the same reflection can be averaged correctly. The intensity of the X-ray beam might also fluctuate over time, which is another reason why correct scaling is important.

In the following you will be introduced to the programs iMosflm and Scala, which are part of the CCP4i program suite and are used for data integration and for scaling and reduction, respectively. The iMosflm manual is available online in pdf-format at http://www.mrc-lmb.cam.ac.uk/harry/imosflm/ver103/
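The arithmetic behind the example at the start of this introduction (0.5° per image, 10 minutes per exposure, 90° in total) can be written out as a small sketch. The function name is hypothetical and the code only reproduces the numbers quoted in the text.

```python
def sweep_plan(total_rotation_deg, step_deg, exposure_min):
    """Return (number of images, total exposure time in hours)."""
    n_images = round(total_rotation_deg / step_deg)
    hours = n_images * exposure_min / 60.0
    return n_images, hours


n_images, hours = sweep_plan(total_rotation_deg=90, step_deg=0.5, exposure_min=10)
print(n_images, "images,", hours, "hours")  # 180 images, 30.0 hours
```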


iMosflm

Indexing

In order to integrate the data, the crystal system must be determined and every reflection must be assigned hkl indices. This involves rather complex calculations and considerations. In order to index the data we work with five different coordinate systems:

The first coordinate system describes the reciprocal lattice so that all lattice points have integral values. This coordinate system is formed by the reciprocal lattice vectors a*, b*, and c*.

The second coordinate system is orthonormal and is usually chosen so that the x-axis is parallel to a*, and the y-axis is in the a*-b* plane, normal to the x-axis and closest to b*. Finally z is chosen so that the system is right-handed.

The third coordinate system is defined so that the z-axis is parallel to the X-ray beam, the x-axis is parallel to the oscillation axis, and the y-axis is normal to the other two so that the system is right-handed.

The fourth coordinate system is formed so that when the oscillation angle is 0° it is identical to the third. The second and the fourth coordinate systems are related by three angles called missetting angles. The fourth coordinate system oscillates as the crystal is oscillated. Notice that the first coordinate system, which is formed by the reciprocal crystal axes, oscillates as well, but this coordinate system is probably not aligned along the rotation axis and is not necessarily orthonormal.

The fifth coordinate system is formed by the detector, and is therefore two-dimensional. As it is impossible to construct a perfect detector and to align it perfectly to the x- and y-axes of the third coordinate system, the fifth coordinate system describes the deviations of the detector plane from the third coordinate system.

The relation between the coordinate systems can be described using a set of matrices such that:

$(X_\varphi, Y_\varphi, \varphi) = M \times (h, k, l)$

Here $(X_\varphi, Y_\varphi, \varphi)$ is the position of a reflection in the detector plane at a given oscillation angle $\varphi$, and $M$ is the product of the matrices relating the coordinate systems. Luckily we have iMosflm, which will perform the indexing automatically. The relation between the plane of the detector and the X-ray beam can be estimated and will be refined by iMosflm. What we are really interested in is the orientation of the crystal in relation to the X-ray beam.


iMosflm (automatic method) starts off with a peak search, where a number of the strongest peaks on the image are selected for indexing. The reflections are mapped to reciprocal space by testing all possible indices. When iMosflm finds a situation in which all reflections have integral values for one index (e.g. h), this corresponds to having found one of the axes (a* and a). iMosflm proceeds until it has found sets of vectors (a*, b*, and c*) corresponding to all 14 Bravais lattices, with the restriction that the volume of the unit cell must be the smallest possible for each lattice type. As there is only one correct solution (not counting sub-symmetry), iMosflm distorts the vectors to fit the lattice type. After autoindexing, iMosflm prints out a distortion index that shows how much the vectors must be distorted to form a given lattice type. The correct solution is the Bravais lattice that has the highest symmetry (unless there is some kind of pseudo-symmetry) while still having a low distortion index.

Integration

After the crystal system and the orientation of the crystal have been determined, this information can be added to the command file and integration of the data can proceed. The goal of indexing and integration is to get a value of I (intensity) for each set of h,k,l indices. The intensity is basically the number of counts from the detector at the position corresponding to h,k,l. However, because of errors it is better to determine the intensity in a slightly different way. When the matrix M is known, it is possible to predict the positions of all reflections on an image when the oscillation angle is known. A reflection (spot) is a number of counts (with a maximum) in an area surrounded by background (noise). The physical dimension of the area depends on the size of the crystal and on the quality and width of the X-ray beam. The area is defined manually (spot radius #.#), as is the area around the spot used for calculating the background (box #.# #.#).

Instead of just summing all the counts inside the spot area in order to determine the intensity, profile fitting is used. In profile fitting, a large number of very intense reflections from the same part of the detector are used to determine an average profile of the reflection. This makes the estimation of the intensities of the weaker spots more accurate, as errors are more pronounced for these reflections than for the more intense spots. After each image has been integrated, an output file containing information about the intensity and background for each set of h,k,l indices is written. This file also records whether the reflection is fully recorded on one image or partially recorded (the reflection continues on the consecutive image(s)). The relation between the intensity and the structure factors can be described using the following formula:

$I_{int}(hkl) = \frac{\lambda^3}{\omega V^2} \times \left(\frac{e^2}{mc^2}\right)^2 \times V_{cr} \times I_0 \times L \times P \times A \times |F(hkl)|^2$ (Drenth p. 108)

$V$: volume of the unit cell
$V_{cr}$: volume of the crystal in the beam; this is not corrected for
$I_0$: intensity of the incoming X-ray beam
$L$: Lorentz factor
$P$: polarization factor (depends on the type of X-ray used)
$A$: absorption coefficient (air absorption)

In order for the next program to correct for some of these (mostly the geometry-dependent factors such as the Lorentz factor, and also P and A), these are also written to the output file.

Scala

Scaling and reduction

As mentioned above, the output from the indexing and integration by iMosflm consists of the intensities and the indices h,k,l, as well as some values necessary for correcting the intensities. Those reflections that occur on consecutive images (partially recorded reflections) are summed to give full reflection intensities. The program Scala applies the corrections to the intensities and then scales the images. Because the intensity of the incoming beam can vary with time, the exposure time can vary, the crystal can decay etc., it is necessary to apply scaling factors to the intensities so that they can be correctly reduced. Apart from a scale factor, Scala also estimates a B-factor for each image. This B-factor (temperature factor) corrects for movement within the crystal during data collection and for crystal decay. These two factors are combined into a multiplicative correction factor given by:

$S = \exp(2B[\sin\theta/\lambda]^2) / \mathrm{scale}$

One image is chosen as the reference, with scale factor 1 and B-factor 0. By minimising the difference between symmetry-related reflections on different images, the scale factor and the B-factor can be determined for all other images. The scale factors determined this way are not absolute values but relative values, determined in order to bring the intensities of reflections on other images onto the same scale as those of the reference image.

After application of the scale factor and B-factor, the symmetry-related reflections and those that are recorded multiple times are averaged to give the unique reflections. By averaging a large number of reflections this way, the estimates of the intensities become better, and it is possible to do outlier rejection, so that an observation that differs a lot from the other observations of the same reflection can be excluded from the calculation of the average intensity. This reduces the error on the reflection intensities. Therefore it is always preferable to have a high redundancy, that is, to have measured each reflection and its symmetry-related reflections as many times as possible. Finally Scala writes a file containing the unique reflection indices with intensities and noise (sigma(I)).
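The correction factor and the averaging step described above can be illustrated with a minimal sketch. It is not how Scala is implemented: the function names are hypothetical, the observations are invented for illustration, and a plain unweighted mean is used where a real scaling program would typically weight observations by their estimated errors.

```python
import math


def correction_factor(scale, b_factor, theta_deg, wavelength):
    """S = exp(2B [sin(theta)/lambda]^2) / scale, as in the formula above."""
    s = math.sin(math.radians(theta_deg)) / wavelength
    return math.exp(2.0 * b_factor * s * s) / scale


def merge(observations):
    """Average the corrected observations of one unique reflection.

    observations: list of (intensity, S) pairs. Unweighted mean (assumption).
    """
    corrected = [intensity * s for intensity, s in observations]
    return sum(corrected) / len(corrected)


# Hypothetical data: the reference image has scale 1 and B = 0, so S = 1.
reference = (1000.0, correction_factor(1.0, 0.0, 10.0, 1.54))
# A second image on which the same reflection was measured 20 % weaker.
weaker = (800.0, correction_factor(0.8, 0.0, 10.0, 1.54))
print(merge([reference, weaker]))
```

With these invented numbers the two observations agree after correction, which is the condition the scaling procedure tries to reach for all symmetry-related reflections.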


Truncate

Structure factor calculation

Truncate is a program that calculates the amplitudes of the structure factors (F) from the intensities. Basically this is done by taking the square root of the corrected intensity (see the equation above). Truncate also makes some adjustments according to the number of atoms in the asymmetric unit, and it corrects the weakest reflections because their intensities are often wrongly estimated. If high resolution data are available, Truncate can also perform Wilson scaling (Rupp chapter 8.5), so that the data can be put on an absolute scale.
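The basic square-root step can be sketched as follows. This is only the naive conversion, with a first-order error estimate σ(F) = σ(I) / (2F) added as an assumption. The correction of the weakest reflections that Truncate performs is not reproduced, so the hypothetical function simply returns None for intensities that are zero or negative.

```python
import math


def amplitude(intensity, sigma_i):
    """Naive F = sqrt(I) with sigma_F = sigma_I / (2 F).

    Returns (F, sigma_F), or None when the intensity is not positive
    and the plain square root cannot be used.
    """
    if intensity <= 0:
        return None
    f = math.sqrt(intensity)
    return f, sigma_i / (2.0 * f)


print(amplitude(400.0, 20.0))  # (20.0, 0.5)
print(amplitude(-5.0, 3.0))    # None
```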


Practicals

First Assignment: To produce a crystallographic data statistics table of the kind typically found in structural biology journals.

Method: Copy the format of the table below. Substitute the data from your Scala log files for the data in the table. Report data statistics on the collected data sets (native, derivative, or PMSF). Submit your Table 1 to the instructors when completed.

A typical Table 1 for an MIR experiment, adapted from Blaszczyk et al., Crystallographic and Modeling Studies of RNase III Suggest a Mechanism for Double-Stranded RNA Cleavage, Structure, Vol. 9, 1225-1236, December 2001.

Part One: Autoindexing, Integration and Scaling

Objective: Define the crystal orientation, lattice and unit cell parameters for the data set.

In the example below we have 360 data images, each image containing hundreds of reflections. In order to integrate the intensities of the reflections we must be able to fit the geometric position of each reflection to a point in the reciprocal lattice. Hence, each reflection is indexed with a specific HKL value. The initial fitting is typically performed using a single data image and requires knowledge of several data parameters. Some are known a priori, some are not.


Known parameters

1) wavelength
2) crystal to detector distance
3) position of the direct beam (this defines the origin of the reciprocal lattice)
4) oscillation angle
5) list of 14 possible Bravais lattices
6) list of the geometrical positions of reflections on a single image

Unknown parameters

7) unit cell parameters (3 lengths, 3 angles)
8) crystal orientation (3 angles)

Procedures

In case you are not familiar with Linux commands, we have included the most important ones at the back of the appendix.

Log in to one of the workstations. To log in on the Linux machines, the computer needs to be restarted if it is running Windows. Press Ctrl+Alt+Delete, click on the red icon at the bottom of the screen and press restart. When a text screen appears, use the arrow-up key to select the Linux machine and press enter to start up the machine.

Username and password are the same as those used for other AU services (NFIT userID).

When logged in, open a terminal window by clicking on the ´window´ icon in the tool bar in the bottom of the screen. In front of the $, type ls and press enter (below indicated by <CR>) to see the files and folders in the home directory. You should get a list looking something like:

$ ls <CR>

Desktop Documents Music Pictures Public Templates Videos

If you do not have all these directories, or maybe have more, then don't worry: you are only going to need the directory 'asb', which you will create yourself.

The 14 Bravais lattices

If you type:

$ mkdir asb <CR>

you will create the asb directory - check if it is there using the ls command. To go into the directory use the cd command:

$ cd asb <CR>

You are now in the asb directory and we now have to copy the data to this directory:

$ cp -r /users/thb/asb/data/files . <CR>

You have now copied a directory called files to the asb folder. Type cd files and ls to check it out:

$ cd files <CR>

$ ls <CR>

A number of subdirectories show up, called eu, mr, native, sad and sm. Now go back to the asb directory (using the cd .. command) and make the following directories: ccp4 and imosflm. In the ccp4 directory, make a tmp directory:

$ cd .. <CR>

$ mkdir ccp4 <CR>

$ mkdir imosflm <CR>

$ cd ccp4 <CR>

$ mkdir tmp <CR>

If you get lost in the directories, you can always type pwd, and you will be told where you are.

$ pwd <CR>

The images for the native data set are located at /users/thb/asb/data/native/images/. To look at this folder type:

$ ls /users/thb/asb/data/native/images/ <CR>

You now see a number of image files: 400 with _hi_ in the name and 400 with _lo_ in the name, corresponding to high and low resolution sweeps. In this course we will focus on the high resolution data set. Be aware that the above ls command mounts the thb user. After a relatively short time it is automatically unmounted. To remount, simply type the ls command above again. The directory must be mounted before you can browse to the data from the iMosflm GUI.


iMosflm 1.0.5 Quick Guide

This is a quick introduction to the basics of processing data with iMosflm. It doesn’t cover advanced or unusual processing, or even all the features available. In this document, you will find out how to launch iMosflm, add images to a session, index from one or more images, estimate the mosaic spread, calculate a data collection strategy, refine the parameters and integrate the dataset. Today we will use iMosflm because it has an illustrative GUI. Underlined text written in italics can be skipped - but contains bonus info. Another program for data processing is XDS. A tutorial for this program written by Laure Yatime can be found in the appendix.

Starting the program

In the terminal window change directory to:

$ cd /users/username/asb/imosflm <CR>

and type:

$ imosflm <CR>

After a short delay, the iMosflm main window will open. Each button on the display has a “tool-tip”, which gives more information about its function. Leaving the mouse cursor over the button for a second or two displays the tool-tip. The main window is laid out so that tasks (on the left) are arranged in a logical sequence through the process of integrating a dataset from scratch.


Note that all the tasks down the lefthand side of the GUI are greyed out and inactive, except for the “Images” and “History”. This is because it doesn’t make any sense to try to index, calculate a strategy, refine or integrate until you have some images to work with! Clicking on "Session" will result in a drop down menu that allows you to save the current session or reload a previously saved session (or add images):

Loading the images

Click on the small “add images” button (top row, fourth from the left), and a file browser window will pop up. By default, files in the current directory with common image name extensions are displayed. You can change directories by typing the full path in the “look in:” window, or by clicking on the “up” or “home” buttons and navigating around by clicking on other directory names. Double-click on an image name to load the image. By default, iMosflm loads the headers of all images with the same template as the chosen image. Note also that the “Index” task is no longer greyed out and can be selected. To add images to a session, use the "Add images..." icon:

Select the correct directory (the default is the directory in which imosflm was launched) from the popup Add Images window. All files with an appropriate extension (which can be selected) will be displayed. Double clicking on any file will result in all images with the same template being added to the session (the template is the whole filename except the number field that specifies the image number). (An alternative is to single click and then click on Open.)
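The idea of a template (the filename with the image-number field set aside) can be illustrated with a short sketch. This is not iMosflm's own code: the function names are hypothetical, the use of '#' as a placeholder for the digits is a choice made for this example, and the filenames are assumed to follow the pattern native_hi_001.mar3450 used in this course.

```python
import re
from collections import defaultdict


def image_template(filename):
    """Return (template, image number) for a name like 'native_hi_001.mar3450'.

    The digits directly before the extension are taken as the image number
    and replaced by '#' characters in the template.
    """
    match = re.match(r"^(.*?)(\d+)(\.[^.]+)$", filename)
    if match is None:
        raise ValueError(f"no image number found in {filename!r}")
    prefix, digits, ext = match.groups()
    return prefix + "#" * len(digits) + ext, int(digits)


def group_by_template(filenames):
    """Group image numbers by template, one entry per sweep."""
    groups = defaultdict(list)
    for name in filenames:
        template, number = image_template(name)
        groups[template].append(number)
    return dict(groups)


# Assumed example filenames for the high and low resolution sweeps.
files = ["native_hi_001.mar3450", "native_hi_002.mar3450", "native_lo_001.mar3450"]
grouped = group_by_template(files)
print(grouped)
```

Images that share a template end up in the same group, which is why double clicking a single file is enough to add a whole sweep.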

Now open a new terminal window from the tool bar at the bottom of your screen.

Go to the data set directory by typing cd /users/thb/asb/data/native/images and ls

$ cd /users/thb/asb/data/native/images <CR>

$ ls <CR>

Now you see the list of image files, each corresponding to native data collected during one oscillation of the crystal. The image files have the extension .mar3450. Type more native_hi_001.mar3450 at the prompt.

$ more native_hi_001.mar3450 <CR>

This shows you the header of the image file, which contains important information about the detector distance, oscillation angle and so on.

Now go back to the iMosflm GUI and open the images by browsing to the above directory (go from your imosflm directory back to users, then to the thb directory and on from there) and double clicking native_hi_001.mar3450. The phi values for all the images (read from the image header) will be displayed in the Image window. All images with the same template belong to the same "Sector" of data. Multiple sectors can be read into the same session. Each sector can have a different crystal orientation. Note that a "Warning" has appeared. Click anywhere in the warnings box to get a brief description of the warning.


In this case, it is because the direct beam coordinates stored in Mar image plate images are not "trusted" by MOSFLM, and the direct beam coordinates have been set to the physical centre of the image. Click on the checkmark on the right hand side to turn off this warning.


The direct beam coordinates (read from the image header or set by MOSFLM) and crystal to detector distance are displayed. These values can be edited if they are not correct.

The Image Viewer

Once iMosflm loads the image, it will display it in a separate window. This contains tools for zooming in, panning around, displaying spots found, predictions, resolution rings etc. This is a good time to get a feeling for how well your crystal is diffracting, and whether or not you may want to modify your data collection to improve it.

The "Image" drop down menu allows display of the previous or next image in the series. The "View" drop down menu allows the image to be displayed in different sizes (related by scale factors of two), based on the image size and the resolution of the monitor. The line below allows selection of different images, either using right and left arrow or selecting one from the drop down list of all images in that sector. (The image being displayed can also be changed by double-clicking on an image name in the main "Images" window.) "+" and "-" will zoom the image without changing the centre. The "Fit image" icon will restore the image to its original size (right mouse button will have the same effect). The "Contrast" icon will give a histogram of pixel values. Use the mouse to drag the vertical dotted line, to the right to lighten the image, to the left to darken it. Try adjusting the contrast.


The six icons on the left control the display of the direct beam position, spots found for indexing, predicted spots, masked areas, spot-finding search area and resolution limits, respectively. These are followed by icons for Zoom, Pan and Selection Tools, and tools for adding spots manually (for indexing), editing masks, circle fitting and erasing spots or masks. The two things that you should always do at this point are to check that the direct beam position is correct (marked by a small green cross - if not shown, click the direct beam position icon '+'), and to use the masking tools to mask out the beamstop and the beamstop arm shadows. Including spots in integration that are partly obscured by obstructions like these can cause problems in both measurement and subsequent scaling. Select the masked area icon. A green circle will be displayed showing the default position and size of the beamstop shadow.

Make sure that the Zoom icon (magnifying glass) is selected and use the left-mouse-button (abbreviated to LMB in following text) to drag out a rectangle around the centre of the image. The inner dotted yellow rectangle will show the part of the image that will actually appear in the zoom.

Choose the Selection Tool. When placed over the perimeter of the circle, the radius of the circular beamstop shadow will be displayed. Use the LMB to drag the circle to increase its diameter to that of the actual shadow on the image. The position of the circle can be adjusted with LMB placed on the cross that appears in the centre of the green circle. BE CAREFUL not to move the direct beam position instead of the green circle! Adjust the size and position of the circle so that it matches the shadow.


To zoom and pan first select a region of the image to be zoomed with the Zoom tool. Select the Pan tool and pan the displayed area by holding down LMB and moving the mouse. This is rapid on a dedicated machine, but may be slow if run over a network.

Spot Finding

By default, two images 90° apart in phi (or as close to 90° as possible) will be selected and a spot search carried out on both images. Found spots will be displayed as crosses (red for those above the intensity threshold, yellow for those below). They will also be shown in the Image Display window. The number of spots found, together with any manual additions or deletions, is also given. You will use the default values selected automatically by iMosflm.
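The default choice of two images about 90° apart can be mimicked with a small sketch. It is an illustration of the idea only, not iMosflm's selection code. The function name is hypothetical and the phi values are invented (0.5° steps starting at 0°); in practice they are read from the image headers.

```python
def pick_second_image(phi_starts, first=0):
    """Index of the image whose start phi is closest to 90 degrees
    beyond the start phi of the image at index `first`."""
    target = phi_starts[first] + 90.0
    return min(range(len(phi_starts)), key=lambda i: abs(phi_starts[i] - target))


# Hypothetical sweep: 200 images of 0.5 degrees each, starting at phi = 0.
phi_starts = [i * 0.5 for i in range(200)]
second = pick_second_image(phi_starts)
print("Use images", 1, "and", second + 1)  # image numbers counted from 1
```

If the sweep covers less than 90°, the same function returns the last image, which is then as close to 90° away as the data allow.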

The text below, in underlined italics, is supplementary and may be skipped during the exercise - it may be informative when you work on your own data at a later point.

Images to be searched for spots can be specified in several ways:

1) Simply type in the numbers of the images (eg 1,84 above).

2) Use the "Pick first image" icon.

3) Use the "Pick two images ~90 apart" icon (default behaviour).

4) Use the "Select images ..." icon. If selected, all images in the sector are displayed. Click on an image to select it, then double click on the search icon for that image to run the spot search. The image will move to the top of the list (together with other images that have been searched).

Images to be used for indexing can be selected from those that have been searched by clicking on the "Use" button. If this box was previously checked, then clicking will remove this image from those to be used for indexing. It can be added again by clicking the "Select images ..." icon and clicking on the "Use" box.


Indexing & estimating mosaicity

When you click on the “Indexing” task, the main window pane changes, and spot finding is performed automatically on the first image and the one that is closest to 90° away in phi. Indexing is usually more reliable with two images selected this way, but if you uncheck the box on the right of each image’s statistics, you can deselect either image. If you want more (or different) images, click on the “select images” icon (it looks like a stack of blue plates) to display the complete list of images, and double click on the circular icon just left of the checkbox to initiate a spot search for that image.

The predicted pattern for the highlighted solution will be shown on the image display with the following colour codes:

Blue: Fully recorded reflection
Yellow: Partially recorded reflection
Red: Spatially overlapped reflection... these will NOT be integrated
Green: Reflection width too large (more than 5 degrees)... not integrated.

Since version 1.0.0 of iMosflm, indexing and mosaicity estimation are performed automatically once you enter the index task, so there is usually no need to press the buttons for these actions. If the image selection or indexing parameters are changed, the "Index" button must be used to carry out the indexing.

If indexing is successful, a list of 44 characteristic lattice solutions will be displayed. Good solutions usually have a penalty < 20, and there is often a big jump in penalty between the “worst good” solution and the “best bad” one. The best solutions (with penalty <= 50) will be refined, and the best high symmetry solution will be chosen automatically - this is usually, but not always, the right solution. In the present case you should use the automatically selected solution.

The indexing is very sensitive to errors in the direct beam coordinates. These should be correct to better than half the minimum spot separation. Errors in other physical parameters (wavelength, crystal to detector distance) can also result in failure. All these parameters should be checked; for example, is the current direct beam position behind the beamstop? If there are any ice rings, these can be used to determine the direct beam position. Several parameters used in autoindexing can be adjusted using icons that appear above the Indexing bar.

If the indexing fails

Weak images

1. MOSFLM automatically reduces the I/σ(I) threshold for weak images, and it may also reduce the resolution to 4 Å, but lower values can be tried. It is important not to include spots that are not "real": a small number of false spots can prevent the indexing from working.

2. Try changing the parameters for spot finding.

Multiple lattices

Try increasing the I/σ(I) threshold (default 20), for example to 40 or 60, so that only spots from the stronger lattice are selected. This can be done using the Indexing tab of Processing Options.

All cases

1. Include more images in the indexing.

2. In case the crystal orientation has changed, try indexing using only one image.

3. If there are ice rings/spots, use the ice/ring exclusion option.

4. If the cell parameters are known, reduce the maximum allowed cell edge to the known maximum cell edge. This can sometimes help filter out incorrect solutions.

5. If the detector distance is uncertain, and the images are high resolution (eg 2 Å), allow the detector distance to refine during cell refinement.

Note that the indexing is based solely on information about the unit cell parameters.
It will therefore be very difficult (or impossible) to determine the correct Laue group in the presence of pseudosymmetry; for example, a monoclinic space group with β ~ 90° will appear to be orthorhombic, and an orthorhombic space group with very similar a and b cell parameters will appear to be tetragonal. These can only be distinguished when intensities are available (i.e. after integration) by running POINTLESS. In addition, it is not possible to distinguish between Laue groups 4/m and 4/mmm, 3 and 3/m, 6/m and 6/mmm, m3 and m3m. This will not affect integration of the images, but it will affect the strategy calculation. In the absence of additional information, the lower symmetry should be chosen to ensure that a complete dataset is collected. The presence of screw axes also cannot be detected, so there is no basis on which to distinguish P21 from P2 etc. This does not affect any aspect of data collection or processing, and the space group can be chosen (on the basis of systematic absences) after integration by running POINTLESS.

For time reasons we will cheat a bit. From the indexing, P4 is selected (and at this point you would normally proceed with this selection). Because the structure has already been determined, we know that the space group for this crystal form is P43212. In the spacegroup pull down menu select P43212. Processing always proceeds better with a realistic value for the mosaic spread, so if iMosflm hasn't estimated it for you, do so before going on to the next step.

Mosaicity estimation

The mosaicity will be estimated automatically for the preferred solution. However, if another solution is chosen, the mosaicity should be estimated again. Click on the "Estimate" button to estimate the mosaicity.

A window will appear which plots the total predicted intensity as a function of mosaic spread, from which an estimate is determined. The prediction will be updated. Typing different values of mosaic spread in the box will allow a visual estimate of the effect of changing the mosaic spread.

Saving a session

The session can be saved at any time, using the Session drop down menu referred to in 2.1. The first time that a session is saved, you will be prompted for a filename. The filename convention for saved session files is that the extension is ".mos". Save the current session. Then exit from imosflm (using the Session drop down menu), restart imosflm and read in the saved session. If the program crashes (which happens...), it should be possible to recover all but the latest actions. Restart imosflm and it will pop up a "Recover session ..." window. Selecting the Recover button will restore the session as far as possible. This file is written to the directory ".mosflm" in the user's home directory.

Data collection strategy Once the crystal orientation has been determined, it is possible to calculate a data collection strategy and the Strategy icon is no longer greyed out (in fact, all other operations become possible at this point). Select the Strategy icon. This will open the Strategy window.

On the Strategy pane, the completeness of the dataset currently loaded is displayed in pie-chart form and as a percentage, for both unique reflections and for Bijvoet pairs. The current rotation range of the dataset is shown in the large sub-pane. Clicking on “auto-complete”, then checking the box labelled “Include existing sectors”, then “OK” will run the strategy calculation and let you know the most efficient way to complete your data collection. The statistics initially presented in the window are based on the assumption that all images between the smallest phi value and the largest phi value in the current sector(s) have been collected (the phi ranges are listed). If processing a dataset that has already been collected (as in this tutorial) this is appropriate. More usually, the Strategy option will be used after two initial images, 90° apart, have been collected, and the statistics will represent the completeness of that 90° of data. The orientation of the crystal, expressed as the angles between the a,b,c unit cell axes and the X,Y,Z coordinate frame, is given. X is along the X-ray beam, Z is the rotation axis. A warning will be given if the unique axis is so close to the rotation axis that there will be missing cusp data.

Refinement & Integration

The panes for these two tasks are quite similar, so it makes sense to treat them together.

Cell refinement

It is important to determine the cell parameters accurately before integrating the images. Although the unit cell is refined as part of the autoindexing, provided the diffraction extends beyond ~3.5 Å resolution it is possible to obtain more accurate cell parameters using a procedure known as post-refinement. This procedure requires the integration of a series of images in ideally two or more separate segments at widely different phi values. The distribution of the intensity of partially recorded reflections over the images on which they occur is used to refine the unit cell, crystal orientation and mosaic spread. Select the "Cell Refinement" icon.

Selecting the images

48

There are several ways to select the images to be used in cell refinement. Whichever method is used, the image numbers will be displayed in the Images list box.

On entering the Refinement task, iMosflm automatically chooses a suitable selection of images for stable postrefinement of your crystal’s cell parameters, and all refinement flags are set to allow parameters to be refined. Unstable parameters can be fixed by checking the corresponding box. Before starting cell refinement, check that the prediction for the first image to be used is OK. The majority of the spot positions must be correctly predicted, or else the refinement will not work. For this tutorial you will use the automatically selected images. Press the “process” button, and you can watch the progress of refinement in the autoscaled graphs.

The different windows are explained below. Moving the mouse over any parameter will give a full description of that parameter.

49

Check the stability of the refined parameters by displaying the appropriate graphs. To get sensible scaling of the graphs, only parameters with similar numerical values should be plotted together. Observe how the crystal orientation is different for images in the two segments. See how the mosaic spread varies during refinement. Check the spot profile for the images used in the cell refinement.

The detector parameters window The values of the refined detector parameters are displayed for each image as it is integrated. There is the option to fix any of the parameters during integration by checking the "Fix" box on the right. The positional residuals (overall, central and weighted) are also listed. Any of these parameters can be selected for plotting on the graph that appears to the right of this box by simply clicking on that parameter (which is then highlighted in blue, as shown for Beam X and Beam Y).

The crystal parameters window The values of the refined crystal missetting angles φ(x), φ(y), φ(z), the unit cell parameters and the mosaic spread for the current image are displayed in the table. Selectable parameters will be plotted as for the detector parameters. Specific unit cell parameters can also be fixed. If more than one segment of data is being used for cell refinement, it is not unusual to see a change in orientation of the crystal between the two segments.

The central spot profile The average spot profile for the central region of the detector is plotted for each image. This confirms that the spot prediction is good. If the profile is not well defined and central in the box, or if the blue border for the peak region of the spot is much larger than the apparent spot size, it suggests a problem with the integration and the initial prediction should be checked. In some cases it may be necessary to increase the "Profile Tolerance" parameters to ensure that the blue box fits the spot. To do this, select the Advanced tab of Processing Options.

50

Once refinement has converged (this may take several cycles; the current state is reported at the bottom left of the main window), both the starting and final values of the cell parameters are reported at the bottom of the pane.

If not using the automatically selected images, a list of images can simply be typed into this box. An image series can be specified as n-m where n and m are the first and last images. Different series must be separated by white space or a comma. For orthorhombic or lower symmetries, two segments should be given, approximately 90 degrees apart. For trigonal and higher symmetries, one segment can be sufficient, but for all but cubic symmetry, if the c axis happens to be approximately parallel to the X-ray beam for the chosen segment then it will not be well defined. Thus it is safer to use two segments in all cases, and for triclinic data three or four segments are best (e.g. φ = 0, 45, 90, 135). The number of images that should be included in each segment depends on the mosaic spread and oscillation angle. A reasonable number is 2*(mosaic spread/oscillation angle + 1). MOSFLM will automatically select two appropriate segments (for monoclinic/triclinic systems you may wish to add additional segments).

Integration

The accurate cell parameters are now used in the integration. Note that although the images are integrated during the cell refinement, the intensities are not saved and no MTZ file is generated. It is good practice to start by integrating a block of about ten images, to check that the parameters do not need further adjustment. MOSFLM will generate warning messages after integration if there are any difficulties, and it may be possible to improve the situation by changing some of the default parameters. In this tutorial we will integrate the full range of images. The Integration task will automatically include all images in the current session.
Note that the cell refinement is turned off by default (see the on-line Tutorial for an explanation). The mtz filename is indicated in a window on the top icon bar; this can be edited if you prefer another name.
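The two rules given above for cell refinement (the n-m notation for image series and the rule of thumb for the number of images per segment) can be sketched in a few lines of Python. This is only an illustration and not part of iMosflm: the function names are made up for this manual, and rounding the rule of thumb up to a whole image is our own assumption.

```python
import math
import re


def images_per_segment(mosaic_spread_deg, oscillation_deg):
    """Rule of thumb from the text: 2 * (mosaic spread / oscillation angle + 1)."""
    n = 2 * (mosaic_spread_deg / oscillation_deg + 1)
    # Assumption of this sketch: round up so a fractional result
    # never shortens the segment.
    return math.ceil(n)


def parse_image_series(spec):
    """Expand a list such as '1-5, 90-94' into individual image numbers.

    Series are written n-m and separated by white space or commas.
    """
    images = []
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        if "-" in token:
            first, last = (int(part) for part in token.split("-"))
            images.extend(range(first, last + 1))
        else:
            images.append(int(token))
    return images


if __name__ == "__main__":
    # Hypothetical values: 0.5 degree mosaic spread, 1.0 degree oscillation.
    print(images_per_segment(0.5, 1.0))
    print(parse_image_series("1-3, 90-92"))
```

For a hypothetical mosaic spread of 0.5° and an oscillation angle of 1.0°, the rule gives three images per segment, so two segments such as 1-3 and 90-92 would be a reasonable manual choice.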

The summary window This window presents a summary of how selected parameters have changed in different cycles of the cell refinement. The behaviour of these parameters gives a good indication of whether the cell refinement has been successful. If the refinement has been successful, the RMS residual should be lower for the final cycle than in previous cycles. However, errors in the cell can be compensated quite well by changing the crystal to detector distance, so the differences are not always dramatic.

51

Click on this icon so that the display will be updated. Press the process button. This starts integration, and the progress can be followed in the plots. The integration takes a rather long time.

Parameter display windows

The refined detector and crystal parameters will be displayed in tables and selected parameters will be plotted in graphs. The average spot profile for each image will also be displayed. The size of these graphs (and the profile) can be expanded to fill the whole window by holding down "shift" and clicking LMB anywhere in the graph window. A second "shift+click" will revert the graph to the original size. In addition to these windows, there are windows that tabulate and plot (as a function of image number) the mean I/σ(I) for profile fitted and summation integration intensities. The overall values and the values for the highest resolution bin are given. A display of the standard profiles for different regions of the detector is also provided. Poor profiles are "averaged", by including reflections from inner regions of the detector, and the display indicates which profiles have been averaged and allows inspection of the original "unaveraged" profile (providing there were sufficient spots in that region to allow formation of a profile).

The filename of the MTZ file containing the results of the integration is generated automatically, but can be edited manually.

The profiles should be checked to see that they are well defined and centred within the box. (Poorly defined, diffuse or non-centred profiles may suggest that the prediction is not very good, in which case this should be checked on the image).


Two icons appear when the integration option is chosen. If the "Show predictions ..." icon is clicked, the display window will be updated as each image is processed. This will slow down the processing, but allows the accuracy of the prediction to be checked for each image.

52

Smooth, gradual changes indicate that integration is probably proceeding well, and obvious discontinuities show that there may be something wrong with the dataset. The average profile for spots in the central part of the detector is displayed on the top right-hand window for each image, and the profiles for the profilefitting regions are displayed for each block of images.

The detector and crystal parameter plots should be examined carefully to check for any instability in the refinement. If there are large and random variations in some parameters (e.g. the detector twist and tilt) then it may be better to fix them and repeat the integration. Discontinuities due to blank images should also show up, and the offending images should be removed from the scaling run.

Finally, the lower left window plots I/σ(I) as a function of resolution for any selected image. Different parameters can be plotted by selecting the right or left pointing arrows.

53

If adjacent spots are incompletely resolved on the detector, it may be possible to improve the processing by increasing the PROFILE TOLERANCE parameters by 1-2%. These parameters can be set in the Advanced tab of the Processing Options menu (use View). Scaling the Data The idea is to scale together symmetry related intensity measurements and verify systematic absences. This is important, since it gives you the first really useful statistics on your dataset, so that you can judge their completeness, multiplicity and quality, and decide whether you should collect more data or perhaps use another crystal. Clicking on the “Quickscale” button will run Pointless followed by Scala, using a simple set of default parameters. With these, you can get a good idea of whether you have chosen the correct symmetry in indexing and how good your data are, but you may want to tweak some of the values to optimize your processing. Take a look at the output html file from the Quickscale. Find the scaling statistics table. It should look something like this:

Now construct Table 1 for your data (see page 61). Hint: to find the necessary Rsym (in this case equal to Rmerge) values and other information such as resolution bins, click [show full log file] in the pointandscale html document and scroll down (almost to the bottom) to the table shown below.

54

For explanations for the different types of R-values see Rupp pp. 412-13.

Idea: Just as there is an asymmetric unit in the unit cell of the crystal, there is an asymmetric unit in the reciprocal lattice. It is the smallest group of reciprocal lattice points that can reproduce the entire reciprocal lattice by symmetry operations. Evaluating the statistics from scaling can help you determine whether you have chosen the correct symmetry operators (space group) for the crystal.

Statistics:

Measured reflections - the total number of reflections observed in your data set. In the example it is 568,530, see next page. The more the better.

Unique reflections - the total number of reflections after symmetry averaging. This number is a subset of the total number of reciprocal lattice points in the asymmetric unit. If you take the number of measured reflections divided by unique reflections you get the overall redundancy of the data set. Bigger unit cells have more unique reflections.

Completeness overall - the percent completeness of your data set. This measures whether you collected all the reflections in the asymmetric unit.

Completeness in the last shell - the percent completeness of the highest resolution shell of your data set. You can't claim you have 1.8 Angstrom data unless this shell is fairly complete. This number keeps you honest.

Rsym overall - measures the agreement of symmetry related observations of a reflection. In this example, the symmetry related reflections agree to within 5.6%.

Rsym in the last shell - measures the agreement of symmetry related observations in the highest resolution shell. Generally, don't accept shells above 40%.

I/sigma - a measure of the signal to noise ratio.
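To make the definitions of redundancy and Rsym concrete, here is a minimal Python sketch. It is not how Scala computes its statistics (no scaling, weighting or outlier rejection is done); the function names are our own, and the intensities in the usage example are made-up toy numbers. The reflection counts are those of the Overall row of the Scala table reproduced in this manual.

```python
def redundancy(n_measured, n_unique):
    """Overall redundancy (multiplicity): measured / unique reflections."""
    return n_measured / n_unique


def r_sym(observations):
    """Rsym = sum |I - <I>| / sum I over groups of symmetry-related intensities.

    `observations` maps a unique reflection (h, k, l) to the list of its
    symmetry-related intensity measurements.
    """
    numerator = 0.0
    denominator = 0.0
    for intensities in observations.values():
        if len(intensities) < 2:
            continue  # a single measurement carries no agreement information
        mean_i = sum(intensities) / len(intensities)
        numerator += sum(abs(i - mean_i) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator


if __name__ == "__main__":
    # Counts from the Overall row of the Scala table in this manual.
    print(round(redundancy(424316, 30479), 1))  # 13.9
    # Toy intensities, invented purely for illustration.
    toy = {(1, 0, 0): [100.0, 110.0], (0, 1, 0): [50.0, 50.0]}
    print(r_sym(toy))
```

The redundancy obtained from the table counts agrees with the multiplicity (Mlplct) that Scala reports in the same row.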

If interested, you can find more text related to integration and scaling in the textbook (Rupp chapter 8 – especially pp.419-424 relate to iMosflm and Scala)

$TABLE: Completeness, multiplicity, Rmeas v. resolution, New :
$GRAPHS:Completeness v Resolution :N:2,7,8,10,11:
       :Multiplicity v Resolution :N:2,9,12:
       :Rpim (precision R) v Resolution :N:2,16,17:
       :Rmeas, Rsym & PCV v Resolution :N:2,13,14,15,18,19:
$$
  N 1/resol^2 Dmin  Nmeas  Nref Ncent %poss C%poss Mlplct AnoCmp AnoFrc AnoMlt Rmeas Rmeas0 (Rsym)  Rpim RpimO   PCV  PCV0
$$ $$
  1     0.042 4.88  17383  1267   432  98.6   98.6   13.7   98.9   99.4    8.5 0.025  0.027  0.024 0.009 0.007 0.031 0.033
  2     0.084 3.45  33444  2175   468 100.0   99.5   15.4  100.0  100.0    8.6 0.024  0.025  0.023 0.008 0.006 0.030 0.030
  3     0.126 2.82  42710  2744   470 100.0   99.7   15.6  100.0  100.0    8.5 0.030  0.030  0.028 0.010 0.008 0.036 0.036
  4     0.168 2.44  50045  3210   464 100.0   99.8   15.6  100.0  100.0    8.4 0.035  0.035  0.033 0.012 0.009 0.042 0.042
  5     0.210 2.18  55683  3591   466 100.0   99.9   15.5  100.0  100.0    8.3 0.033  0.033  0.031 0.011 0.008 0.040 0.041
  6     0.252 1.99  61120  3977   470 100.0   99.9   15.4  100.0  100.0    8.1 0.038  0.039  0.036 0.013 0.010 0.047 0.048
  7     0.294 1.85  65005  4271   460 100.0   99.9   15.2  100.0  100.0    8.0 0.049  0.049  0.046 0.017 0.013 0.059 0.061
  8     0.336 1.73  68274  4581   467 100.0   99.9   14.9  100.0  100.0    7.8 0.065  0.066  0.061 0.023 0.017 0.079 0.081
  9     0.378 1.63  29145  3892   316  80.8   96.8    7.5   64.9   64.9    4.5 0.064  0.065  0.058 0.027 0.020 0.078 0.082
 10     0.419 1.54   1507   771    65  15.1   84.9    2.0    4.2    4.2    1.9 0.054  0.056  0.044 0.030 0.028 0.061 0.066
$$
Overall            424316 30479  4078  84.9   84.9   13.9   80.7   80.7    7.7 0.034  0.034  0.031 0.012 0.009 0.041 0.042
                    Nmeas  Nref Ncent %poss C%poss Mlplct AnoCmp AnoFrc AnoMlt Rmeas Rmeas0 (Rsym)  Rpim RpimO   PCV  PCV0
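One way to read a table like this is to walk from low to high resolution and stop at the first shell that is no longer acceptable. The Python sketch below does this for three columns copied from the table above (Dmin, %poss and Rsym). The 40% Rsym limit is the rule of thumb quoted in this manual; the 90% completeness limit is only an assumption chosen for illustration, and the function name is our own.

```python
# (Dmin in Angstrom, %poss, Rsym) per resolution shell, copied from the table.
SHELLS = [
    (4.88, 98.6, 0.024),
    (3.45, 100.0, 0.023),
    (2.82, 100.0, 0.028),
    (2.44, 100.0, 0.033),
    (2.18, 100.0, 0.031),
    (1.99, 100.0, 0.036),
    (1.85, 100.0, 0.046),
    (1.73, 100.0, 0.061),
    (1.63, 80.8, 0.058),
    (1.54, 15.1, 0.044),
]


def resolution_cutoff(shells, min_completeness=90.0, max_rsym=0.40):
    """Return Dmin of the highest-resolution shell that passes both limits.

    Shells are scanned from low to high resolution and the scan stops at the
    first shell that fails, so an isolated good shell beyond a bad one is
    never accepted. Returns None if even the first shell fails.
    """
    cutoff = None
    for dmin, completeness, rsym in shells:
        if completeness < min_completeness or rsym > max_rsym:
            break
        cutoff = dmin
    return cutoff


if __name__ == "__main__":
    print(resolution_cutoff(SHELLS))                         # 1.73
    print(resolution_cutoff(SHELLS, min_completeness=80.0))  # 1.63
```

In this example Rsym stays far below 40% in every shell, so it is the rapidly falling completeness beyond 1.73 Å that limits the usable resolution.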

56

Day 4 Scalepack2mtz

Previously, you integrated a native proteinase K dataset using iMosflm and made a ‘quickscale’ using the program Scala. For this exercise you will get a number of proteinase K datasets which have been scaled in another program called Scalepack. To continue work on these data using the CCP4 programs, we need to convert the data format to the mtz-format using the Scalepack2mtz program.

In the following exercises, when you need to type something, or click on something, it will be shown in italic. Output from the programs or text from the interface is underlined.

Objective: To prepare data for work using the CCP4 package.

Idea: We will determine the structure of proteinase K using the CCP4 crystallographic suite of programs. CCP4 requires a specific format for the data set called mtz (read more here: http://www.ccp4.ac.uk/html/mtzformat.html).

Procedures: First go to your eu subdirectory by opening a terminal and typing cd asb/files/eu/ (see appendix for useful linux commands). Here you will find the .sca and .log files output from scalepack (eu.sca and eu_sca.log) for the Europium derivative dataset – use the command ls to list the files/directories. Have a look at the sca.log file for both eu and sm. With the following command you will get the last 200 lines of the log file. Compare with the native data set (using 2 terminal windows will make this easier).

$ tail -200 eu_sca.log

When you're done looking through and comparing your log files, go to the asb/ccp4 directory.

$ cd ../../ccp4

Preparations

Start the CCP4 suite by typing ccp4i &. Select the Directories&ProjectDir. Select Add project. Type asb in the Project box and browse to find the ccp4 directory within asb. For the TEMPORARY directory browse to the tmp directory in the ccp4 directory. Finally, in Project for this session of CCP4Interface choose asb. Press Apply&Exit.

Data conversion

Scalepack2mtz

1. Select the Program List module, and open the Scalepack2mtz task window.

57

2. On the first line enter a suitable job title such as Job title sca2mtz protk native data

3. Select Use anomalous data, Run old Truncate to convert intensities to structure factors and Ensure unique data & add FreeR column for 0.05 fraction of the data.

4. Browse for the In native data .sca file (in /users/username/asb/files/native/).

5. Make sure the output data go to the /users/username/asb/files/native/ directory. In Out select Full path and change ccp4 to native.

6. Select Crystal nat belonging to project asb

7. Dataset name nat

8. Data collected at wavelength 1.5418

9. Estimated nb. of res. in the asym. unit 279

10. You should not need to change anything else. Select Run -> Run Now.
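The FreeR column requested in step 3 sets aside a random 5% of the reflections for cross-validation. The idea can be sketched in Python as below. This is only an illustration and not the code CCP4 runs: the function name and the fixed random seed are our own, and the convention that flags run from 0 to 19 with flag 0 marking the free set is stated here as an assumption that you should check against the CCP4 documentation.

```python
import random


def assign_free_flags(n_reflections, free_fraction=0.05, seed=0):
    """Give every reflection a flag 0..n_groups-1.

    Assumption of this sketch: flag 0 marks the free (test) set, so with
    free_fraction = 0.05 about one reflection in twenty is flagged 0.
    """
    n_groups = round(1 / free_fraction)  # 0.05 -> 20 groups
    rng = random.Random(seed)            # fixed seed: reproducible flags
    return [rng.randrange(n_groups) for _ in range(n_reflections)]


if __name__ == "__main__":
    flags = assign_free_flags(10000)
    free = sum(1 for f in flags if f == 0)
    print(f"{free} of {len(flags)} reflections are in the free set")
```

Because the flags are assigned at random, the free set is only approximately 5% of the data. This is also why the derivative data should reuse the native flags (via CAD, see below) instead of getting a new random set.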

When the job has finished, return to the main window, highlight the job in the Job List, and select View Files from Job -> View Log Graphs. This task outputs a number of graphs for analysing the data, and we will just look at the Wilson plot. Check to which resolution you can regard the Wilson plot to be linear. A sharp shift in the curve at high resolution indicates the limit of the linear region. At this resolution you should set your resolution limit (in this case it seems to be 1.58 Å).
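The linear region of the Wilson plot is also what gives the overall temperature factor. With x = 1/d² and y = ln(<I>/Σf²), the Wilson relation reads y = ln(k) - B/(2d²), so the slope of the straight line is -B/2. The Python sketch below fits that line by least squares. The function names are our own, it is not the Truncate implementation, and the numbers in the usage example are synthetic values generated from the formula itself and not measured data.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def wilson_b_factor(inv_d_squared, ln_ratio):
    """Estimate B from the linear part of a Wilson plot.

    With x = 1/d^2 and y = ln(<I> / sum f^2), the Wilson relation
    y = ln(k) - B / (2 d^2) gives B = -2 * slope.
    """
    slope, _ = fit_line(inv_d_squared, ln_ratio)
    return -2.0 * slope


if __name__ == "__main__":
    # Synthetic points generated with B = 20 and k = 3 (not real data).
    x = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
    y = [math.log(3.0) - 20.0 * xi / 2.0 for xi in x]
    print(wilson_b_factor(x, y))
```

Only points from the linear part of the plot should be passed to such a fit; the low resolution region and the region beyond the sharp shift at high resolution would distort the slope.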

Repeat steps 2-10 for the derivative data (europium and samarium, termed eu and sm, respectively) with appropriate corrections. To do that, change In and Out in the Scalepack2mtz GUI to the eu or sm directories, one at a time. Make sure you have the correct output path: you will have to change native to eu and sm, respectively. It is also important to set the dataset name to eu and sm so that the refinement programs can discriminate the columns in the merged mtz file produced using CAD, see below. Do not add a FreeR column for the derivative data.

Now the native and derivative data are in mtz format and these can now be used for the remaining programs of the CCP4 suite.

We will now merge the data sets in order to get a single mtz file containing native and derivative data columns. Also we will extend the FreeR column from the native data to the derivative data. To do this we use the program CAD.

CAD

11. Select the Experimental Phasing module, select the Data Correction submenu and open the Merge Datasets (CAD) task window.

12. On the first line, enter a suitable job title such as:

Job title merging protk data

13. For Input file # 1 browse and insert your native .mtz file. Input all columns.

14. Select Add input MTZ file and insert your two derivative mtz files one by one. For all Input all columns

15. Select Complete reflection list and extend freeR column FreeR_flag from file # 1

16. Output MTZ protk.mtz

17. Select Run -> Run Now.

59

Now we have an mtz file called protk.mtz containing all the data.

A) Scaling and analysing datasets

The Problem

The file protk.mtz contains native data, plus two derivatives; Eu and Sm with their anomalous signals. First, we scale each derivative to the native dataset, so that all data are on the same scale. At the same time, we analyse the heavy atom data to estimate the strength of the signals.

Exercise

18. Select the Experimental Phasing module, and open the Scale and Analyse Datasets task window (using Scaleit).

19. On the first line, enter a suitable job title such as:

Job title Scaling protk data.

20. On the second line, select

Do scale refinement using Scaleit.

On the next line, select

Include anomalous difference data for each derivative

On the next line, select

Perform cross-comparison of derivative data sets

and de-select

and analyse anomalous differences

using the radiobuttons.

21. Select the input MTZ file

MTZ in asb protk.mtz

22. Now select the columns from the MTZ file. The first line has the native F_nat and SIGF_nat. Then select columns for the 2 derivatives, using the button Add Derivative Data to add more columns. You should only scale to the resolution limit you found from the Wilson plot (i.e. highest resolution is 1.58 Å). You should end up with:

60

Check that the output MTZ file is given as MTZ out asb protk_scaleit1.mtz

23. You should not need to change anything else. Select Run -> Run Now. 24. When the job has finished, return to the main window, highlight the job in the

Job List, and select View Files from Job -> View Log Graphs. This task outputs a large number of graphs for analysing the data, and we will just look at some of them.

25. We can gauge the strength of the isomorphous differences by looking at the graphs (for explanation see below):

Centric Normal probability v resolution and

Acentric Normal probability v resolution ...

Diso (the isomorphous difference) and Dano (the anomalous difference) are very useful analytical tools. Diso should fall off with increasing resolution, and certainly should not increase! That is a good indication of either non-isomorphism, or data quality falling off. You need to run your Pattersons with resolution ranges which only use reliable data, and with sensible EXCLUDE terms based on the plots of Diso and Dano. However MLPHARE has a built in weighting scheme which means that it doesn't do much harm to include less good data in phasing. After all the poor hkl should get low FOMs, and then DM can use the few reflections with reasonable phases to help in the phase extension procedure.

61

If there is only one derivative then the results of a normal probability analysis are also given (see Lynne Howell and Dave Smith, J.Appl. Cryst. 25 81-86 (1992)). The reflections in each resolution bin are sorted according to the value of:

delta(real) = (FPH - FP)/sqrt(SIGFPH**2 + SIGFP**2) where FPH and SIGFPH are the scaled values for the derivative. For each reflection, delta(expected) is then calculated based on an assumed normal distribution and the position of the reflection in the sorted list. A plot of delta(real) against delta(expected) is called a normal probability plot.

If the native and scaled derivative data sets are essentially identical (in statistical parlance, they represent two samplings of the same population), then the spread of the two data sets will be the same within the errors defined by SIGFP and SIGFPH, and the normal probability plot will be linear with a slope of about 1 and an intercept of 0. However, if the heavy atoms make a significant contribution to the observed structure factors, then (FPH - FP) will be larger than expected from SIGFP and SIGFPH, and the slope will be > 1. The intercept may also be non-zero.
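The normal probability analysis described above can be sketched in Python for a single resolution bin. This is not the Scaleit code: the function name is our own, and the plotting position (i + 0.5)/n used to obtain delta(expected) from the rank of a reflection is an assumption of this sketch, since the exact convention used by Scaleit is not given here.

```python
import math
from statistics import NormalDist


def normal_probability_fit(fp, sigfp, fph, sigfph):
    """Slope and intercept of the normal probability plot for one bin."""
    # delta(real) for every reflection, sorted, as defined in the text.
    delta_real = sorted(
        (fph_i - fp_i) / math.sqrt(sph_i ** 2 + sp_i ** 2)
        for fp_i, sp_i, fph_i, sph_i in zip(fp, sigfp, fph, sigfph)
    )
    n = len(delta_real)
    # delta(expected): normal quantile for each position in the sorted list.
    # The (i + 0.5) / n plotting position is an assumption of this sketch.
    normal = NormalDist()
    delta_expected = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    # Least-squares line through delta(real) against delta(expected).
    mean_x = sum(delta_expected) / n
    mean_y = sum(delta_real) / n
    sxx = sum((x - mean_x) ** 2 for x in delta_expected)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(delta_expected, delta_real)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A slope close to 1 with an intercept close to 0 means the derivative differs from the native no more than the measurement errors predict; a slope well above 1 points to a real heavy atom contribution.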

The program plots the slope and intercept of the normal probability plot (obtained by a least squares fit) as a function of resolution for both centric and acentric reflections. These values are also plotted for the case where reflections at the tails of the distribution are excluded: these reflections tend not to lie on the straight line and distort the least squares fit. The existence and size of the heavy atom contribution to the structure factors can be gauged from the values of the slope and intercept, and the variation with resolution indicates to how high a resolution such contributions extend. A similar analysis can be applied to MAD data by assigning FP and FPH to data at different wavelengths (dispersive differences) or to F+ and F- (anomalous differences). In general, the size of the slope will be smaller in this case (http://structure.usc.edu/ccp4/scaleit.html)

for each pair of wavelengths, e.g. ... FP = F_NAT FPH = F_EU SIGF_EU , F_SM etc DANO_EU SIGDANO_EU. For each graph, look at the line Gradient_on_reflection_prob.lt.0.9. Use the crosswires to estimate a rough value, e.g. for the native against the Eu derivative, the value is about 5.1 for centric data and 4.4 for acentric data.

26. The values can be summarised as (these values are contained in the file View Files from Job -> ...scaleit.summary):

62

This shows that the isomorphous difference (i.e. difference between native and derivative) is smallest for the Eu derivative and largest for Sm. The isomorphous difference is important for phasing using e.g. Single Isomorphous Replacement. Often, it is not possible to solve the structure from SIR phasing alone and a Multiple Isomorphous Replacement (MIR) strategy or MIR combined with Anomalous Scattering (MIRAS) is applied (for a MIR/MIRAS tutorial see appendix). As we will see later Single-wavelength Anomalous Diffraction (SAD) which only needs one data set can also be used as an experimental phasing strategy for these data. We will now go on with a Molecular Replacement phasing strategy.

63

Molecular replacement (MR)

We would now like to solve the structure of Proteinase K using molecular replacement. This method is very useful when no heavy atoms are present in the data set and a structural homologue is known or if you have collected a new data set of a protein with known structure but in complex with a ligand or in another conformation. To perform MR a search model is needed. This is often derived from a homologous protein with known structure. You can of course also use a model of your target protein if this is available. In this exercise we will perform MR using 1SH7 (1sh7.pdb), a serine protease showing sequence and structural homology to Proteinase K (see alignment below). This putative MR model was identified using the sequence profile based HHpred algorithm (http://toolkit.tuebingen.mpg.de/hhpred/). A trimmed version of 1SH7 will be made using the CCP4 program Chainsaw. To perform MR the powerful program PHASER (based on maximum likelihood statistics) is used.

Sequence alignment of Proteinase K with 1SH7. In the consensus sequence the residues of the catalytic triad of serine proteases are indicated with red asterisks. The alignment quality is indicated below the alignment.

64

Exercise Finding the solvent content and number of molecules in asymmetric unit

27. Select the Program List module and open the Matthews_coef task window.

28. In the Matthews_coef task window, adapt the job title:

Job title Solvent content estimate

29. Insert the protk.mtz file from the ccp4 directory

30. Set resolution limit to 1.58

31. Insert: Molecular weight of protein 28906.8

32. Select the Run Now button

33. The output is 1 molecule per asymmetric unit, a Matthews coefficient of 2 and a solvent content of 0.3958
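The numbers reported by Matthews_coef are related by two simple formulas: the Matthews coefficient Vm is the unit cell volume divided by the total protein mass in the cell, and the solvent fraction is commonly approximated as 1 - 1.23/Vm. The Python sketch below shows both. The function names are our own, and the unit cell volume has to be supplied by you because the cell dimensions are not listed in this section.

```python
def matthews_coefficient(cell_volume_a3, asu_per_cell, copies_per_asu, mw_da):
    """Vm in cubic Angstrom per Dalton.

    asu_per_cell is the number of asymmetric units in the unit cell
    (8 for space group P43212).
    """
    return cell_volume_a3 / (asu_per_cell * copies_per_asu * mw_da)


def solvent_fraction(vm):
    """Common approximation for proteins: Vs = 1 - 1.23 / Vm."""
    return 1.0 - 1.23 / vm


if __name__ == "__main__":
    # Back-calculate Vm from the solvent content reported in step 33.
    vm = 1.23 / (1.0 - 0.3958)
    print(round(vm, 2), round(solvent_fraction(vm), 4))
```

Back-calculating from the reported solvent content of 0.3958 gives a Matthews coefficient of about 2.04, which the program output rounds to 2.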

Trim the MR model using CHAINSAW

34. Select the Program List module and open the Chainsaw task window

35. In the Chainsaw task window, adapt the job title:

Job title Trim MR model 1SH7

36. Prune non-conserved residues to last common atom

37. For PDB input file browse to the /asb/files/mr/ directory and select the 1SH7_A.pdb file

38. Change the Input sequence alignment file in to ALN/Clustalw

39. For the Alignment in input browse to the /asb/files/mr/ directory and select the prk_1sh7.aln file

40. Select the Run Now button

41. The output file is 1SH7_A_chainsaw1.pdb

Search for molecular replacement solution using PHASER

Select the Molecular Replacement module, and open the Run Phaser task window (at the bottom of the list).

42. On the first line, enter a suitable job title such as

Job title MR using 1SH7 model

43. In the Define data folder, select:

MTZ in protk_scaleit1.mtz

F F_nat SIGF SIGF_nat

Run Phaser with the mtz space group and enantiomorph

44. Define ensembles (models)

Ensemble name ensemble1 Define ensemble via PDB file(s)

PDB #1 1SH7_A_chainsaw1.pdb

Similarity of PDB #1 to the target structure sequence identity 50.0

45. Define composition of the asymmetric unit

Component #1 protein Molecular weight 28906 Number of copies in the asymmetric unit 1

66

46. Search parameters

Perform search using ensemble1

47. Select Run -> Run Now.

48. Look at the new log file while the program is running. Check the log-likelihood (LL) gain and Z-score (rotational search only) for each cycle. After the rotational search the Z-score should be around 9. If the Z-score is above 5 the solution is significant, and if it is above 8, it is almost certain that the solution is correct. The final LL gain should be around +600. The LL gain cannot be compared between different PHASER runs (with different settings), but should be used to verify that the top solution is significantly above the remaining solutions and background (i.e. significantly above zero).
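The Z-score rule of thumb from step 48 can be written as a small Python helper, for instance for annotating your own notes on several runs. The function name and the wording of the returned labels are our own, and the label for scores of 5 or below is an assumption, since the text only says when a solution is significant.

```python
def interpret_z_score(z_score):
    """Apply the rule of thumb from step 48 to a PHASER Z-score."""
    if z_score > 8:
        return "almost certainly correct"
    if z_score > 5:
        return "significant"
    # Assumption of this sketch: the text does not label this range.
    return "not clearly significant"


if __name__ == "__main__":
    for z in (4.2, 6.5, 9.0):
        print(z, interpret_z_score(z))
```

The helper only covers the Z-score; the LL gain still has to be judged against the other solutions within the same run.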

49. Generation of weighted phases from the molecular replacement solution is done with the program Sigma-A. Select the Program List module and open the Sigma-A task window.

Tick the box next to Generate structure factors using SFALL and fill in your .mtz file and your PHASER solution .pdb (should be numbered according to the job list number and found in /ccp4/tmp/).

50. Use the Sigma-A .mtz file for the program DM (when you get to density modification, see p. 78) to obtain .mtz files both before and after density modification. You can look at maps from the data after you’ve been introduced to Coot (p. 80).

68

Single wavelength Anomalous Diffraction (SAD) Now you will be introduced to a new collection of programs suited for ultra fast structure determination (e.g. while still at the synchrotron...). This collection contains the programs SHELXC, D and E used for data conversion, heavy atom search and phasing combined with density modification, respectively. This is part of the CCP4 suite and we will use it to illustrate how fast a structure can be solved semiautomatically if the data are of sufficient quality.

In the CCP4 GUI Program List select the Shelx C/D/E task window

Choose a project name ProtK Eu SAD. Select SAD experiment. Insert the solvent fraction obtained from the Matthews coefficient (p. 68). Select the protk_scaleit1.mtz file as MTZ input. Select Input is in the form of structure factors. Select the Europium structure factors (F_eu) as F(+) and F(-). Change the heavy atom type to EU instead of SE in the Heavy Atom Search Parameters tab.

Select Run -> Run Now.

69

Look at the log file: highlight the job in the Job List, and select View Files from Job -> View Log File. Browse to the table of anomalous signal versus resolution. If there is no signal for a shell, the <d’/sig> and <d”/sig> should be about 0.80 – seems we have signal all the way!

Browse further down in the log file to follow the progress of ShelxD. Highlight the job in the Job List, and select View Files from Job -> View Log Graphs to get an overview. CCall and CCweak should be equal to or above 30 and 15, respectively. A CCall of 40-50% indicates a good solution; for SAD, values around 30 may still be correct. In the graphs you should check that the highest values of CCall, CCweak and PATFOM are well separated from the non-solutions.
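The rule of thumb above can be written down as a small check. This is only a sketch: the function name and the example numbers are hypothetical, and it does not replace looking at how well the best trials separate from the non-solutions in the log graphs.

```python
def assess_shelxd(cc_all, cc_weak, min_cc_all=30.0, min_cc_weak=15.0):
    """Apply the CCall/CCweak rule of thumb from the text (values in %)."""
    if cc_all >= min_cc_all and cc_weak >= min_cc_weak:
        return "promising"
    return "doubtful"


# Hypothetical values read off a ShelxD log, for illustration only.
print(assess_shelxd(cc_all=42.0, cc_weak=20.0))
print(assess_shelxd(cc_all=25.0, cc_weak=20.0))
```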

Check the log file and view log graphs for the progress of ShelxE. In the log graphs window check the connectivity vs. cycle graph. There should be a clear difference between the two hands (the correct one has the highest connectivity), and the connectivity should be high for a clear-cut result.

Now identify the file containing the phases derived from the correct hand (the protk_shelx_phs1.mtz file).

Open the CCP4i GUI and use the CAD program to merge the native structure factors in the protk_scaleit1.mtz file with the SAD phases from the .mtz file you just made (select the PHI_ori and FOM_ori columns from the protk_shelx_phs1.mtz file and all the columns from protk_scaleit1.mtz). Name the output MTZ file protk_eu.mtz and save it in the ccp4 directory. Using ARP/wARP for automatic model building (see p. 79), the full procedure of phasing and automatic model building (after integration and scaling of the data) should not take more than approximately 1½ h. That’s fast!

If you are lucky enough to have very good data diffracting to better than ~1.8 Å (as in our case with proteinase K), you do not need to obtain a heavy atom derivative or incorporate selenomethionine to obtain phase information. Instead you can use anomalous diffraction from the sulfur atoms in your protein (provided the protein contains methionines and cysteines, that is).

Now try to use the protocol above and do SAD on your native dataset. First count the number of sulfurs in proteinase K, e.g. by opening the protk.pir file in emacs. Alternatively, go to http://www.expasy.org/ and use the ProtParam tool under the proteomics and sequence analysis tools and identification and characterization menus: use the ID code P06873 for proteinase K, select the chain 106-384 (corresponding to mature proteinase K), then check the list of amino acids and find the number of M and C. It works! Fantastic, eh?
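Counting the sulfur-containing residues can also be done with a few lines of Python. This is a sketch under stated assumptions: the function name is made up, the example sequence is a short hypothetical string (not proteinase K), and it assumes you have already copied the plain one-letter sequence out of the .pir file without its header lines.

```python
from collections import Counter


def count_sulfur_residues(sequence):
    """Return (number of Met, number of Cys) in a one-letter sequence."""
    counts = Counter(sequence.upper())
    return counts["M"], counts["C"]


# Hypothetical toy sequence, for illustration only.
example = "MACKMTCGA"
n_met, n_cys = count_sulfur_residues(example)
print(f"Met: {n_met}, Cys: {n_cys}, sulfur atoms: {n_met + n_cys}")
```

Each methionine and each cysteine contributes one sulfur atom, so the sum is the number of anomalous scatterers to search for.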

Go to http://skuld.bmsc.washington.edu/scatter/AS_form.html to generate theoretical plots of f’ and f’’ of individual elements. Check out Eu, Sm and S in the range around the Cu-Kα edge (hint: tick ‘Mark X-axis at CuKa...’). How large a contribution (theoretically) does each element make at the wavelength we used?

Forgot the theory? Go to http://skuld.bmsc.washington.edu/scatter/AS_index.html


Stage 3. Solvent flattening and automated model building.

51. Select the Density Improvement module and open the Run DM task window.

52. In the Run DM task window, adapt the job title:

Job title Solvent flattening

53. Deselect Input Hendrickson-Lattman coefficients

54. Select the input MTZ file:

MTZ in protk_eu.mtz (the name of the mtz from the SAD job)

Now select the columns from the MTZ file.

FP F_nat
SIGFP SIGF_nat
PHIO PHI_ori
Weight FOM_ori

55. In the Required Parameters folder, enter the solvent content as

Fraction solvent content 0.3958.


56. Everything else can be left as default (the program will do solvent flattening and histogram matching), so Run -> Run Now.

57. When the job has finished, look at the log file and log graph to check statistics and solvent boundaries. To really appreciate the results, it would be best to look at the maps (one with the phases from just SAD (PHI_ori) and one with phases from DM (PHIDM)) and compare. We will go through map visualization in Coot below.

58. Select the Model Building module and open the ARP/wARP classic task window.

59. In the Arp/wARP task window, adapt the job title:

Job title Automated model building starting from experimental phases

60. Select the input MTZ file:

MTZ in protk_eu_dm1.mtz (or the name of the DM job output .mtz-file)

Now select the columns from the MTZ file.

FP F_nat
SIGFP SIGF_nat
PHIO PHIDM
Weight FOMDM

61. Select the input sequence file:

Sequence in protk.pir (this is just a one-letter aa-sequence file format)

62. In the required parameters box input:

There are 279 residues in the AU, which belong to 1 molecule(s)

63. Select use Free R

64. Select Run -> Run Now.

65. While the program is running, you can follow the progress by opening the log file in the View Files from Job menu.

Look for the connectivity index, which tells you about the extent of connectivity in the electron density, and for the number of chains and residues traced in the electron density. Finally, see if ARP/wARP docks the residues into the proteinase K sequence you gave it. (Actually, it will build and dock approximately 270 residues out of a total of 279, which is FANTASTIC! Do not expect this from your first structure...)

ARP/wARP outputs a model as a PDB file (jobnumber_warpNtrace.pdb) and an MTZ file (jobnumber_warpNtrace.mtz), refined by the REFMAC program using temporary models output by ARP/wARP during the auto-tracing cycles.

We will now have a quick introduction to the model building program Coot to have a look at the maps and model ARP/wARP produced (and maps from MR, sigma_A, SAD and DM). On day 5 we will go more into detail about how to build and correct the models using Coot.

Introduction to Coot

Coot (Crystallographic Object-Oriented Toolkit) is a macromolecular crystallographic modelling program. It can be used to look at biomacromolecular structures, to analyse them, to compare them, to modify them and to build them from scratch (using crystallographic data). With the short time available to do these computer exercises, we can only introduce you to a few of the basic functions that Coot offers, but you can obtain more information about the possibilities in Coot here:

• http://www.ysbl.york.ac.uk/~emsley/coot/

• http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/COOT

1 – Getting started

In this chapter you will learn how to run the Coot program, how to import a molecule into the program, how to display and view a molecule and how to save your data. Everything you need to type is written in bold. Make a directory named Coot in your asb directory. You can run the Coot program by typing "coot" at the terminal prompt in the Coot directory (or any directory, but to make things a bit easier use the Coot directory):

$ coot <CR>


When you first start Coot, it should look something like this.

Check that you have the two files prk.pdb and prk.mtz in your asb/files/ directory:

$ cd /users/username/asb/files/ <CR>
$ ls <CR>

If not, ask your instructors for help. The PDB file contains the coordinates of all non-hydrogen atoms of proteinase K, and the MTZ file contains a list of the structure factor amplitudes and phases of proteinase K. Now you are ready to open the proteinase K PDB file.

• Select “File” from the Coot menu-bar

• Select the “Open Coordinates” menu item [Coot displays a Coordinates File Selection window]

• Select prk.pdb from the “Files” list

• Click “OK” in the Coordinates File Selection window [Coot displays the coordinates in the Graphics Window]

Recenter on Different Atoms

• Select “Draw” from the Coot menu-bar

• Select “Go To Atom. . . ” [Coot displays the Go To Atom window]

• Expand the tree for the “A” chain

• Select 170 Ser in the residue list

• Click “Apply” in the Go To Atom window

• At your leisure, use “Next Residue” and “Previous Residue” (or “Space” and “Shift”+“Space” in the graphics window) to move along the chain.

• Click Middle-mouse over an atom in the graphics window

[Coot recentres on that atom]

Now try navigating the structure using the mouse control (zoom, turn, translate etc.). Go to the coot wiki to get more information - http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/COOT


Display maps

The electron density map can be calculated from the output of the refinement programs. A refinement program stores its data (labelled lists of structure factor amplitudes and phases) in the MTZ file. Let’s take a look. . .


• Select the “Auto Open MTZ” menu item [Coot displays a Dataset File Selection window]

• Select the filename prk.mtz

• Press “OK”

Two maps are now displayed (2Fo - Fc and Fo - Fc). You can change the contour level of the selected map using the mouse scroll-wheel. You will learn how to change the selected map in the next section.

Using the Display Manager

The Display Manager allows you to control which of the loaded coordinates and electron density maps are displayed and how they are displayed. It also allows you to change the appearance of the displayed structure (C-alpha backbone, bonds, C-alpha + ligands etc.) and the map color. Explore the functions of the Display Manager.

• Select “Display Manager” from the Coot menu-bar [Coot displays the Display Manager window]

• Press the pull down menu in the Display Manager window (Molecules section), to change how the molecule is displayed

• Press “Display” or “Scroll” in the Display Manager window (Maps section), to control which maps are displayed (Display) and which map responds to the mouse scroll-wheel for changing the contour level (Scroll)

• Explore the other functions in the Display Manager (Map color, Sigma level etc.)

Saving your work

To save the current state (loaded molecules, maps etc.):



• Select the “Save State..” menu item [Coot displays the Save state window]

• Input file name (e.g. myState.scm)

• Press “OK”

IMPORTANT: this only saves the state, not modified molecules. To save a molecule:


• Select the “Save Coordinates..” menu item [Coot displays the Save Coordinates window]

• Select the molecule you want to save from the pull down menu

• Press “Select filename”

• Input file name (e.g. myReBuild.pdb)

• Press “OK”

The new PDB file is now saved in your coot directory. It is advisable to save your work at regular intervals, in case of a system crash.


Day 5

Model Building

2 – Model building

In this chapter you will learn how to build in missing residues in the protein sequence, how to build the main chain manually and how to build in side chains. The prk.pdb file is an output file from ARP/wARP (but it is not the one from your own ARP/wARP run!). The ARP/wARP program has been used to automatically build most of the proteinase K protein by interpretation of the electron density map obtained from phase determination with SAD. We will have a look at this electron density map and compare it to the protein model built automatically from this map. Usually one goes through the whole protein from beginning to end and makes corrections, then writes out the new atom positions to a PDB file, uses refinement programs to obtain even better atomic positions, and has another look at the result in Coot. We are not going to do the whole process, but we have found an error in the PDB file that you have to correct. If you haven’t loaded the map (prk.mtz), load it. Center on residue A170 and see that there is some extra electron density which has no main chain or side chain built into it. By aligning the built sequence with the protein sequence of proteinase K, we can see that a proline and an alanine residue are missing.

The ARP/wARP program has simply skipped these two residues. To correct this you will need to do the following:


• Add new residues in the sequence of the PDB file.

• Build the main chain of the newly added residues.

• Correct the main chain and the side chains to give them the best possible fit to the electron density map.

All this is done with the Model/Fit/Refine controls in Coot.

• Select “Calculate” from the Coot menu-bar

• Select the “Model/Fit/Refine..” menu item [Coot displays the Model/Fit/Refine window]

First you will need to make space for the new residues in the molecule. Coot cannot add a new residue if there is no gap in the chain, so first we have to renumber the last part of the chain A (171 – 275).

• Select “Calculate” from the Coot menu-bar

• Select the “Renumber Residues...” menu item [Coot displays the Renumber residues range window]

• Select molecule (prk.pdb) and Chain (A) [already selected if you don’t have other molecules/maps loaded]

• Input residue numbers “171” to “275” and offset “2”

• Press “Renumber”

• Save the new coordinate file
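To see what the renumbering does to the coordinate file, the same shift can be sketched outside Coot in Python. This is an illustration only, not how Coot implements it: the function name is hypothetical, and it assumes a standard fixed-column PDB file (chain ID in column 22, residue number in columns 23-26).

```python
def renumber_pdb_lines(lines, chain, first, last, offset):
    """Shift residue numbers first..last of one chain by offset.

    Only ATOM/HETATM/ANISOU/TER records are touched; all other
    lines are passed through unchanged.
    """
    records = ("ATOM", "HETATM", "ANISOU", "TER")
    out = []
    for line in lines:
        if line.startswith(records) and len(line) >= 26 and line[21] == chain:
            field = line[22:26].strip()
            # Skip records without a usable residue number.
            if field.lstrip("-").isdigit():
                resnum = int(field)
                if first <= resnum <= last:
                    line = line[:22] + f"{resnum + offset:4d}" + line[26:]
        out.append(line)
    return out
```

Calling renumber_pdb_lines(lines, "A", 171, 275, 2) on the lines of prk.pdb would turn residue 171 into 173 and leave residues 1-170 untouched, which is the gap Coot needs for the two missing residues.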

You will now add the 2 missing residues. Go to residue 173. In the Model/Fit/Refine dialog press “Add terminal Residue..”, then click on residue 173 in the graphics window. Coot now places the new residue where it thinks it fits according to the electron density. If you are happy, press “OK” in the popup menu; otherwise drag the residue to where you want it using the mouse and then press “OK”. Now add another residue to the newly added residue 172. You have now added 2 new residues, both Ala, but residue 171 should be Pro. Use “Simple Mutate..” on residue 171 and change it to a Pro. You have now added and mutated the two missing amino acids, but they do not fit the electron density very well. You will take care of this next. Locate residues 173 and 170 using the mouse pointer or “Go To Atom. . . ” (click on residues to label them). Select “Real Space Refine Zone” and click on residues 173 and 170. Coot now fits the residues to the electron density. If you are happy with the fit, press “OK” in the popup menu. Otherwise try to drag the residues to the desired position using the mouse (select rotate/translate zone?), and evaluate whether the fit is better.


If time is available, go to the last amino acid, 277, and start building the two missing amino acids at the end (278 Gln and 279 Ala), or play around with different features in the Coot menu-bar. Finally, save your session (Save State) and the modified PDB file (Save Coordinates); name it e.g. newprk.pdb.

Refinement

The REFMAC program in CCP4 can carry out rigid body, TLS, restrained or unrestrained refinement against X-ray data, or idealisation of a macromolecular structure. The X-ray data are normally in the form of observed structure factor amplitudes, although the latest version of Refmac can refine against intensities. If good quality experimental phases are available, they can be used in addition. Refmac adjusts the coordinate parameters to minimise a Maximum Likelihood residual. REFMAC also produces an MTZ output file containing weighted coefficients for SigmaA weighted mFo-DFcalc and 2mFo-DFcalc maps, where "missing data" have been restored.

The Refmac5 user interface

Refmac5 will refine an atomic model by adjusting the model parameters (coordinates, B-factors, TLS etc) in order to obtain the model which best explains the experimental data (i.e. maximises the likelihood). Progress is measured by R-factor and Free R-factor, as well as by the likelihood scores themselves.

To run Refmac, you need at minimum two pieces of information: An MTZ file containing a set of observed structure factors (and optionally phase information), and an initial atomic model from Molecular replacement or Model building.

Launch the 'Refmac5' task in the Refinement module. The task interface (shown on the right) will appear. Now select what sort of refinement you wish to perform; this will usually be 'Restrained Refinement', but you will need to select what type of data to use:

• No phase information: Use this after molecular replacement or at very high resolution.

• Phase and FOM: Use this after experimental phasing only if Hendrickson-Lattman coefficients are not available.

• Hendrickson Lattman coefficients: Use this after experimental phasing.

Enter the names (or browse) for the two files as follows:

• The MTZ file containing your observed structure factors. Enter this in the field labelled 'MTZ in'. You will need to select the MTZ columns for the observed structure factor magnitude (F) and its standard deviation (sigma). If you elected to use phase information, then you should also select the best phase and figure of merit or Hendrickson Lattman coefficients output from the phasing program.


• The initial atomic model from molecular replacement or model building. Enter this in the field labelled 'PDB in'. You should check that Refmac5 understands the chemistry of the contents of the PDB file by running Review Restraints.

Running Refmac5 Exercise

65. From the Refinement module select Run Refmac5.

66. Enter a suitable job title such as

Job title restrained refinement for protK

Then

Do restrained refinement using no prior phase information input

Click Run Coot:find waters to ... on

67. Now select the input files - the experimental data:

MTZ in DATA protk_scaleit1.mtz

and make sure you have correct data columns:

FP F_nat
SIGFP SIGF_nat

and the coordinate file:

PDB in DATA newprk.pdb

68. Click on Run -> Run Now.

69. The job will take a little time. When it is finished look at the log file (click on the name of the job, refmac5, in the main window and use View Files from Job and View Log Graphs).

Go to the last table in the Tables in File and click on:

Rfactor analysis, stats vs cycle

You will see a graph of the R factor and the Free R factor for the cycles of refinement. The R factor is very good already but both go down a little.
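As a reminder of what the two curves measure, the R factor is the summed absolute difference between observed and calculated amplitudes divided by the summed observed amplitudes; the Free R factor is the same quantity over the reflections set aside from refinement. A minimal Python sketch follows. The function names and the numbers are hypothetical, and it assumes the calculated amplitudes are already on the same scale as the observed ones (Refmac does this scaling for you).

```python
def r_factor(f_obs, f_calc):
    """R = sum(|Fobs - Fcalc|) / sum(Fobs), for amplitudes on a common scale."""
    numerator = sum(abs(fo - fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


def r_work_and_free(f_obs, f_calc, is_free):
    """Split reflections by the free flag and return (R, Free R)."""
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free) if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free) if free]
    r_work = r_factor(*zip(*work))
    r_free = r_factor(*zip(*free))
    return r_work, r_free


# Hypothetical amplitudes, for illustration only.
f_obs = [100.0, 80.0, 60.0, 40.0, 20.0]
f_calc = [90.0, 85.0, 55.0, 45.0, 26.0]
is_free = [False, False, False, False, True]
r_work, r_free = r_work_and_free(f_obs, f_calc, is_free)
print(f"R = {r_work:.3f}, Free R = {r_free:.3f}")
```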


Also look at the Graphs in Selected Table for:

FOM vs cycle
-LLG vs cycle
Geometry vs cycle

The FOM tells you how well the molecule matches the experimental data and the Geometry tells you how well the molecule obeys the geometry restraints.

Also, slightly up the Tables in File list, select the last cycle:

Rfactor analysis, F distribution v resln

This is information about the last cycle of refinement. Have a look at: <Rfactor> v. resln. Which R factors did you end up with (also check the log file)? Is it good or bad for this resolution? What about the FOM?

The red line is the average R factor versus resolution for the data which is used and the green line is the Free R factor (for the 'free' data which is not used). This is similar across the resolution ranges - it does not go up for high resolution data. This is an example of what is good about maximum likelihood refinement compared with the old-fashioned least squares.

Also look at the graph:


<Fobs> and <Fc> v. resln

This is a graph of the average observed structure factors and calculated structure factors. You notice that at low resolution the observed (red) and the calculated (blue) are not the same. At low resolution the water atoms, which we can not see in the crystal structure, are an important part of the structure factors. The refinement program tries to model the water atoms by solvent scaling but it is difficult for this data because some of the very low resolution data is missing.

70. Look at the header of the output MTZ file: click on View Files from Job and select the file newprk_refmac1.mtz. If you have problems opening the MTZ file header, you can use the following command from the terminal window:

$ mtzdump hklin newprk_refmac1.mtz <CR>

and then type run and press enter. In the file you will see:

* Column Labels :

H K L FNAT SIGFNAT FreeR_flag FC PHIC FWT PHWT DELFWT PHDELWT FOM

You may see more columns in your mtz file. The new data in the file is:

FC & PHIC the structure factors and phases calculated from the final coordinates

FWT & PHWT the 'best' structure factors and phases weighted by the maximum likelihood function

DELFWT & PHDELWT the 'best' structure factors and phases for a difference map

FOM figure of merit for PHIC

If you selected the option to create output maps then you can look at the maps created from the REFMAC output (or just use the mtz files in Coot)

...FWT.map the 'best' weighted map

...DELFWT.map the 'best' weighted difference map


An example of these maps is shown below for a tyrosine residue which is in the wrong place. The DELFWT map is the weighted difference map of F(observed) - F(calculated) and looks like this:

Here you can see a large pink area of negative density where the tyrosine side chain is now. This is saying that the side chain should not be here. The large brown-red area of positive density is showing where the side chain should be.

The FWT map is the weighted map and looks like this:

You can see a region of density to the left of the tyrosine, which is where it should go.


More about the Refmac5 program output

Double-click your Refmac job in the CCP4 task list to view the log file. This contains both crystallographic statistics and warning messages. Use the 'Find' button to search for the word 'warning' to identify any problems with the input model.

To judge the quality of the refinement model, click 'Show log graphs' and scroll down to the last graph in the list, titled 'Rfactor analysis, stats vs cycle' (see the example on the right). This shows the variation in R-factor and Free R-factor against cycle number. The R-factor and Free R-factor should decrease as the calculation proceeds.

Check the difference between R-factor and Free R-factor. If the Free-R factor is much higher than the R-factor, then you do not have enough data to support the level of detail you are trying to refine. You will need to reconsider the model you are using, for example try using isotropic or overall B-factors instead of anisotropic.

At the end of the log file is a useful table listing R / Rfree, -LL / -LLfree (LL is the log likelihood) and the deviation from ideal geometry against refinement cycle. zBOND and zANGL are the Z-scores of the deviations of bond lengths and angles from ideal values: at high resolution these should approach 1.0 while at low resolution, where structures are tightly restrained to ideal values, they should approach 0.0.

Validation

The aim of this chapter is to show you how to validate your finished protein structure. You will be introduced to the CCP4i programs PROCHECK and BAVERAGE. PROCHECK checks the geometry of the model, both for each amino acid and for the molecule as a whole. The idea is to show the areas that might contain errors so that the crystallographer can check these manually and correct the geometry if it is wrong. Ligands and so forth can distort the geometry, so even if PROCHECK reports an area as deviating from the norm, this might not actually represent an error. To run the program we first need to write out our atom positions from Coot. After the last modification, we can do (or already did) that with the Save Coordinates command in the File menu in Coot, choosing a name for the new file and the modified molecule to write out. We have already done this earlier in this exercise (newprk.pdb, see p. 87). Use your refined structure. Then, in a new terminal window, open the CCP4 package program:

$ ccp4i <CR>

In the program list, you will find the PROCHECK program. Open it, and add the coordinate file newprk.pdb and the MTZ file prk.mtz. Click Run Rampage, Procheck and Sfcheck on. When the program has finished running, check the log file and the log graphs, especially the Ramachandran plot and the Chi1-Chi2 plot. Use a graphics program like the Gimp to view the postscript files output from Procheck (locate the files in the output folder assigned in Procheck).

1) The Ramachandran plot (newprk_rampage1.ps) shows the phi-psi torsion angles for all residues in the structure (except those at the chain termini). Glycine residues are separately identified by triangles as these are not restricted to the regions of the plot appropriate to the other sidechain types. The colouring/shading on the plot represents the different regions: the darkest areas (shown in red) correspond to the "core" regions representing the most favourable combinations of phi-psi values. Ideally, one would hope to have over 90% of the residues in these "core" regions. The percentage of residues in the "core" regions is one of the better guides to stereochemical quality.

• How many residues are in the most favourable region?

• Which residues lie outside this region and what are the implications?

2) The Chi1-Chi2 plots show the chi1-chi2 sidechain torsion angle combinations for all residue types whose sidechains are long enough to have both these angles. The shading on each plot indicates how favourable each region on the plot is; the darker the shade, the more favourable the region. The shading is based on a data set of 163 non-homologous, high-resolution protein chains chosen from structures solved by X-ray crystallography to a resolution of 2.0 Å or better and an R-factor no greater than 20%. The numbers in brackets, following each residue name, show the total number of data points on that graph. The red numbers above the data points are the residue numbers of the residues in question (i.e. showing those residues lying in unfavourable regions of the plot).

If you are going to publish a structure, you would usually report the percentage of residues found in the favoured/allowed/generously allowed/disallowed regions of the Ramachandran plot together with your other crystallographic data. These values can be found at the bottom of the log file. Furthermore, you would also publish the overall B-factor calculated from your protein structure. Before you started building your protein model, you obtained an overall B-factor calculated from your reflection file. This was obtained from the Wilson plot when you ran the program TRUNCATE. This B-factor needs to be compared with your final overall B-factor, which can be calculated with BAVERAGE, which calculates the average B over main chain and side chain atoms in your coordinate file. Open BAVERAGE, which you will find in the PROGRAM LIST, and enter the newprk.pdb file (or your refined structure). View the log file. You will find the overall B-factor approximately in the middle of the log file (just after a long list of values). The two B-values should be approximately the same. If the final B-factor is higher than the Wilson B-factor, you have probably built too much model into the electron density. If the final B-factor is much lower than the Wilson B-factor, you have probably missed something in the protein structure.
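The kind of average reported for the model can be illustrated with a short Python sketch that reads the B-factor column of a PDB file directly. This is not BAVERAGE's implementation: the function name is hypothetical, it assumes standard fixed-column ATOM records (atom name in columns 13-16, B-factor in columns 61-66), and it ignores HETATM records such as waters.

```python
MAIN_CHAIN_ATOMS = {"N", "CA", "C", "O"}


def average_b(pdb_lines):
    """Return mean B for (main chain, side chain, all) protein atoms.

    A value is None when there are no atoms in that group.
    """
    main, side = [], []
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        atom_name = line[12:16].strip()
        b_value = float(line[60:66])
        if atom_name in MAIN_CHAIN_ATOMS:
            main.append(b_value)
        else:
            side.append(b_value)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(main), mean(side), mean(main + side)
```

For example, passing the lines of newprk.pdb (e.g. from open("newprk.pdb")) returns three averages; the overall value is the one to compare with the Wilson B-factor.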


You have now learned two methods to validate your final protein structure. You can also use the Validate menu in Coot to display Ramachandran plots, geometry and rotamer analysis etc. Use this feature to have a look at your own maps and models from SAD or MR. Also, make anomalous maps using FFT in CCP4i: select Run FFT to generate anomalous map in the FFT task window, using e.g. DANO_EU to calculate the anomalous peaks in the Eu dataset. Also look at the anomalous peaks in the native dataset.

Look at the anomalous maps in Coot (use 'open map' in the File menu) and load the model and the electron density map from your own ARP/wARP run. Compare and evaluate the anomalous peaks. Can you explain all the high peaks (sigma above approximately 3.5) in the maps? How well do the peaks agree with your model? Do you agree that anomalous maps are powerful validation/analysis tools in structure determination? Also, visit the MOLPROBITY website: http://molprobity.biochem.duke.edu/ (very nice validation site). Follow the instructions on the website (click Evaluate X-ray structure).


Day 6

Structure analysis

When you have finished model building and validation, the next stage is to analyse your structure. Some basic types of analyses are normally applicable to all new structures. These include:

• Analysis of surface electrostatics

• Analysis of conserved residues on the surface of the molecule

• Structural comparison to other models in the PDB database

We will now do these analyses and others on our proteinase K model.

Electrostatics

1. Go to: http://nbcr-222.ucsd.edu/pdb2pqr_1.8/

Upload your PDB file and mark the field "Create an APBS input file". You can check out the http://apbs.sourceforge.net website for more info about APBS.

2. Download the .pqr file (save as 2PKC.pqr) and copy the content of the .in file (APBS input file) into emacs and save as apbs.in

3. Edit the APBS input file in emacs so it looks like the following example (don’t correct the numbers):

read
    mol pqr 2PKC.pqr
end
elec
    mg-auto
    dime 161 161 97
    cglen 110.9114 111.6849 69.2699
    fglen 85.2420 85.6970 60.7470
    cgcent mol 1
    fgcent mol 1
    mol 1
    lpbe
    bcfl sdh
    pdie 2.0000
    sdie 78.5400
    srfm smol
    chgm spl2
    sdens 10.00
    srad 1.40
    swin 0.30
    temp 298.15
    gamma 0.105
    calcenergy total
    calcforce no
    write pot dx 2PKC
end


It is important to add "write pot dx ..." because otherwise no map will be written out. Run the script in the terminal window by typing:

$ apbs apbs.in <CR>

4. Load the PDB file and map into PyMOL (for PyMOL tutorial see appendix) and color the surface according to the map.

This is done according to the following example (but with your own files) by typing at the command line:

load 2PKC.pdb, 2PKC
load 2PKC-PE0.dx, map
ramp_new pot, map, [-25,0,25]
set surface_color, pot, 2PKC
show surface, 2PKC

Try to change the +/- values in "ramp_new pot, map, [-25,0,25]". What can you say about the surface electrostatics of proteinase K?

Conservation

Go to the ConSurf server: http://consurftest.tau.ac.il/ Click where appropriate (as a first test, answer that you don’t have an alignment and use default settings). Display your conservation in PyMOL (hint: look at the bottom of the result webpage of ConSurf). You can look at both the a) and b) PDB file options. For an explanation of the coloring scheme, go back to the ConSurf web site and click overview. On the overview page click coloring scheme and read on. You can change the coloring scheme manually by editing the consurf_new.py script using the RGB color code (but don’t spend time on this now). The default coloring scheme is:

[Color scale running from Variable to Conserved]

If time allows, try uploading an alignment. The alignment in FASTA format can be found in the files directory (prk_algn.fas). Also take a look at another server: http://www.mcgnmr.ca/ProtSkin/intro/

Dali


Other important aspects of analysing your structure of course include checking how well the model agrees with existing biochemical and mutational data.

The Dali server is a network service for comparing protein structures in 3D. You submit the coordinates of a query protein structure and Dali compares them against those in the Protein Data Bank. A multiple alignment of structural neighbours is emailed back to you. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. If you want to know the structural neighbours of a protein already in the Protein Data Bank (PDB), you can find them in the Dali Database. The Dali Domain Dictionary is a numerical taxonomy of all known structures in the PDB.

Go to http://ekhidna.biocenter.helsinki.fi/dali_server/ Upload your latest PDB file to the server and submit the request. You may get queued, or maybe you are just too impatient to wait for the result. If so, click on the link to the Dali Database. Here you enter a PDB identifier for a known protein (e.g. 2PKC in your case) and submit. Click on 2pkcA and look at the list that appears (see below for explanations of the columns or read the explanations on the web site). If you would like a quick look at the overlay between proteinase K and another protein from the list, just tick the box on the left, go to the top and start the Jmol Applet.

When Dali has finished you receive the result by e-mail; it is a long list. The column format for the first part (SUMMARY) is:

No: Chain Z rmsd lali nres %id PDB Description

The first column is the rank of the structural alignment based on Z-score, with the structure with the highest homology to proteinase K having number 1. If you click on the number you go to the structure-based pairwise sequence alignment. The second column is the specific chain from the aligned structure that has been superimposed on proteinase K. If you click on it you go to the Dali list for this protein chain. The third column is the Z-score (pairs with Z<2 are structurally dissimilar) and the fourth column is the RMSD in Å. The fifth column is the length of the alignment (in amino acids) and the sixth column is the length of the aligned structure or neighbour (in amino acids). The seventh column is the overall sequence identity in the alignment. The last two columns are a link to the PDB file of the superimposed structure and a description of the protein/molecule, respectively.

Go to the Pairwise Structural Alignment section (scroll down). Here the aligned residues (superimposed fragments) can be seen. Check out which structural homologues proteinase K has and which regions of the model are especially widespread in other structures. Any surprises?

Other analyses

Can you think of other interesting things to analyse regarding the proteinase K structure? For instance take a look at (using a number of the programs you have been introduced to...):

- Crystal packing
- Anomalous sites/ion sites
- The hydration layer
- The secondary structure
- Function: can you find homologous structures with inhibitors/substrates bound? Try to explain the function/catalytic mechanism from analysis of these structures and your structure of proteinase K.

Discuss the relevance of analysing the above aspects of your crystal structure.
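
Returning to the Dali result list: the SUMMARY columns described earlier (No, Chain, Z, rmsd, lali, nres, %id, PDB, Description) can be read into a small data structure, which makes it easy to discard hits with Z<2. The following is only a minimal Python sketch. It assumes that the fields are separated by whitespace, which may not match the e-mailed file exactly, and the two sample rows are invented for illustration; they are not real Dali output.

```python
from dataclasses import dataclass


@dataclass
class DaliHit:
    rank: int          # column 1: rank by Z-score
    chain: str         # column 2: chain of the aligned structure
    z: float           # column 3: Z-score
    rmsd: float        # column 4: RMSD in Angstrom
    lali: int          # column 5: length of the alignment
    nres: int          # column 6: length of the aligned structure
    pct_id: float      # column 7: sequence identity in %
    description: str   # column 9: free-text description


def parse_summary_line(line):
    """Split one SUMMARY row, assuming whitespace-separated columns."""
    fields = line.split(None, 8)  # the description may contain spaces
    if len(fields) < 8:
        raise ValueError("not a SUMMARY row: %r" % line)
    return DaliHit(
        rank=int(fields[0].rstrip(":")),
        chain=fields[1],
        z=float(fields[2]),
        rmsd=float(fields[3]),
        lali=int(fields[4]),
        nres=int(fields[5]),
        pct_id=float(fields[6]),
        description=fields[8] if len(fields) > 8 else "",
    )


def significant(hits, z_cutoff=2.0):
    """Keep only hits at or above the Z-score cutoff."""
    return [h for h in hits if h.z >= z_cutoff]


# Invented example rows (hypothetical chains and values).
rows = [
    "1: xxxx-A 30.5 1.2 250 275 40 PDB HYPOTHETICAL PROTEASE",
    "2: yyyy-B 1.5 4.8 60 310 8 PDB HYPOTHETICAL UNRELATED PROTEIN",
]
hits = [parse_summary_line(r) for r in rows]
for hit in significant(hits):
    print(hit.rank, hit.chain, hit.z, hit.description)
```

With the real list, the same filter gives a quick shortlist of structural neighbours worth inspecting in the pairwise alignment section.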

Exercise

Write down some lines (approximately ½-1 page) in which you sum up your analysis and discuss the data, and hand them in to your instructors.


Docking

The following is a docking tutorial using the Situs package (http://situs.biomachina.org/). For the structural analysis of many biologically important complexes, several structural methods must be combined to obtain a complete structural description. These methods could be X-ray crystallography, Small-Angle X-ray Scattering (SAXS) and cryo-Electron Microscopy (cryo-EM). One way to combine them is to dock crystal structures into maps produced by SAXS or cryo-EM.

Correlation-Based Docking Tutorial

This tutorial teaches new users the application of the exhaustive 6D search and FFT-accelerated docking program colores for rigid-body docking. Colores provides options for contour-based enhancement of the fitting contrast by means of a Laplacian filter that can also be turned off if desired. The results of this tutorial can be compared to solutions distributed with the tutorial software. More documentation is available in the user guide, on the methodology page, and in the published articles.

The user can compare all generated files to the files in the "solutions" directory. Here is an overview of the fitting scenario. In the following, we will use the first two files for docking:


Data Flow and Design

The series of steps and the programs that are required to use colores for the docking of an atomic-resolution structure to low-resolution data are shown schematically in the following figure. Detailed program explanations are given in the Situs user guide.

Schematic diagram of colores related routines. Major Situs components (light blue) are classified by their functionality. The main work flow is indicated by brown arrows, and the optional inspection of Euler angles in dark blue arrows. The visualization (orange) for the rendering of the models requires a molecular graphics viewer (we use here the VMD graphics program, Chimera and Sculptor also support Situs format).

Standard EM formats are supported and are converted to cubic lattices in Situs format. This is done with the map2map utility. Subsequently, the data is inspected and, if necessary, prepared for the fitting using a variety of visualization and analysis tools. All Situs tools require one volume and one PDB structure for the fitting. Atomic coordinates in PDB format can be transformed to low-resolution maps, if necessary, and vice versa, to allow docking of maps to maps or structures to structures. The resulting docked complex can be inspected in the graphics program. Also, if a subset of Euler angles is chosen these can be inspected after conversion into PDB format with the eul2pdb tool.

Running colores

For this particular docking case it will be sufficient to perform a reduced angular search (sampled at 20°, option -deg 20) to restore the original atomic structure that we used to generate the 15 Å simulated EM map (target resolution 15 Å, option -res 15). After the exhaustive search is done, the best 6 on-lattice maxima (option -explor 6) will be refined (off-lattice) using Powell optimization. Here is the command that runs this search (include the leading ./ only when running from within the Situs directory; otherwise omit it):

./colores 0_hexamer.situs 0_monomer.pdb -res 15.0 -deg 20 -explor 6

When starting the program (by default as a single processor run), it will first display the user-assigned and default-assigned options. Here you can check whether all the input options that you requested were understood, and whether you agree with the default assignment of the other options (the user-assigned options are flagged with <== in the output). For example, at 15 Å resolution the program uses by default the Laplacian filter that enhances the fitting contrast, and assigns a density threshold of 0.0, i.e. only positive values will be considered:

colores> Options read:
colores> Target resolution 15.000 <== -res 15
colores> Resolution anisotropy 1.000
colores> Low-resolution map cutoff 0.000
colores> Laplacian filtered correlation
colores> FFT grid size expansion factor 0.200 (thickness of additional zero layer as fraction of map dimensions)
colores> Euler angles generation using Proportional method
colores> Angular sampling accuracy 20.000 <== -deg 20
colores> Euler angle range: [0.000:360.000] [0.000:180.000] [0.000:360.000]
colores> Sculptor mode OFF
colores> Number of best fits explored 6 <== -explor 6
colores> Original peak search by sort and filter
colores> Powell maximization ON
colores> Powell tolerance 1.00E-06 Max iterations 25
colores> Powell trans & rot initial step sizes set to default values
colores> Powell correlation algorithm determined automatically
colores> Peak sharpness estimation ON
colores> Number of SMP processors requested: 1

Then the input files will be read. You can check if the map parameters, as well as the size of the atomic structure, are as expected:

colores> Processing low-resolution map.
lib_vio> Situs formatted map file 0_hexamer.situs - Header information:
lib_vio> Columns, rows, and sections: x=1-43, y=1-41, z=1-29
lib_vio> 3D coordinates of first voxel: (-84.000000,-80.000000,-68.000000)
lib_vio> Voxel size in Angstrom: 4.000000
lib_vio> Reading density data...
lib_vio> Volumetric data read from file 0_hexamer.situs
lib_vwk> Setting density values below 0.000000 to zero.
lib_vwk> Remaining occupied volume: 51127 voxels.
lib_vwk> Map size changed from 43 x 41 x 29 to 37 x 35 x 23.
lib_vwk> New map origin (coord of first voxel): (-72.000,-68.000,-56.000)
lib_vwk> Map density info: max 5.881478, min 0.000000, ave 0.709373, sig 1.280587.
_____________________________________________________________________________
colores> Processing atomic structure.
lib_pio> 303 atoms read.
colores> Geometric center: -12.369 -36.951 -9.252, radius: 40.526 Angstrom

Next, the Gaussian filter that is used for lowering the resolution of the atomic structure, and the Laplacian filter that is used to add the contour information to the fitting criterion, are generated:

lib_vwk> Generating Gaussian kernel with 7^3 = 343 voxels.
lib_vwk> Generating Gaussian kernel with 11^3 = 1331 voxels.
lib_vwk> Generating Laplacian kernel with 3^3 = 27 voxels.
lib_vwk> Generating kernel with 9^3 = 729 voxels.
lib_vwk> Map size expanded from 37 x 35 x 23 to 61 x 57 x 41 by zero-padding.
lib_vwk> New map origin (coord of first voxel): (-120.000,-112.000,-92.000)
colores> Identifying inside or buried voxels and creating flipped mask...
colores> Found 11957 inside or buried voxels (out of a total of 142557).
colores> Identifying inside or buried voxels...
colores> Found 11957 inside or buried voxels (out of a total of 142557).
colores> Memory allocation for FFT.
colores> FFT planning...


As a test, the correlation is calculated with the probe structure centered in the target density map. In this case, since the low-resolution map has a "hole" in the center, the Laplacian correlation yields negative values:

colores> Testing the maps and correlations.
colores> Projecting probe structure to lattice...
colores> Low-pass-filtering probe map...
colores> Target and probe maps:
lib_vwk> Map density info: max 5.881478, min 0.000000, ave 0.148212, sig 0.675602.
lib_vwk> Map density info: max 5.614532, min 0.000000, ave 0.025529, sig 0.279312.
colores> Projecting probe structure to lattice...
colores> Applying filters to target and probe maps...
lib_vwk> Relaxing 5 voxel thick shell about thresholded density...
colores> Normalizing target and probe maps...
colores> Target and probe maps:
lib_vwk> Map density info: max 9.253166, min -12.816330, ave -0.000063, sig 0.560102.
lib_vwk> Map density info: max 16.796933, min -30.926755, ave -0.004420, sig 0.472828.
colores> Writing target and probe maps for inspection or debugging...
lib_vio> Writing density data...
lib_vio> Volumetric data written to file col_hi_fil.sit
lib_vio> File col_hi_fil.sit - Header information:
lib_vio> Columns, rows, and sections: x=1-61, y=1-57, z=1-41
lib_vio> 3D coordinates of first voxel: (-120.000000,-112.000000,-92.000000)
lib_vio> Voxel size in Angstrom: 4.000000
lib_vio> Writing density data...
lib_vio> Volumetric data written to file col_lo_fil.sit
lib_vio> File col_lo_fil.sit - Header information:
lib_vio> Columns, rows, and sections: x=1-61, y=1-57, z=1-41
lib_vio> 3D coordinates of first voxel: (-120.000000,-112.000000,-92.000000)
lib_vio> Voxel size in Angstrom: 4.000000
colores> Computing correlation between maps in direct space...
colores> Correlation with structure centered in density map: -4.2820480E-02
colores> Computing correlation in Fourier space...
colores> FFT correlation with structure centered in density map: -4.2820480E-02

Now the uniform distribution of the Euler angles that covers the rotational search space is computed. The triplets of Euler angles are saved in the file col_eulers.dat. This file can be edited, or reused whenever you perform a search with identical sampling:

colores> Getting Euler angles.
lib_eul> Proportional Euler angles distribution, total number 1908 (delta = 20.000000 deg.)
colores> Total number of orientations sampled: 1908
colores> Euler angles saved in file col_eulers.dat.

Here follows the system-dependent time estimate for the 6D on-lattice search:

colores> Time of one FFT calculation: 23.093000 ms
colores> Average time spent on each rotation: 47.894200 ms
colores> Estimated time for full 6D (on-lattice) search: 0 h 1 m 31 s
colores> Off-lattice Powell optimization will take significant extra time.
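
The estimate printed above is consistent with simply multiplying the number of sampled orientations (1908) by the average time per rotation (47.8942 ms). The short Python sketch below reproduces that arithmetic. That colores computes its estimate exactly this way, and that it truncates rather than rounds to whole seconds, are assumptions made here for illustration.

```python
def estimate_search_time(n_orientations, ms_per_rotation):
    """Return (hours, minutes, seconds) for an on-lattice search.

    Assumes total time = orientations x average time per rotation,
    truncated to whole seconds.
    """
    total_seconds = int(n_orientations * ms_per_rotation / 1000.0)
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours, minutes, seconds


# Values taken from the log output shown above.
h, m, s = estimate_search_time(1908, 47.894200)
print(f"{h} h {m} m {s} s")  # 0 h 1 m 31 s
```

The same function can be used to judge beforehand how much longer a finer angular sampling would take, once the number of orientations for that sampling is known.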

Then the 6D on-lattice search is performed. During this search progressive information about the best translational fit for each orientation is written to the file col_rotate.log. A progress bar keeps you informed:

colores> Starting 6D on-lattice search with 3D FFT scan of Euler angles.
colores> Searching using 1 processors
colores> |##################################################| 1908/1908 | 100% done
colores> Actual time spent on 6D on-lattice search: 0 h 1 m 33 s

Next follows a peak search of the maximal correlation values, after which the program enters the Powell off-lattice optimization of the selected (-explor) 6 highest scoring maxima:

colores> Translation function peak detection.
colores> Peak filter contrast: maximum 0.924312, sigma 0.064743
colores> Contrast threshold: 0.129486, candidate peaks: 185
colores> Found 72 non-redundant peaks.
_____________________________________________________________________________
colores> Off-lattice search (Powell's optimization method).
colores> Determining most efficient correlation algorithm based on convergence and time...
colores> Original algorithm: Correlation = -0.01498504 Time = 1.498000 ms
colores> Masked algorithm: Correlation = -0.01506857 Time = 1.376500 ms
colores> One-step algorithm: Correlation = -0.01498504 Time = 1.105000 ms
colores> Using one-step correlation function.


colores> Shown are: offset (in A) from reference center (2.000,2.000,-10.000),
colores> Euler angles (in degrees), and correlation value.
colores>
colores> Performing optimizations...
colores>
colores> Powell optimization for score maximum no. 1.
colores> X Y Z Psi Theta Phi Correlation
colores> 24.000 -32.000 0.000 0.000 0.000 300.000 3.8181654E-01 Initial
colores> 23.718 -31.115 0.621 -0.795 0.132 300.007 3.9282045E-01 1
colores> 23.888 -31.124 0.635 -0.814 0.160 300.008 3.9303528E-01 2
colores> 23.887 -31.124 0.633 -0.814 0.160 300.008 3.9303573E-01 3
colores> 23.887 -31.124 0.633 -0.814 0.160 300.008 3.9303573E-01 4
colores> 23.887 -31.124 0.633 359.186 0.160 300.008 3.9303573E-01 Final
colores>
colores> Powell optimization for score maximum no. 2.
colores> X Y Z Psi Theta Phi Correlation
colores> -28.000 28.000 0.000 0.000 0.000 120.000 3.8181517E-01 Initial
colores> -27.718 27.115 0.621 -0.795 -0.132 120.007 3.9281792E-01 1
colores> -27.888 27.124 0.635 -0.814 -0.160 120.008 3.9303254E-01 2
colores> -27.887 27.124 0.633 -0.814 -0.160 120.008 3.9303300E-01 3
colores> -27.887 27.124 0.633 -0.814 -0.161 120.008 3.9303301E-01 4
colores> -27.887 27.124 0.633 179.186 0.161 300.008 3.9303301E-01 Final
colores>
colores> Powell optimization for score maximum no. 3.
colores> X Y Z Psi Theta Phi Correlation
colores> -40.000 -8.000 0.000 0.000 0.000 60.000 3.6045063E-01 Initial
colores> -40.280 -9.766 0.644 -0.345 -0.085 60.000 3.8385660E-01 1
colores> -40.330 -9.824 0.643 -0.282 -0.052 60.000 3.8393576E-01 2
colores> -40.338 -9.814 0.644 -0.253 -0.047 60.000 3.8394305E-01 3
colores> -40.366 -9.796 0.647 -0.152 -0.031 60.000 3.8396217E-01 4
colores> -40.367 -9.795 0.647 -0.149 -0.031 60.000 3.8396229E-01 5
colores> -40.367 -9.795 0.647 179.851 0.031 240.000 3.8396229E-01 Final
colores>
colores> Powell optimization for score maximum no. 4.
colores> X Y Z Psi Theta Phi Correlation
colores> 36.000 4.000 0.000 0.000 0.000 240.000 3.6044767E-01 Initial
colores> 36.280 5.766 0.644 -0.345 0.085 240.000 3.8385590E-01 1
colores> 36.330 5.824 0.643 -0.282 0.052 240.000 3.8393511E-01 2
colores> 36.338 5.814 0.644 -0.254 0.047 240.000 3.8394234E-01 3
colores> 36.367 5.796 0.647 -0.151 0.031 240.000 3.8396147E-01 4
colores> 36.367 5.795 0.647 -0.149 0.031 240.000 3.8396158E-01 5
colores> 36.367 5.795 0.647 359.851 0.031 240.000 3.8396158E-01 Final
colores>
colores> Powell optimization for score maximum no. 5.
colores> X Y Z Psi Theta Phi Correlation
colores> -16.000 -40.000 0.000 0.000 0.000 0.000 3.5694334E-01 Initial
colores> -14.184 -38.973 0.621 -0.535 -0.365 -0.088 3.9651018E-01 1
colores> -14.301 -39.025 0.653 -0.535 -0.463 -0.141 3.9669580E-01 2
colores> -14.296 -39.054 0.662 -0.535 -0.521 -0.158 3.9670773E-01 3
colores> -14.292 -39.075 0.664 -0.535 -0.531 -0.171 3.9671264E-01 4
colores> -14.292 -39.075 0.666 -0.535 -0.533 -0.171 3.9671285E-01 5
colores> -14.292 -39.075 0.666 179.465 0.533 179.829 3.9671285E-01 Final
colores>
colores> Powell optimization for score maximum no. 6.
colores> X Y Z Psi Theta Phi Correlation
colores> 12.000 36.000 0.000 0.000 0.000 180.000 3.5694033E-01 Initial
colores> 10.184 34.973 0.621 -0.535 0.365 179.912 3.9650980E-01 1
colores> 10.301 35.025 0.653 -0.535 0.463 179.859 3.9669525E-01 2
colores> 10.296 35.053 0.662 -0.535 0.521 179.841 3.9670722E-01 3
colores> 10.292 35.075 0.664 -0.535 0.532 179.829 3.9671214E-01 4
colores> 10.292 35.076 0.666 -0.534 0.533 179.828 3.9671237E-01 5
colores> 10.292 35.076 0.666 359.466 0.533 179.828 3.9671237E-01 Final
colores>
colores> Powell optimization time (6 runs): 10.687460 s

As we will see later, the six saved fits correspond to the symmetry-related placements of the monomer in the hexameric density. The peak sharpness is estimated for every solution (this can be turned off) and the highest-scoring results are written to PDB files:

colores> Renormalizing correlation values by highest score.
colores> Writing translation function lattice to Situs file.
lib_vio> Writing density data...
lib_vio> Volumetric data written in Situs format to file col_trans.sit
lib_vio> Situs formatted map file col_trans.sit - Header information:
lib_vio> Columns, rows, and sections: x=1-61, y=1-57, z=1-41
lib_vio> 3D coordinates of first voxel: (-120.000000,-112.000000,-92.000000)
lib_vio> Voxel size in Angstrom: 4.000000
colores> Writing translation function lattice information to log file.
_____________________________________________________________________________
colores> Saving the best results.
colores> Estimating peak sharpness and writing best fit no. 1 to file col_best_001.pdb.
colores> Estimating peak sharpness and writing best fit no. 2 to file col_best_002.pdb.
colores> Estimating peak sharpness and writing best fit no. 3 to file col_best_003.pdb.
colores> Estimating peak sharpness and writing best fit no. 4 to file col_best_004.pdb.
colores> Estimating peak sharpness and writing best fit no. 5 to file col_best_005.pdb.
colores> Estimating peak sharpness and writing best fit no. 6 to file col_best_006.pdb.

Finally, in addition to the output coordinate files, a number of output log files (described in the next section) are saved:

colores> Output files:
col_best*.pdb => Best docking results in PDB format with info in header
col_eulers.dat => colores-readable list of Euler angles
col_rotate.log => Rotation function (unnormalized) log file
col_trans.log => Translation function (norm. by best fit) log file
col_trans.sit => Translation function (norm. by best fit) in Situs format
col_lo_fil.sit => Filtered target volume in Situs format, just prior to correlation calculation
col_hi_fil.sit => Filtered (and centered) probe structure in Situs format, just prior to correlation calculation
col_powell.log => Powell optimization log file
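
To compare the refined fits, it can help to collect the "Final" row of each Powell optimization and sort by correlation. The following Python sketch does this for screen output saved as text. It assumes the layout printed in the Powell listing above (a "score maximum no." line followed by rows ending in "Final"); the format of col_powell.log itself is not assumed to be the same. The three embedded rows are copied from that listing.

```python
import re

# Excerpt of the screen output shown earlier (maxima 1, 3 and 5 only).
LOG = """\
colores> Powell optimization for score maximum no. 1.
colores> 23.887 -31.124 0.633 359.186 0.160 300.008 3.9303573E-01 Final
colores> Powell optimization for score maximum no. 3.
colores> -40.367 -9.795 0.647 179.851 0.031 240.000 3.8396229E-01 Final
colores> Powell optimization for score maximum no. 5.
colores> -14.292 -39.075 0.666 179.465 0.533 179.829 3.9671285E-01 Final
"""


def final_fits(log_text):
    """Collect the refined offset, Euler angles and correlation per maximum."""
    fits = []
    current = None
    for line in log_text.splitlines():
        header = re.search(r"score maximum no\.\s*(\d+)", line)
        if header:
            current = int(header.group(1))
            continue
        if current is not None and line.rstrip().endswith("Final"):
            # Drop the "colores>" prefix and the trailing "Final" label.
            x, y, z, psi, theta, phi, corr = (float(v) for v in line.split()[1:-1])
            fits.append({
                "peak": current,
                "offset": (x, y, z),
                "euler": (psi, theta, phi),
                "correlation": corr,
            })
    return fits


ranked = sorted(final_fits(LOG), key=lambda f: f["correlation"], reverse=True)
for fit in ranked:
    print(fit["peak"], fit["correlation"])
```

Note that the order after off-lattice refinement need not match the on-lattice order: in this excerpt maximum no. 5 ends with a higher correlation than maximum no. 1.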

Inspect the results of your docking tutorial using PyMOL (see appendix) or UCSF Chimera (http://www.cgl.ucsf.edu/chimera/).
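
If you want to repeat the search with a different resolution, angular sampling or number of explored maxima, the command line can be assembled in a small script. The Python sketch below only builds and prints the command, using the three options introduced in this tutorial (-res, -deg, -explor). The helper name colores_command is hypothetical, and actually running the program (for example with subprocess.run) assumes that colores is on your search path.

```python
import shlex


def colores_command(map_file, pdb_file, res, deg, explor, program="colores"):
    """Build the argument list for a colores run (hypothetical helper)."""
    return [
        program, map_file, pdb_file,
        "-res", str(res),
        "-deg", str(deg),
        "-explor", str(explor),
    ]


cmd = colores_command("0_hexamer.situs", "0_monomer.pdb", res=15.0, deg=20, explor=6)
print(shlex.join(cmd))
# To run it for real (assumption: colores is installed and on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the options in one place makes it easier to document which settings produced which set of col_best files.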


Appendix


Making pretty pictures

There are several things to be aware of when creating graphics for publications and screen. With PyMOL, the process is intuitive and easy if you remember a few basic things outlined in the following. Basically, the process can be broken down into 3 steps:

1. Display and orient the molecule in the way you want it to appear in the figure

2. Make settings to the display that are compatible with the medium in which the image will appear

3. Render the figure and save it to a file

Working with several views

One feature in PyMOL that can be useful when generating figures is scenes. Using scenes it is possible to store up to 12 different views of the present molecule and quickly recall them with the F1-F12 keys (on a PC). To store the present view as a scene, select one of F1, F2, ..., F12 from the Store submenu in the Scene main menu in PyMOL. To later recall one of these views, simply press the corresponding F-key, or select one of F1, F2, ..., F12 from the Recall submenu in the Scene main menu. PyMOL will now smoothly rotate and zoom the molecule to the stored view. This makes it possible to store several interesting views while working with the figure(s), which one can then return to later.

The next steps depend on whether the figure is going to be used in print or online on a computer, the web, or in a presentation. For printed and some online figures it is most convenient to use a white background, as many printers have difficulties reproducing a black background properly. But the choice of background colour also affects the other colours used in the figure, as these obviously cannot be too light if the background is white, so it is generally advised to set the background to white early on while working with views and colours to get the best impression of the final figure. To set a background colour other than the default black, simply select it from the Background submenu in the main Display menu (or use the command line: bg_color white).


Colour space

A more technical issue, which becomes very important when producing figures for print, is the choice of colour space. Computer screens, LCD projectors and the like produce colours by the additive colour scheme, in which a mixture of 3 colours (RGB = red, green, and blue) gives white. This happens because such devices use light of different wavelengths to generate different colours, and a mixture of all visible wavelengths of light gives white light. In contrast, printed colours arise by the subtractive colour scheme, in which the mixture of 3 coloured inks (cyan, magenta, and yellow; the K in CMYK is a separate black ink) gives black. In this case, the colours are produced by mixing inks of different colours. The upshot is that, because colours are produced in physically different ways, no device, whether a screen, LCD projector, or printer, is able to display all possible colours; the range of colours a given device can reproduce is referred to as its colour space or gamut.
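
The relation between the additive and the subtractive scheme can be illustrated with the textbook conversion from RGB to CMYK. The Python sketch below uses this naive, device-independent formula with all values in the range 0 to 1. It is meant only to show the principle; real print workflows use device colour profiles, and this is not claimed to be the conversion PyMOL applies internally.

```python
def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK conversion, all components in the range 0..1."""
    k = 1.0 - max(r, g, b)          # black ink covers the missing brightness
    if k >= 1.0:                    # pure black: no coloured ink needed
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)   # cyan absorbs red
    m = (1.0 - g - k) / (1.0 - k)   # magenta absorbs green
    y = (1.0 - b - k) / (1.0 - k)   # yellow absorbs blue
    return c, m, y, k


print(rgb_to_cmyk(1.0, 1.0, 1.0))  # white: no ink at all
print(rgb_to_cmyk(1.0, 0.0, 0.0))  # primary red: full magenta and yellow
print(rgb_to_cmyk(0.0, 0.0, 0.0))  # black: black ink only
```

White on the screen (all three lights on) corresponds to no ink on paper, while a primary screen colour such as red needs two inks at full strength.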

This may seem very technical, but the fact is that if a figure intended for print is produced using the standard RGB colour scheme used on the screen, it is very likely that not all colours can be reproduced correctly in print; in other words, some colours will be "out of gamut". The three primary RGB colours, red, green, and blue, are particularly problematic, but unfortunately they are often the easy first choice for figures as they are among the first listed colour possibilities.

Fortunately, there are several features in PyMOL which can help avoid colour space problems. First, several very useful colours are available under each main colour; for example, one can select the non-primary colours tv_red, raspberry, and darksalmon in the reds section of the colour menu (available by clicking C next to the object) and avoid the primary colour, red.

Secondly, PyMOL is able to work with the CMYK colour scheme on the screen which ensures that all colour tones are compatible with printing. So, whenever working with figures for print, make sure that you select CMYK (for publications) from the Colour Space submenu in the main Display menu. You will immediately notice that the colours change overall and become more subdued. This shouldn't be a cause for concern as the colours are now being restricted to lie within the gamut of both the RGB and CMYK colour schemes (since the screen is not able to display all CMYK colours) and everything should look fine once the figure is printed.


Rendering the image

The final stage in producing your high quality image is the rendering process (also called "ray-tracing"), which is a computationally intense procedure that results in a life-like, almost photo-realistic representation of your molecule. Anyone can see immediately that the rendered image is a lot "nicer" than the non-rendered image, but this niceness can be broken down to a few central features:

• lighting - advanced lighting is applied; for example, it is possible for part of the structure to cast shadows on other parts

• anti-aliasing - high-contrast edges, for example between the molecule and the white background, are evened out by an advanced colour gradient method.

• fog (optional) - fog may be used to fade parts of the structure in the background and thereby create focus on other parts

To render (ray-trace) the current view in PyMOL, simply click the button named Ray or type ray at one of the command prompts.

Finally, save your high-quality image by selecting Save Image... from the main File menu (be careful not to click anywhere inside the image as this will revert to the non-rendered image).
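
For print, the rendered image also needs enough pixels for the intended physical size. As a rough guide, the pixel dimensions are the figure size in inches multiplied by the print resolution in dots per inch. The Python sketch below does this calculation and writes out the corresponding PyMOL command lines (bg_color, ray with a width and height, and png with a dpi value). The figure size, the resolution of 300 dpi and the file name are assumptions chosen for illustration; check the requirements of the journal or printer you are targeting.

```python
def render_commands(width_in, height_in, dpi, filename):
    """Return PyMOL command lines for rendering a figure of a given print size."""
    width_px = round(width_in * dpi)    # pixels = inches x dots per inch
    height_px = round(height_in * dpi)
    return [
        "bg_color white",
        f"ray {width_px}, {height_px}",
        f"png {filename}, dpi={dpi}",
    ]


# Hypothetical example: a 3.5 x 3.5 inch panel at 300 dpi.
commands = render_commands(3.5, 3.5, 300, "proteinase_k.png")
print("\n".join(commands))
```

The printed lines can be typed at the PyMOL prompt one by one or saved in a script file and run from there.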

Exercise

Now make pretty pictures of the overall proteinase K structure in a favorable position and the active site from a nice point of view. Also make some interesting pictures of the surface electrostatics and conserved surface residues. SAVE the pictures and print them out with an explanatory legend. Hand your figures in to your instructors for comments.


Introduction to PyMOL

PyMOL is one of the best programs for making pretty pictures of a protein of interest. PyMOL is a versatile program with many built-in functions. It has both a GUI and a command line interface. Using the command line interface it is possible to write scripts for performing complex tasks. If you would like to download the program to your own computer, go to the website pymol.sourceforge.net, click the link "Download" and follow the instructions.

Go to the ccp4 directory by typing cd asb/ccp4/ from your main directory. Open PyMOL by typing:

$ pymol

Two windows are created: ‘PyMOL Viewer’ and ‘PyMOL Tcl/Tk GUI’

How do I load a PDB file from the web and display a molecule?

As an example, we will use the structure of proteinase K, which can be found at the Protein Data Bank with PDB identifier 2PKC. Open a browser and go to the website: http://www.rcsb.org

Search the site for the structure of proteinase K by typing its pdb code: 2PKC.

To download the coordinates of proteinase K press on PDB File and download it to the computer. Move or save it in the pymol directory.

You can open this PDB file in PyMOL either by typing a command or by using the GUI.

Using the GUI to open the PDB file:

In PyMOL, select Open... from the File menu and load the entire PDB file (2PKC.pdb) into the program.

If you prefer to open the PDB file from the command line, type: load 2PKC.pdb in the GUI.


! If the PDB file is not in the directory from which you started PyMOL, PyMOL will not be able to find it with this command unless you specify the full path to the directory in which the PDB file is located.

The proteinase K structure is now displayed in the viewer window, which has two parts. The major part is the actual graphics display area. To the right of the graphics display area is a part containing a list of the objects being displayed at the top and information regarding the Mouse Mode at the bottom. The function of the mouse buttons can be modified by combining them with the Shift and Ctrl keys. "SnglClk" means single click and "DblClk" means double click. Try out the functions of the mouse and be sure you know how to rotate, slab and zoom in and out of the molecule.

Selection

At present you only have two objects in your object list: (all) and 2PKC. (all) is a composite object always consisting of all objects in the list. If you click on the name of an object it will be hidden, and if it was hidden it will be shown again.

Now, click with the left mouse button on an atom in the protein and watch the white log window at the top of the gui screen. Clicking inside the protein will select the amino acid residue that is clicked, and identify the atom in the log window, like so:

You clicked /2PKC.pdb/2PKC//GLY`267/CA
Selector: selection "sel01" defined with 4 atoms.

The program uses PyMOL's standard selection format to show the selection, which in its complete form works like this:


/model/segment/chain/residue/atom
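
To see how the five fields map onto the message shown in the log window, the path can be split at the slashes. The following Python sketch does this for the example /2PKC.pdb/2PKC//GLY`267/CA. The function name and the dictionary keys are made up for this illustration, and the sketch only handles the simple fully written form shown here, not the complete PyMOL selection syntax.

```python
def parse_selection_path(path):
    """Split '/model/segment/chain/residue/atom' into its five fields."""
    if not path.startswith("/"):
        raise ValueError("expected a path starting with '/'")
    parts = path[1:].split("/")
    if len(parts) > 5:
        raise ValueError("too many fields in %r" % path)
    parts += [""] * (5 - len(parts))        # missing trailing fields are empty
    model, segment, chain, residue, atom = parts
    # The residue field is written as NAME`NUMBER, e.g. GLY`267.
    if "`" in residue:
        resn, resi = residue.split("`", 1)
    else:
        resn, resi = "", residue
    return {
        "model": model, "segment": segment, "chain": chain,
        "resn": resn, "resi": resi, "atom": atom,
    }


clicked = parse_selection_path("/2PKC.pdb/2PKC//GLY`267/CA")
print(clicked)
```

In this example the chain field is empty, which is why two slashes follow each other in the path.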

If you would like to deselect, press the "Deselect" button in the command line interface.

When you selected the residue, it also appeared in the object list as (sel01). If you would like to keep the selection you can rename it to something meaningful, such as its residue number, or you can remove it. Both can be done by clicking on the A menu next to the object and then choosing either "delete selection" or "rename selection".

Another way to make a selection is with a command. To select the residues forming the active site of proteinase K, namely Asp39, His69 and Ser224, type the following in the command line interface:

select active_site, (resi 39 or resi 69 or resi 224)

In the command window you will see that 24 atoms were selected.

Let's look a bit more at the command we gave: The general statement

select name, selection

is used to create a new selection, in this case one that only contains a few residues of the total structure. A selection is not a copy of part of the structure, but simply a way of organising the information. There is another command, create, which can be used to create copies of parts of the structure.

Following the command select, one types the name of the new selection. The name for the selection can be any name of your choice, but it makes sense to choose something that is easy to remember, such as "active_site", or "substrate".

After the selection name, we supply a comma to indicate that the next argument is about to follow, namely the atoms to include in the selection. The selection is enclosed in a set of outer parentheses, which is necessary to ensure that everything inside the new selection is evaluated before the statement is carried out. The actual algebraic selection statement therefore reads: resi 39 or resi 69 or resi 224

People who are inexperienced with boolean expressions are often confused by the use of the word "or" in the statement, because why do you write or when you want to select atoms in residue 39 and residue 69 and residue 224? The other statement,

resi 39 and resi 69 and resi 224

would in fact select no atoms at all, because you are asking for atoms that satisfy all three conditions, i.e. that each atom is present in resi 39, resi 69 and resi 224 at the same time, which is of course not possible.

So the answer to the question is that there is a difference between the grammatical use of the word "and" and its logical use. The easiest way to understand the statement in a grammatical sense is to prepend the phrase "Give me all atoms that are in" to it. Now "Give me all atoms that are in resi 39 or resi 69 or resi 224" makes much more sense, and it is clear that "Give me all atoms that are in resi 39 and resi 69 and resi 224" will return no atoms at all.
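
The same logic can be tried out with ordinary sets, where "or" corresponds to a union and "and" to an intersection. The Python sketch below builds a toy atom list for the three active site residues using the standard non-hydrogen atom names of Asp, His and Ser. The extra residue (Ala 1) is a made-up filler and is not taken from the 2PKC file.

```python
# Toy structure: residue number -> (residue name, non-hydrogen atom names).
RESIDUES = {
    1: ("ALA", ["N", "CA", "C", "O", "CB"]),  # hypothetical filler residue
    39: ("ASP", ["N", "CA", "C", "O", "CB", "CG", "OD1", "OD2"]),
    69: ("HIS", ["N", "CA", "C", "O", "CB", "CG", "ND1", "CD2", "CE1", "NE2"]),
    224: ("SER", ["N", "CA", "C", "O", "CB", "OG"]),
}

# Every atom is identified by its residue number and atom name.
ATOMS = {(number, name) for number, (_, names) in RESIDUES.items() for name in names}


def resi(number):
    """All atoms that belong to the residue with this number."""
    return {atom for atom in ATOMS if atom[0] == number}


either = resi(39) | resi(69) | resi(224)   # "or": atoms in any of the residues
all_three = resi(39) & resi(69) & resi(224)  # "and": atoms in all residues at once

print(len(either), len(all_three))  # 24 0
```

The union contains 24 atoms, in line with the number reported in the command window for a structure without hydrogens, while the intersection is empty because no atom belongs to more than one residue.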

ASHLC menus

To the right of the object name there are 5 menus: A, S, H, L, and C, which are short for Action, Show, Hide, Label and Colour.

Click on A next to the object "2PKC" with a single click on the left mouse button and select “Remove Waters” and observe that the red crosses disappear from the display.

(It can also be done with the command line: remove /2PKC.pdb/2PKC//HOH)

Click on S next to the object "2PKC" and select “Cartoon” and sheets and helices are shown.

(command line: show cartoon, 2PKC.pdb)

Note that lines showing the bonds are still displayed. These can be removed: Click on H and select Lines for "2PKC" or for the object "all".

(command line: hide lines, 2PKC.pdb)

Click on C and select a nice colour such as "blues" and in the submenu the colour "slate" - observe the effect.

(command line: color slate, 2PKC.pdb)


In the L menu it is possible to add labels to the display. It is an advantage to select only certain atoms or residues to be labelled, as the display will otherwise be crowded with labels.

When you look at the structure now, after it has been displayed as cartoon, it is much easier to inspect. To present the structure even more smoothly you can click in the command line interface on Settings – Cartoon – Smooth loops or write:

set cartoon_smooth_loops, on

We now want to make an illustration of the active site of proteinase K. We have already learned how to select the residues, and a good representation of the active site would be to display the residues as sticks. This can be done either by clicking on S and then sticks next to the active_site object or with the command:

show sticks, active_site

You can use the Tab key to complete commands so you do not have to write them out every time. If you write only s in the command line and then press the Tab key, a list of all possible commands comes up. If you write sh, these two letters are unique to the command "show", and "show" will automatically be completed in the command line.

The residues of the active site are coloured slate, and we will now give them a different colour to better distinguish them from the rest of the structure. Again there are two ways to do this. Either press C – by element – and choose the colours you would like the atoms of these residues to have, or use a command like this:

color tv_green, active_site and elem C

Try now to zoom in on the active site and display a nice view of it. When you zoom in you can see that the Cα atom, which is part of the cartoon representation, is also green. The figure would look much nicer if it had the same colour as the rest of the cartoon representation, namely slate. This can again be changed in two ways: either by double clicking on the Cα atom with the left mouse button and then choosing atom – color – blues – slate, or by writing the command:

color slate, (segi 2PKC.pdb and resi 224 and name CA)

Change the colour of the Cα atom of all 3 residues within the active site.

The C, O and N atoms of the peptide bond are also still displayed for the residues within the active site, and we will remove these to get a clearer picture of the active site. You can either double click on them and choose atom – hide – sticks, or you can write:

hide sticks, (segi 2PKC.pdb and resi 39 and name O)

Remove all the C, O and N atoms that are part of a peptide bond in the residues within the active site. This can also easily be done with a single command (note the parentheses, which keep the selection restricted to the active site):

hide sticks, active_site and (name N or name O or name C)

Now one can really see the advantage of using a command line instead of spending lots of time clicking the individual atoms away :-)

If you like the traditional ball-and-stick representation you can also make it in PyMOL with the commands:

set stick_ball, on
set stick_ball_ratio, 1.3

The stick_ball_ratio can in principle take any value from 0 upwards; however, making it much larger than 5 does not make any sense.

If you would like to know the distance between two atoms, this can be found with the "distance" command, but you need to know the exact names of the two atoms for which you want to measure the interatomic distance. These can easily be found if you just click on the respective atoms (PyMOL prints the full atom identifier). Then write the command: distance atom1, atom2 like:

distance /2PKC//HIS`69/NE2, /2PKC//SER`224/OG
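To make clear what the distance command reports, the following is a minimal Python sketch, independent of PyMOL, that reads the coordinates of two atoms from PDB-format ATOM lines and computes the Euclidean distance between them. The helper names (make_atom_line, atom_coordinates, atom_distance) and the coordinates are invented for illustration; they are not taken from the 2PKC entry.

```python
import math


def make_atom_line(serial, name, resn, resi, x, y, z):
    """Build a fixed-column PDB ATOM record (helper for this example only)."""
    return "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}".format(
        "ATOM", serial, name, "", resn, "", resi, "", x, y, z)


def atom_coordinates(pdb_lines):
    """Map (residue number, atom name) to (x, y, z) using the PDB column layout."""
    coords = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            resi = int(line[22:26])          # columns 23-26: residue number
            name = line[12:16].strip()       # columns 13-16: atom name
            coords[(resi, name)] = (
                float(line[30:38]),          # x
                float(line[38:46]),          # y
                float(line[46:54]),          # z
            )
    return coords


def atom_distance(pdb_lines, atom1, atom2):
    """Distance in Angstrom between two atoms given as (residue number, atom name)."""
    coords = atom_coordinates(pdb_lines)
    return math.dist(coords[atom1], coords[atom2])


# Made-up coordinates, chosen only so that the arithmetic is easy to follow.
example = [
    make_atom_line(1, "NE2", "HIS", 69, 10.0, 5.0, 2.0),
    make_atom_line(2, "OG", "SER", 224, 12.0, 6.0, 4.0),
]
print(round(atom_distance(example, (69, "NE2"), (224, "OG")), 2))
```

With a real PDB file you would pass the lines of the file instead of the example list; chain identifiers are ignored in this sketch, so it only suits single-chain files.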

How to process your data sets with XDS

By Laure Yatime

Before starting the processing: Get the input file XDS.INP from the beamline to have the correct parameters for the detector. You can find a list of templates for different detectors on the following website (as well as the input files for XSCALE and XDSCONV):

http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_prepare.html

For general documentation on XDS:

http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_files.html

http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_program.html

Processing - 1st step : Find the space group and the cell parameters

Edit the XDS.INP file (with gedit for example) and fill in the following parameters (if not already prefilled by the software for data collection like at ESRF):

- Parameters corresponding to the data set and collection: path to find the frames (NAME_TEMPLATE_OF_DATA_FRAMES=), start angle for the oscillation, oscillation range, wavelength, detector distance, number of frames collected (DATA_RANGE=)

Note: if you don't remember these parameters, in particular the detector distance you used (which is very important for proper indexing), you can open any of your frames with adxv and, in the menu panel on top of the window, choose View and then Image Header. All the information is listed there.

- the resolution range (INCLUDE_RESOLUTION_RANGE=): try to set the high resolution limit as close as possible to the expected value to avoid adding too much noise

- the range of frames used to measure the background (BACKGROUND_RANGE=) → usually 10-20 frames at the beginning of the data set

- the range of frames from which the spots are taken for indexing (SPOT_RANGE=) → because the processing is fast, you should try to include a lot of frames (the first 100 or 200 frames)

- the space group number (SPACE_GROUP_NUMBER=): 0 for the first run

- the beam position (ORGX= and ORGY=): correct it by looking at the image header of any of the frames in the adxv display

- finally, the jobs you want to run: for the first run, the list of jobs must be the following, to find the spots (COLSPOT) and index them (IDXREF) (see http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_program.html for a detailed description of each task):

JOB= XYCORR INIT COLSPOT IDXREF

- remember to specify FRIEDEL'S_LAW= TRUE or FALSE depending on your data
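As a quick overview of the parameters listed above, here is a small Python sketch that formats them as XDS.INP keyword lines for the first run. All values (path, ranges, distance, wavelength, beam position) are placeholders invented for illustration, and the function name first_run_keywords is hypothetical. The output is only a fragment: the detector-specific part of XDS.INP must still come from the beamline template.

```python
def first_run_keywords(params):
    """Format a dict of XDS keywords as 'KEYWORD= value' lines."""
    lines = []
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        lines.append(f"{key}= {value}")
    return "\n".join(lines) + "\n"


# Placeholder values only; replace every one with the values of your own data set.
params = {
    "JOB": "XYCORR INIT COLSPOT IDXREF",                 # first run: find spots and index
    "NAME_TEMPLATE_OF_DATA_FRAMES": "/path/to/frames/mydata_1_????.img",
    "DATA_RANGE": (1, 360),                              # frames collected
    "SPOT_RANGE": (1, 200),                              # frames used for spot finding
    "BACKGROUND_RANGE": (1, 20),                         # frames used for the background
    "STARTING_ANGLE": 0.0,
    "OSCILLATION_RANGE": 0.5,
    "X-RAY_WAVELENGTH": 1.0,
    "DETECTOR_DISTANCE": 250.0,
    "ORGX": 1024.0,
    "ORGY": 1024.0,
    "INCLUDE_RESOLUTION_RANGE": (50.0, 2.0),
    "SPACE_GROUP_NUMBER": 0,                             # 0 = unknown, first run
    "FRIEDEL'S_LAW": "TRUE",
}

print(first_run_keywords(params))
```

Comparing the printed lines with the XDS.INP you got from the beamline is a simple way to check that none of these parameters was forgotten.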

Save and run with:

> xds (or xds_par to use the different processors of the computer in parallel)

Open the output file IDXREF.LP, where all the possible Bravais lattices are listed with a score (QUALITY OF FIT) corresponding to their probability: the correct lattice is the one with the highest symmetry that still has a low score (for example, P1 is always the best solution with a score of 0 but it's almost never your solution!). Here is an example:

 LATTICE-  BRAVAIS-  QUALITY   UNIT CELL CONSTANTS (ANGSTROEM & DEGREES)
 CHARACTER LATTICE   OF FIT        a       b       c  alpha   beta  gamma

 *  44       aP         0.0     84.5    84.6   637.4   90.0   90.1   90.2
 *  31       aP         2.0     84.5    84.6   637.4   90.0   89.9   90.2
 *  33       mP         4.1     84.5    84.6   637.4   90.0   90.1   90.2
 *  14       mC         4.2    119.3   119.8   637.4   90.0   90.1   89.9
 *  35       mP         5.5     84.5    84.6   637.4   90.0   90.1   90.2
 *  10       mC         6.2    119.3   119.8   637.4   90.0   90.1   90.1
 *  34       mP         6.9     84.5   637.4    84.6   90.0   90.2   90.1
 *  13       oC         8.4    119.3   119.8   637.4   90.0   90.1   89.9
 *  32       oP         8.9     84.5    84.6   637.4   90.0   90.1   90.2
 *  11       tP        10.4     84.5    84.6   637.4   90.0   90.1   90.2
    37       mC       247.2   1277.5    84.5    84.6   90.2   90.0   86.3
    41       mC       248.6   1277.5    84.5    84.6   90.2   90.0   86.3
    36       oC       249.3     84.5  1277.5    84.6   90.0   90.2   93.7
    28       mC       250.3     84.5  1277.5    84.6   90.0   89.8   86.3
    40       oC       250.7     84.5  1277.5    84.6   90.0   90.2   93.7
    30       mC       251.7     84.5  1277.5    84.6   90.0   89.8   86.3
    39       mC       252.3    188.6    84.6   637.4   90.0   90.1   63.6
    29       mC       254.3     84.5   189.4   637.4   90.0   90.1   63.3
    38       oC       254.9     84.5   188.8   637.4   89.9   90.1  116.3
    12       hP       256.4     84.5    84.6   637.4   90.0   90.1   90.2
    42       oI       495.8     84.5    84.6  1280.2   93.7   93.7   90.2
    15       tI       496.5     84.5    84.6  1280.2   86.3   86.3   90.2
    27       mC       499.2    189.4    84.5   643.0   90.1   96.8   63.3
    17       mC       501.8    119.3   119.8   637.4   90.0   90.1   89.9
    26       oF       622.1     84.5   189.4  1277.5   88.3   93.7  116.7
     9       hR       748.6     84.5   119.8  1915.8   90.0   92.4  135.1
    16       oF       992.9    119.8   119.3  1280.2   95.3   90.0   89.9
    43       mI       995.7    119.8  1280.2    84.6   93.7  135.2   90.0
     1       cF       999.0    648.5   648.3   648.6  165.0   21.3  165.0
     2       hR       999.0    119.8   642.9   648.6  163.2   90.0   95.3
     3       cP       999.0     84.5    84.6   637.4   90.0   90.1   90.2
     5       cI       999.0    643.0   119.3   642.9   84.8   10.7   84.7
     4       hR       999.0    119.8   643.1   648.3  163.2   90.0   95.4
     6       tI       999.0    643.0   642.9   119.3   84.8   84.7   10.7
     7       tI       999.0    642.9   119.3   643.0   84.7   10.7   84.8
     8       oI       999.0    119.3   642.9   643.0   10.7   84.7   84.8
    18       tI       999.0    643.0   648.5    84.5   82.6   89.9  163.1
    19       oI       999.0     84.5   643.0   648.5   16.9   82.6   90.1
    20       mC       999.0    643.0   643.0    84.5   90.0   90.1  164.9
    21       tP       999.0     84.6   637.4    84.5   90.1   90.2   90.0
    22       hP       999.0     84.5   637.4    84.6   90.0   90.2   90.1
    23       oC       999.0    643.0   643.0    84.5   90.0   90.1   15.1
    24       hR       999.0    664.6   643.0    84.5   90.0   82.8   23.5
    25       mC       999.0    643.0   643.0    84.5   90.0   90.1   15.1

Here, the correct Bravais lattice is very clear, it is tP, which encompasses the following space groups:

tP [75,P4] [76,P4(1)] [77,P4(2)] [78,P4(3)] [89,P422] [90,P42(1)2] [91,P4(1)22] [92,P4(1)2(1)2] [93,P4(2)22] [94,P4(2)2(1)2] [95,P4(3)22] [96,P4(3)2(1)2]

You then have to process with the different space groups to figure out which one is the correct one (see later on)
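The rule of thumb used above (take the highest-symmetry lattice that still has a low QUALITY OF FIT) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cut-off of 20 for an acceptable score and the ranking of crystal systems by their first letter are choices made for this example, not values prescribed by XDS, and the function names are hypothetical.

```python
# Crystal systems ranked from lowest to highest symmetry (first letter of the Bravais symbol).
SYMMETRY_RANK = {"a": 1, "m": 2, "o": 3, "t": 4, "h": 5, "c": 6}


def parse_lattice_rows(text):
    """Extract (character, Bravais lattice, quality, cell) rows from the IDXREF.LP table."""
    rows = []
    for line in text.splitlines():
        fields = line.replace("*", " ").split()
        if len(fields) != 9:
            continue
        try:
            rows.append({
                "character": int(fields[0]),
                "bravais": fields[1],
                "quality": float(fields[2]),
                "cell": tuple(float(x) for x in fields[3:9]),
            })
        except ValueError:
            continue  # header or other non-data line
    return rows


def best_lattice(rows, max_quality=20.0):
    """Highest-symmetry lattice among those with quality of fit <= max_quality (assumed cut-off)."""
    candidates = [r for r in rows if r["quality"] <= max_quality]
    return max(candidates, key=lambda r: (SYMMETRY_RANK[r["bravais"][0]], -r["quality"]))


# A few rows copied from the example table above.
sample = """
 LATTICE-  BRAVAIS-  QUALITY   UNIT CELL CONSTANTS (ANGSTROEM & DEGREES)
 *  44       aP         0.0     84.5    84.6   637.4   90.0   90.1   90.2
 *  33       mP         4.1     84.5    84.6   637.4   90.0   90.1   90.2
 *  32       oP         8.9     84.5    84.6   637.4   90.0   90.1   90.2
 *  11       tP        10.4     84.5    84.6   637.4   90.0   90.1   90.2
    12       hP       256.4     84.5    84.6   637.4   90.0   90.1   90.2
     3       cP       999.0     84.5    84.6   637.4   90.0   90.1   90.2
"""

choice = best_lattice(parse_lattice_rows(sample))
print(choice["character"], choice["bravais"], choice["quality"])
```

On the rows above this selects lattice character 11 (tP), in agreement with the choice made by eye; always look at the table yourself as well, since a sensible cut-off depends on the data.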

Processing – 2nd step : Integrate the data to get your Structure Factors and the data statistics

Once you know the space group, reopen XDS.INP and this time specify the space group number and the cell parameters given by IDXREF.LP.

Note that you have to give the appropriate cell parameters for the space group that you chose. For example, if the 3 angles must be 90° in the space group you specify, you have to force them to 90° even if IDXREF.LP gives a 90.1 or 89.9 value. In the same way, if you have a P4 space group, theoretically a=b, so you have to force them to be the same value. In the example given above, IDXREF.LP gave the following solution which corresponds to a P4 or P422 space group with possible screw axis:

11 tP 10.4 84.5 84.6 637.4 90.0 90.1 90.2

When rerunning xds, you have to enter the following cell parameters (i.e. forcing a=b and all 3 angles to 90°, otherwise XDS will complain):

84.5 84.5 637.4 90.0 90.0 90.0

Once this is done, change the JOB list to:

JOB= ALL

and run again:

> xds_par

Note: sometimes, at this stage of the processing, xds stops after IDXREF with an error message saying that the percentage of indexed reflections is too low, and it suggests that you rerun xds after changing JOB= ALL to JOB= DEFPIX INTEGRATE CORRECT. Just follow this advice and rerun xds with:

JOB= DEFPIX INTEGRATE CORRECT

The integration should then be performed correctly.
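The forcing of the cell parameters described earlier in this step can be written down as a small rule per crystal system. The Python sketch below is an illustration only: the function name constrain_cell is hypothetical, it covers just the orthogonal-axis systems, and it simply keeps the value of a when a=b is required (as done in the example above), rather than averaging.

```python
def constrain_cell(cell, bravais):
    """Return cell parameters idealised for the crystal system of a Bravais symbol.

    cell is (a, b, c, alpha, beta, gamma); bravais is e.g. "tP" or "oC".
    Only orthorhombic, tetragonal and cubic lattices are handled in this sketch.
    """
    a, b, c, _alpha, _beta, _gamma = cell
    system = bravais[0]
    if system == "o":      # orthorhombic: all angles 90
        return (a, b, c, 90.0, 90.0, 90.0)
    if system == "t":      # tetragonal: a = b, all angles 90
        return (a, a, c, 90.0, 90.0, 90.0)
    if system == "c":      # cubic: a = b = c, all angles 90
        return (a, a, a, 90.0, 90.0, 90.0)
    raise ValueError(f"crystal system {system!r} is not handled in this sketch")


# The tP solution from the IDXREF.LP example.
idxref_cell = (84.5, 84.6, 637.4, 90.0, 90.1, 90.2)
print(constrain_cell(idxref_cell, "tP"))
```

For the tP solution this gives the six values 84.5, 84.5, 637.4, 90, 90, 90, the same cell as entered by hand above.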

Look at the CORRECT.LP file (statistics at the end of the file, for the whole set of frames; use the first table, with signal/noise >= -3.0, to take into account all the reflections) to see the completeness, the I/σ(I) and the Rsym (and possibly the anomalous signal). Usually, the CORRECT.LP file is not enough to evaluate the quality of your data set and you need to scale it first with xscale.

The CORRECT.LP file does, however, give some important information. First, at the bottom of the file, you can find a list of reflections that do not obey the Wilson distribution. These reflections should be excluded. To do that, copy the list of numbers into a new file called REMOVE.HKL (just the list of numbers). When you run the next step, it will automatically look for this file and remove from the scaling every reflection contained in REMOVE.HKL.

The CORRECT.LP file also lets you determine whether there are any screw axes in your space group (i.e. is it P422 or P4322 or P43212, for example). For that, look at the statistics corresponding to the entire range of frames (near the bottom of the file, STATISTICS OF SAVED DATA SET "XDS_ASCII.HKL"). There you find a list of reflections on special positions that will show extinctions if there are screw axes (REFLECTIONS OF TYPE H,0,0 0,K,0 0,0,L OR EXPECTED TO BE ABSENT (*)).

Here is the list for the example treated above:

    H    K    L  RESOLUTION    INTENSITY        SIGMA  INTENSITY/SIGMA  #OBSERVED

    0    0   12      53.248   0.1646E+06   0.1362E+05            12.09       1
    0    0   13      49.152   0.1970E+02   0.1702E+01            11.58       4*
    0    0   14      45.641  -0.1508E+00   0.1308E+01            -0.12       4*
    0    0   15      42.599   0.1979E+02   0.1792E+01            11.04       4*
    0    0   16      39.936   0.1973E+06   0.1632E+05            12.09       1
    0    0   17      37.587   0.2373E+02   0.2012E+01            11.80       4*
    0    0   18      35.499   0.1699E+01   0.1547E+01             1.10       4*
    0    0   19      33.630   0.2340E+02   0.2113E+01            11.07       4*
    0    0   20      31.949   0.8173E+05   0.3904E+04            20.93       3
    0    0   21      30.428   0.1747E+02   0.2042E+01             8.56       4*
    0    0   22      29.045   0.2697E+01   0.1836E+01             1.47       4*
    0    0   23      27.782   0.5496E+01   0.1971E+01             2.79       4*
    0    0   24      26.624   0.2468E+05   0.1023E+04            24.12       4
    0    0   25      25.559   0.6575E+01   0.2073E+01             3.17       4*
    0    0   26      24.576  -0.1965E+00   0.2119E+01            -0.09       4*
    0    0   27      23.666   0.1154E+01   0.2225E+01             0.52       4*
    0    0   28      22.821   0.2150E+04   0.8924E+02            24.10       4
    0    0   29      22.034   0.2146E+01   0.2344E+01             0.92       4*
    0    0   30      21.299  -0.2861E+01   0.3316E+01            -0.86       2*
    0    0   31      20.612   0.6759E+00   0.3449E+01             0.20       2*
    0    0   32      19.968   0.2434E+02   0.4180E+01             5.82       2
    0    0   33      19.363  -0.6267E+00   0.2662E+01            -0.24       4*
    0    0   34      18.794   0.8045E+00   0.2672E+01             0.30       4*
    0    0   35      18.257   0.3270E+01   0.2632E+01             1.24       4*
    0    0   36      17.749   0.2083E+05   0.1219E+04            17.09       2
    0    0   37      17.270   0.8438E+01   0.4689E+01             1.80       2*
    0    0   38      16.815   0.1309E+01   0.2890E+01             0.45       3*
    0    0   39      16.384   0.2564E+01   0.3246E+01             0.79       4*
    0    0   40      15.974   0.5776E+04   0.2394E+03            24.12       4
    1    0    0      84.800  -0.3715E+01   0.4736E+00            -7.85       3*
    2    0    0      42.400   0.9029E+04   0.3762E+03            24.00       4
    3    0    0      28.267   0.1597E+00   0.8548E+00             0.19       6*
    4    0    0      21.200   0.2850E+03   0.1013E+02            28.14       6
    5    0    0      16.960  -0.2005E+01   0.1413E+01            -1.42       6*
    6    0    0      14.133   0.1975E+02   0.2015E+01             9.80       6
    7    0    0      12.114   0.3979E+01   0.1918E+01             2.07       6*
    8    0    0      10.600   0.7510E+03   0.3152E+02            23.82       4
    9    0    0       9.422   0.2947E+02   0.3103E+01             9.50       5*
   10    0    0       8.480   0.3593E+03   0.2055E+02            17.48       3
   11    0    0       7.709   0.1036E+02   0.4465E+01             2.32       5*
   12    0    0       7.067   0.4684E+02   0.5972E+01             7.84       4
   13    0    0       6.523  -0.7902E+01   0.5863E+01            -1.35       4*
   14    0    0       6.057   0.5412E+01   0.6091E+01             0.89       4
   15    0    0       5.653  -0.6516E+00   0.6501E+01            -0.10       4*
   16    0    0       5.300  -0.6888E+01   0.7034E+01            -0.98       4
   17    0    0       4.988   0.1943E+01   0.7131E+01             0.27       4*
   18    0    0       4.711   0.4183E+01   0.1057E+02             0.40       2
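One way to read such a list is to group the axial reflections by their index modulo the order of the suspected screw axis and compare the average intensities of the groups. For a 4-fold axis along c, a 4(1) or 4(3) screw leaves only l = 4n present, a 4(2) screw leaves l = 2n, and a pure rotation axis leaves all classes present; for a 2-fold axis, a 2(1) screw leaves only even indices. The Python sketch below illustrates this. The function names and the 1% cut-off relative to the strongest class are assumptions made for this example, not part of XDS. Intensities are compared rather than INTENSITY/SIGMA, because in this data set several very weak reflections still have a high INTENSITY/SIGMA.

```python
from collections import defaultdict


def present_classes(axial, order, fraction=0.01):
    """Index classes (index mod order) whose mean intensity exceeds
    `fraction` of the strongest class mean (assumed cut-off).

    axial is a list of (index, intensity) pairs along one axis.
    """
    groups = defaultdict(list)
    for index, intensity in axial:
        groups[index % order].append(intensity)
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    strongest = max(means.values())
    return {k for k, mean in means.items() if mean > fraction * strongest}


def describe_fourfold(classes):
    """Translate the present classes of 0,0,l reflections into a screw-axis guess."""
    if classes == {0}:
        return "4(1) or 4(3)"
    if classes == {0, 2}:
        return "4(2)"
    if classes == {0, 1, 2, 3}:
        return "no screw component"
    return "unclear"


# (l, intensity) for 0,0,12 to 0,0,20 and (h, intensity) for 1,0,0 to 8,0,0,
# copied from the list above.
refl_00l = [(12, 0.1646e6), (13, 19.70), (14, -0.1508), (15, 19.79), (16, 0.1973e6),
            (17, 23.73), (18, 1.699), (19, 23.40), (20, 0.8173e5)]
refl_h00 = [(1, -3.715), (2, 9029.0), (3, 0.1597), (4, 285.0),
            (5, -2.005), (6, 19.75), (7, 3.979), (8, 751.0)]

print(describe_fourfold(present_classes(refl_00l, 4)))
print("2(1)" if present_classes(refl_h00, 2) == {0} else "no 2(1) screw or unclear")
```

This kind of check only says whether extinctions are present; it cannot distinguish 4(1) from 4(3), which give identical extinctions.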

You can find the extinction conditions for each type of space group on the following webpage:

http://www.mazepath.com/uncleal/crystal.htm

For example, in our test case, the 0,0,l reflections indicate a possible screw axis along the 4-fold axis. Indeed, extinctions are encountered for the following reflections: 0,0,14 ; 0,0,18 ; 0,0,22 ; 0,0,26 etc., in other words all the 0,0,4n+2 reflections, suggesting a 4(3) or 4(1) screw axis. The h,0,0 reflections also suggest a 2(1) screw axis along one of the 2-fold axes. It is sometimes tricky to evaluate which screw axis it is, but you can at least say whether there are extinctions or not. You will then get the final answer when solving the structure (for example, you can ask during the molecular replacement to look for solutions in each space group: P43212, P41212, ... and only the correct one will give you a clear solution).

Note: you don't need to have the correct screw axis to scale your data; this will not make any difference in the overall quality of the statistics (i.e. if you process and scale in P43212 or P41212, you will get the exact same statistics). It only matters for solving the structure.

SCALING with XSCALE:

Edit the XSCALE.INP file (found on the beamline or on the web) and fill in the following parameters: resolution shells, space group number, cell parameters, name of the output file, path to the input file XDS_ASCII.HKL (usually in the same directory), FRIEDEL'S_LAW and resolution range.

Note: the XPARM.XDS file contains the cell parameters and is constantly updated with refined parameters, so you can use this file to fill in the refined cell parameters in XSCALE.INP.

Run xscale with:

> xscale

Look at the scaled statistics in XSCALE.LP. Depending on the result, you can rerun xscale but cutting the resolution to a lower limit (or higher !! let’s be optimistic).

Processing – 3rd step : Convert your intensities file to an mtz file with the Structure Factors

The conversion is done in 2 steps, the first one using xdsconv. Edit the XDSCONV.INP file (found on the beamline or on the web) and fill in the following parameters: space group number, cell parameters, name of the input file (output from xscale), name of the output file, format of the output file (IALL, CCP4, CNS, ...), FRIEDEL'S_LAW and resolution range.

!*****************************************************************************
! EXAMPLE of XSCALE.INP for scaling MAD data sets
!
! Characters in a line to the right of an exclamation mark are comment.
!*****************************************************************************
MAXIMUM_NUMBER_OF_PROCESSORS=16 !<33;ignored by single cpu version of xscale
RESOLUTION_SHELLS= 10 6 5 4 3 2.7 2.6 2.5
SPACE_GROUP_NUMBER=5
UNIT_CELL_CONSTANTS=228.366 69.910 113.550 90.000 97.570 90.000
!REIDX=-1 0 0 0 0 -1 0 0 0 0 -1 0
!0-DOSE_SIGNIFICANCE_LEVEL=0.10
OUTPUT_FILE=xscale.hkl
FRIEDEL'S_LAW=FALSE !TRUE
INPUT_FILE= XDS_ASCII.HKL XDS_ASCII 20 2.5

The IALL format creates a file with intensities; after F2MTZ, you have to run truncate to get the Structure Factors. The CCP4 format creates a file compatible with CCP4 that directly contains the Structure Factors, so after F2MTZ you can directly run MR, for example with the PHASER module from PHENIX.

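The second step consists of running f2mtz to get an mtz file. Create a new file called f2mtz.sh containing the f2mtz and cad commands (you can copy the last lines of the XDSCONV log; the contents of f2mtz.sh are reproduced after the XDSCONV.INP example in this section).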

Give the appropriate name to the output_file and run with:

> sh f2mtz.sh

An output_file_name.mtz file is created, containing the I or F.

If you chose a format in XDSCONV.INP that gives you the intensities, you need to run truncate to get the Structure Factors

!*****************************************************************************
! Program XDSCONV converts XDS-files to serve as input to structure solution
! and refinement program packages.
! XDSCONV can produce files for the following target software-packages :
!   SHELX
!   CNS
!   CCP4      output already in F's as required by the CCP4 package
!   IALL      output in I's as required by 'truncate' of the CCP4 package
!   FALL
!   XtalView
! For details see chapter 'XDSCONV:Output formats' of the XDS documentation.
!
! EXAMPLE of XDSCONV.INP for converting MAD data sets to CNS input format
!
!*****************************************************************************
SPACE_GROUP_NUMBER=5
UNIT_CELL_CONSTANTS=228.366 69.910 113.550 90.000 97.570 90.000
INPUT_FILE=xscale.hkl XDS_ASCII 20 2.5
!This omits reflections with I/sigma(I) below a specified cut-off
!NEGATIVE_INTENSITY_CUTOFF=-3
!File-name, target software-package, and validity of Friedel's law
OUTPUT_FILE=ip.hkl CCP4 FRIEDEL'S_LAW=FALSE !TRUE

Contents of f2mtz.sh:

f2mtz HKLOUT temp.mtz<F2MTZ.INP
cad HKLIN1 temp.mtz HKLOUT output_file_name.mtz<<EOF
LABIN FILE 1 ALL
END
EOF

Example truncate script:

truncate HKLIN ip.mtz HKLOUT Fdata.mtz <<END-truncate >> truncate.log
LABIN IMEAN=I SIGIMEAN=SIGI
LABOUT F=FP SIGF=SIGFP
SYMMETRY 5
NRES 1268
TRUNCATE NO
END
END-truncate

Note: NRES corresponds to the number of residues in the asymmetric unit (for example, with 4 molecules per asymmetric unit, NRES = 4 x number of residues in your protein).

Note 2: the truncate.log file contains the Wilson plot of the intensities, and it also indicates the solvent percentage in the crystal and the average B factor.

Some useful www links:

Academia

www.bioxray.au.dk/
Center for Structural Biology – contact us for a great deal on a crystal structure!

www-structmed.cimr.cam.ac.uk/
Structural Medicine at Cambridge University (nice introduction to protein crystallography theory too)

www.ccp4.ac.uk/ccp4i_main.php
The Collaborative Computational Project Number 4 in Protein Crystallography

www.csb.yale.edu/
Center for Structural Biology – Yale University

http://www.esrf.eu/about/synchrotron-science
Synchrotrons – yes baby...

Tutorials

www.ruppweb.org/Xray/101index.html
Crystallography tutorial

http://www.johnloomis.org/ece563/notes/fourier/duck/fourier.html
Fourier tutorial

Important Linux commands

Command                  Function

pwd                      print the full pathname of the current working directory
ls                       display list of files
ls *.pdb                 display all files in the current directory ending with .pdb
cd [NAME]                change to subdirectory [NAME]
cd                       change to home directory
cd ..                    change to the directory one level up
mkdir [NAME]             create directory [NAME]
mv [NAME1] [NAME2]       rename file [NAME1] to [NAME2]
rm [NAME]                delete (remove) file [NAME]
rm *.txt                 delete all files in the current directory ending with .txt
rm -r [NAME]             delete directory [NAME]
cp [NAME1] [NAME2]       copy [NAME1] to [NAME2]
cp [NAME1] [NAME2] .     copy [NAME1] and [NAME2] into the current directory
top                      display and update information about the top cpu processes
kill [NUMBER]            terminate the process with process number [NUMBER]
slogin [COMPUTER]        secure login to another computer [COMPUTER]
more [NAME1]             display contents of file [NAME1]
emacs [NAME1]            open the file [NAME1] in the EMACS editor
emacs &                  open the EMACS editor but put the job in the background so you can continue to use the terminal window