
EMBO Practical Course CEM3DIP 2020
Tutorial 10: Flexible fitting (MDFit and Ensemble fittings)

Tutorials on molecular dynamics simulations of biological molecules and applications for flexible fitting into cryo-EM data

Florence Tama and Osamu Miyashita (Nagoya University and RIKEN)
30 January 2020

This tutorial first covers the basic concepts of molecular dynamics simulations with a few practical examples, and then demonstrates cryo-EM flexible fitting applications. GENESIS, a molecular dynamics simulation package for biomolecular systems developed by the RIKEN Sugita Team, is the main tool used in this tutorial.
https://www.r-ccs.riken.jp/labs/cbrt/

This tutorial is based on the GENESIS Tutorials developed by T. Mori and other Sugita Team members, with modifications and additions to match the topic of this course.
https://www.r-ccs.riken.jp/labs/cbrt/tutorials2019/

0. Preface

In these tutorials, we will do several interactive tasks as well as a few small parallel calculations (a few minutes long each). Please open at least two terminal windows. Use one for interactive work (VMD, gnuplot). In this terminal, run

    $ export PATH=$PATH:/opt/ohpc/pub/apps/vmd-1.9.4a38/bin/

Then check that typing

    $ vmd

opens the VMD windows. Use the other terminal for the calculations on the cluster (Situs and GENESIS).

    # login to a cluster node
    $ qsub -I -q parallel -l nodes=1:ppn=12
    # then cd to the working directory
    $ cd /lustrefs0/workshop/userXXX/prac10genesis/

1. Preparation of GENESIS for Tutorials

First of all, let’s make a directory for the GENESIS tutorials under your work directory by typing the following commands.


    # Make a directory for GENESIS
    $ cd /lustrefs0/workshop/userXXX/prac10genesis
    $ pwd
    /lustrefs0/workshop/userXXX/prac10genesis
    $ mkdir GENESIS
    $ cd GENESIS

Compilation of GENESIS is relatively simple because it does not use many external software libraries (except for parallelization and GPU options). For installation, please refer to https://www.r-ccs.riken.jp/labs/cbrt/installation/. For this tutorial session, the binary is already installed under /opt/ohpc/pub/apps/genesis-1.4.0_installed. Let’s simply create a link for convenience in the later tutorials.

    $ ln -s /opt/ohpc/pub/apps/genesis-1.4.0_installed/bin/ .
    $ ls
    bin
    $ ls ./bin
    atdyn               energy_analysis      pcavec_drawer    rpath_generator
    avecrd_analysis     flccrd_analysis      pcrd_convert     rst_convert
    comcrd_analysis     fret_analysis        pmf_analysis     rst_upgrade
    crd_convert         hb_analysis          prjcrd_analysis  spdyn
    diffusion_analysis  kmeans_clustering    prst_setup       tilt_analysis
    distmat_analysis    lipidthick_analysis  qmmm_generator   trj_analysis
    drms_analysis       mbar_analysis        qval_analysis    wham_analysis
    dssp_interface      meanforce_analysis   remd_convert
    eigmat_analysis     msd_analysis         rg_analysis
    emmap_generator     pathcv_analysis      rmsd_analysis

Then let’s set up a few directories.

    # Make a directory for Tutorials
    $ mkdir Tutorials
    # Make a directory for Other tools
    $ mkdir Others

Now we have the following directories. All tutorials will be done in the “Tutorials” directory.

    /home/user
      + GENESIS
          + bin        # Symbolic link to the binary files of GENESIS
          + Tutorials  # All tutorials are done in this directory
          + Others

In the Others directory, we put the force field parameters (see Tutorial 2.1).


2.1 Preparation of the force field parameters

What’s needed to run simulations?

In order to run MD simulations of proteins, we first need to prepare the initial structure of the target system. In most cases, the three-dimensional (3D) structure of the target protein is available from the web site of the Protein Data Bank (PDB). Since many structures were determined by X-ray crystallography, hydrogen atoms are usually not included in the PDB files. You may also want to solvate the protein in water in the MD simulation. In this case, we have to add hydrogen atoms to the protein heavy atoms, and also put water molecules around the protein to construct the initial structure.

In the MD simulation, the potential energy of the system is calculated as a sum of terms over bonds, angles, dihedral angles, and non-bonded (van der Waals and electrostatic) atom pairs.

This potential energy function is called the “Force Field”. It contains many empirical parameters such as force constants (kb and ka), equilibrium bond length (r0), depth of the potential energy (Vn), atomic charge (q), and so on. If we look at the PDB file, we cannot find such information, and the MD software does not contain it either. Therefore, we need to prepare the force field parameters. Moreover, we need information about the “topology” of the target system. Topology means the “atom connectivity” in the molecules, namely, “which atoms are connected through covalent bonds”. Such information is essential to compute the “summation” in each term of the equation. However, most MD software cannot automatically create the topology from the PDB coordinates due to the complexity of the process. Accordingly, we also need to prepare the topology information of the target system before starting the MD simulation.
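For reference, a potential energy function of this kind is commonly written in the following form. This is only a sketch built from the symbols named above (kb, ka, r0, Vn, q); the remaining symbols (θ0, n, δ, ε_ij, R_min,ij, ε0) and the exact prefactors are assumptions that follow the usual CHARMM-style convention, and individual force fields add further terms.

```latex
E(\mathbf{r}) =
    \sum_{\mathrm{bonds}} k_b \,(r - r_0)^2
  + \sum_{\mathrm{angles}} k_a \,(\theta - \theta_0)^2
  + \sum_{\mathrm{dihedrals}} V_n \left[ 1 + \cos(n\phi - \delta) \right]
  + \sum_{i<j} \epsilon_{ij}
      \left[ \left( \frac{R_{\min,ij}}{r_{ij}} \right)^{12}
           - 2 \left( \frac{R_{\min,ij}}{r_{ij}} \right)^{6} \right]
  + \sum_{i<j} \frac{q_i q_j}{4 \pi \epsilon_0 \, r_{ij}}
```

Each sum runs over a list (bonds, angles, dihedral angles, atom pairs) that has to be supplied together with the coordinates, which is why the topology information is needed in addition to the parameters.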


Download the force field parameters

One of the commonly used parameter sets for biomolecules is the CHARMM force field [1], which was originally developed by the Karplus group at Harvard University. We can obtain the files that contain the force field parameters from the CHARMM group’s web site. Let’s download the latest version of the force fields (CHARMM36m [2]) from the MacKerell lab web site (http://mackerell.umaryland.edu/), and put them in the Others directory.

    # Download the CHARMM parameter and topology files
    $ cd /lustrefs0/workshop/userXXX/prac10genesis/Others
    $ mkdir CHARMM
    $ cd CHARMM
    $ wget http://mackerell.umaryland.edu/CHARMM_ff_params_files/toppar_c36_jul18.tgz
    $ tar -zxvf toppar_c36_jul18.tgz
    $ ls
    toppar  toppar_c36_jul18.tgz

After you uncompress the downloaded file, you can see a lot of files in the directory. Here, .prm and .rtf files are called “parameter” and “residue topology” files, respectively. Among these files, ***_prot.*, ***_na.*, and ***_lipid.* are related to proteins, nucleic acids (na), and lipids, respectively.

    # Check the contents in toppar
    $ cd toppar
    $ ls
    00toppar_file_format.txt  par_all36_cgenff.prm  top_all35_ethers.rtf
    ace                       par_all36_lipid.prm   top_all36_carb.rtf
    cheq                      par_all36_na.prm      top_all36_cgenff.rtf
    drude                     par_all36_prot.prm    top_all36_lipid.rtf
    gbsw                      par_all36m_prot.prm   top_all36_na.rtf
    larmord                   par_hbond.inp         top_all36_prot.rtf
    metals                    param19.inp           top_all36_prot_water_ions.rtf
    non_charmm                rush                  toph19.inp
    openmm_gbsaobc2           silicates             toppar_all.history
    par_all22_prot.prm        stream                toppar_water_ions.str
    par_all35_ethers.prm      tamdfff
    par_all36_carb.prm        top_all22_prot.rtf

Let’s view par_all36m_prot.prm by using the less command. The following shows a part of the parameters for the bond energy term. You can find the bond force constants (3rd column) and equilibrium bond lengths (4th column) for each pair of atom types in the amino acids. Let’s also find the parameters for other energy terms such as angle, dihedral angle, van der Waals, and so on. For a detailed description of the parameter files, see the CHARMM manual (parmfile).

    # View the parameter file for proteins
    $ less par_all36m_prot.prm


    :
    BONDS
    !
    !V(bond) = Kb(b - b0)**2
    !
    !Kb: kcal/mole/A**2
    !b0: A
    :
    CA   CAI  305.000  1.3750 ! from CA CA
    CAI  CAI  305.000  1.3750 ! atm, methylindole, fit CCDSS
    CPT  CA   300.000  1.3600 ! atm, methylindole, fit CCDSS
    CPT  CAI  300.000  1.3600 ! atm, methylindole, fit CCDSS
    CPT  CPT  360.000  1.3850 ! atm, methylindole, fit CCDSS
    :

Let’s view top_all36_prot.rtf. This file contains the atom connectivity (topology) as well as the atomic masses and charges in the amino acids. The following shows a part of the definitions of the atomic masses, and also the topology of alanine (ALA). The mass of each atom type is defined in the lines starting with “MASS”. In the topology information, covalent bonds are defined in the lines starting with “BOND”, where each pair of neighboring atom names is connected. The partial charge of each atom is defined in the 4th column of the ATOM lines. For example, the HA hydrogen has a partial charge of +0.09 e. For a detailed description of the topology files, see the CHARMM manual (rtop).

    # View the topology file for proteins
    $ less top_all36_prot.rtf
    :
    MASS -1 CE1 12.01100 ! for alkene; RHC=CR
    MASS -1 CE2 12.01100 ! for alkene; H2C=CR
    MASS -1 CAI 12.01100 ! aromatic C next to CPT in trp
    MASS -1 N   14.00700 ! proline N
    MASS -1 NR1 14.00700 ! neutral his protonated ring nitrogen
    :
    :
    RESI ALA          0.00
    GROUP
    ATOM N    NH1    -0.47  !     |
    ATOM HN   H       0.31  !  HN-N
    ATOM CA   CT1     0.07  !     |     HB1
    ATOM HA   HB1     0.09  !     |    /
    GROUP                   !  HA-CA--CB-HB2
    ATOM CB   CT3    -0.27  !     |    \
    ATOM HB1  HA3     0.09  !     |     HB3
    ATOM HB2  HA3     0.09  !   O=C
    ATOM HB3  HA3     0.09  !     |
    GROUP                   !
    ATOM C    C       0.51
    ATOM O    O      -0.51
    BOND CB CA   N  HN   N  CA
    BOND C  CA   C  +N   CA HA   CB HB1   CB HB2   CB HB3
    DOUBLE O  C
    :
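As a quick numerical check of the two excerpts above, the sketch below evaluates the harmonic bond term V(bond) = Kb(b - b0)**2 for the CA-CAI parameters and sums the partial charges of the ALA atoms. The function names bond_energy and residue_charge and the trial bond length of 1.40 Å are assumptions made for illustration; they are not part of CHARMM or GENESIS.

```shell
# Harmonic bond energy V = Kb * (b - b0)^2, as defined in the BONDS header
bond_energy() {
  # $1: Kb (kcal/mol/A^2), $2: b0 (A), $3: current bond length b (A)
  awk -v kb="$1" -v b0="$2" -v b="$3" \
      'BEGIN { printf "%.6f\n", kb * (b - b0) ^ 2 }'
}

# Sum of the partial charges (4th column) over the ATOM lines read from stdin
residue_charge() {
  awk '$1 == "ATOM" { q += $4 }
       END { if (q * q < 1e-12) q = 0   # suppress floating-point noise
             printf "%.2f\n", q }'
}

# CA-CAI parameters from the excerpt; 1.40 A is a hypothetical bond length
bond_energy 305.000 1.3750 1.4000

# ATOM lines of ALA copied from the excerpt (comments stripped)
ala='ATOM N   NH1 -0.47
ATOM HN  H    0.31
ATOM CA  CT1  0.07
ATOM HA  HB1  0.09
ATOM CB  CT3 -0.27
ATOM HB1 HA3  0.09
ATOM HB2 HA3  0.09
ATOM HB3 HA3  0.09
ATOM C   C    0.51
ATOM O   O   -0.51'
printf '%s\n' "$ala" | residue_charge
```

The first call prints 0.190625 (kcal/mol) for a bond stretched by 0.025 Å, and the second prints 0.00, which agrees with the total charge given on the “RESI ALA 0.00” line.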


We can also find .str files in the directory. These are called “stream files”, in which the topology and parameters are defined together. toppar_water_ions.str contains the topology and parameters of water and ions.

    # View the stream file for water and ions
    $ less toppar_water_ions.str

In this tutorial, we mainly focused on par_all36m_prot.prm, top_all36_prot.rtf, and toppar_water_ions.str. These three files are essential to perform MD simulations of proteins in water with the CHARMM force field. For the MD simulation of membrane proteins, par_all36_lipid.prm and top_all36_lipid.rtf are additionally needed, since they contain information about lipid molecules. Now we have the parameter and topology files for the CHARMM force field. In the next tutorial, we will learn how to set up the initial structure of the target system.

References

1. A. D. MacKerell, Jr. et al., J. Phys. Chem. B, 102, 3586 (1998).
2. J. Huang et al., Nat. Methods, 14, 71-73 (2017).

2.2 Building the initial structure of the target system

In this tutorial, we learn how to set up the system for the all-atom MD simulation with the CHARMM force field. Since GENESIS does not provide a structure modeling tool, we use VMD for this purpose [1]. We additionally use the psfgen, solvate, and autoionize plugins of VMD. The following figure shows the outline of a general scheme to build the system for soluble proteins. We employ Protein G as an example, and solvate it in 150 mM NaCl solution.

Preparation

Copy the tutorial file (tutorial-2.2.tar.gz). As shown in the above figure, this tutorial consists of five steps: 1) download a PDB file, 2) modify the PDB file, 3) build missing atoms, 4) add water, and 5) add ions. Since we use the CHARMM force field, we make a symbolic link to the CHARMM toppar directory (see Tutorial 2.1).

    # Make a directory of this tutorial
    $ cd Tutorials
    $ cp /lustrefs0/workshop/data/prac10genesis/tutorial-2.2.tar.gz ./
    $ tar -xvzf tutorial-2.2.tar.gz
    $ cd tutorial-2.2
    $ ls
    1_oripdb  2_modpdb  3_psfgen  4_solvate  5_ionize
    # Make a symbolic link to the CHARMM toppar directory
    $ ln -s ../../Others/CHARMM/toppar ./
    $ ls
    1_oripdb  2_modpdb  3_psfgen  4_solvate  5_ionize  toppar

Step1. Download the PDB file of the target protein

Let’s change directory to 1_oripdb, and download a PDB file of Protein G. Here, we use the PDB entry 2QMT, which was solved by X-ray crystallography at 1.05 Å resolution. View the structure by using VMD. We can find the protein atoms (heavy atoms only, no hydrogens), the oxygen atoms of water, and other chemical compounds such as phosphate ion, isopropyl alcohol, and methylpentane diol. These compounds were used for crystallization of the protein in the experiment, but they are not essential in our MD simulations. Now, let’s check the center of mass of the protein by using the “measure center” command in VMD (see below). We can see that it deviates significantly from the origin (0, 0, 0). This offset can be inconvenient when the protein is solvated in water (Step4). In the next step, we will remove the unnecessary chemical compounds from the PDB file, and also shift the center of mass to the origin.

    # Change the working directory
    $ cd 1_oripdb
    # Download the PDB file of protein G
    $ wget https://files.rcsb.org/download/2QMT.pdb
    $ ls
    2QMT.pdb
    # View the PDB structure by using VMD
    # Measure the center of mass of the protein
    $ vmd 2QMT.pdb
    vmd > set sel [atomselect top "protein"]
    vmd > measure center $sel weight mass
    2.618640422821045 15.951170921325684 17.753435134887695


Step2. Modify the original PDB file

In Step2, we modify the PDB file in order to set up the system properly. Let’s change directory to 2_modpdb. You can find “build.tcl” in the directory. View the file by using the less command. This is a Tcl script, in which the VMD commands we want to execute are written. In the script, we 1) read the PDB file, 2) change the residue name “HIS” to “HSD”, 3) change the atom name “CD1” in ILE to “CD”, and 4) change the atom names “O” and “OXT” in the C-terminus to “OT1” and “OT2”, respectively. This renaming is necessary because the CHARMM force fields have their own definitions of the atom and residue names, some of which differ from the general names in the PDB convention. We then 5) select the protein atoms, 6) measure the center of mass (com) of the selected atoms, and 7) shift the com to the origin. Finally, we write the coordinates of the selected atoms (i.e., the protein) to proa.pdb.

    # Change directory
    $ cd ../2_modpdb
    $ ls
    build.tcl
    # View the script
    $ less build.tcl

    # Load the original PDB
    mol load pdb ../1_oripdb/2QMT.pdb

    # Rename "PDB general atom name" to "CHARMM-specific atom name"
    # HIS => HSD (but not included in this protein)
    # CD1 atom of ILE => CD
    # C-terminal carboxyl oxygen O and OXT => OT1 and OT2
    [atomselect top "resname HIS" ] set resname HSD
    [atomselect top "resname ILE and name CD1" ] set name CD
    [atomselect top "chain A and resid 56 and name O" ] set name OT1
    [atomselect top "chain A and resid 56 and name OXT"] set name OT2

    # Measure the center of mass (com) of the selected atoms (protein)
    # Shift the com of the protein to the origin
    set sel [atomselect top "protein"]
    set com [measure center $sel weight mass]
    $sel moveby [vecscale -1.0 $com]

    # Write the modified PDB of the selected atoms
    $sel writepdb proa.pdb
    exit

Let’s execute this script through VMD using the following command. The option “-dispdev text” means that VMD is launched in text mode, namely, the viewer window is not opened. The above script is loaded with “-e build.tcl”, and all commands in the script are automatically executed in VMD. In the log file, you can see what was carried out in VMD. View the obtained PDB file (proa.pdb), and check whether the center of mass was actually shifted.

    # Run the script using VMD
    $ vmd -dispdev text -e build.tcl > log
    $ ls
    build.tcl  log  proa.pdb
    # View the structure by using VMD, and check the center of mass
    $ vmd proa.pdb
    vmd > set sel [atomselect top "all"]
    vmd > measure center $sel weight mass
    -0.0003528242523316294 0.00021110136003699154 0.0004838312161155045
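To make steps 6) and 7) of the script concrete, the following sketch reproduces the same arithmetic outside VMD: a mass-weighted average of the coordinates, followed by a shift of every atom by the negative of that vector. The function names and the two-atom input are hypothetical and only serve as an illustration; the values are not taken from 2QMT.pdb.

```shell
# Mass-weighted centre of the atoms read from stdin (one "mass x y z" per line)
center_of_mass() {
  awk '{ m += $1; sx += $1 * $2; sy += $1 * $3; sz += $1 * $4 }
       END { printf "%.3f %.3f %.3f\n", sx / m, sy / m, sz / m }'
}

# Shift every atom by -com, so that the centre of mass moves to the origin
shift_to_origin() {
  # $1 $2 $3: centre of mass; stdin: "mass x y z" lines
  awk -v cx="$1" -v cy="$2" -v cz="$3" \
      '{ printf "%s %.3f %.3f %.3f\n", $1, $2 - cx, $3 - cy, $4 - cz }'
}

# Hypothetical two-atom system: one carbon and one oxygen
atoms='12.011 1.0 2.0 3.0
15.999 3.0 2.0 1.0'

com=$(printf '%s\n' "$atoms" | center_of_mass)
echo "com: $com"
# $com is left unquoted on purpose so that it splits into three arguments
printf '%s\n' "$atoms" | shift_to_origin $com
```

Because the oxygen is heavier, the center of mass (2.142 2.000 1.858) lies closer to it than the geometric midpoint (2, 2, 2) would. The small residual seen in the VMD output above arises in the same way as here, from the limited precision of the written coordinates.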

In addition, check whether the designated atoms and residues were correctly renamed.

    # Check the renamed atoms
    $ less proa.pdb
    :
    ATOM     53  CG2 ILE A   6      -0.309   1.530  -8.307  1.00 11.81           C
    ATOM     54  CD  ILE A   6       0.907   4.322  -7.613  1.00 22.44           C
    ATOM     55  N   LEU A   7      -1.907  -0.854  -6.370  1.00  9.47           N
    :
    ATOM    431  C   GLU A  56       4.078  -9.678  -9.367  1.00 19.58           C
    ATOM    432  OT1 GLU A  56       3.967  -8.911 -10.297  1.00 18.77           O
    ATOM    433  CB  GLU A  56       2.136 -10.127  -7.893  1.00 14.63           C
    :
    ATOM    437  OE2 GLU A  56       0.400 -11.802  -5.347  1.00 16.91           O
    ATOM    438  OT2 GLU A  56       4.660 -10.856  -9.352  1.00 23.62           O

Step3. Make PDB and PSF files of protein

In Step3, we build the missing atoms (mainly hydrogens) in the target protein, and make PDB and PSF files of the protein by using the psfgen plugin of VMD. PSF stands for “protein structure file”. It contains all information about the whole system, including the atom connectivity, except for the coordinates of the atoms. Let’s change directory to 3_psfgen. Again, there is a script for VMD. In the script, we first load the plugin and the CHARMM36 topology file. We then define a segment name for the protein. In this example, we name it “PROA”. The coordinates of each atom in PROA are read from the PDB file (../2_modpdb/proa.pdb), and the missing atoms are constructed automatically with the “guesscoord” command. Finally, we output the PSF and PDB files.

    # Change directory
    $ cd ../3_psfgen
    $ ls
    build.tcl

    # Load psfgen-plugin and CHARMM topology file
    package require psfgen
    resetpsf
    topology ../toppar/top_all36_prot.rtf

    # Define the segment name as "PROA"
    segment PROA {pdb ../2_modpdb/proa.pdb}

    # Read the coordinates from the PDB file
    coordpdb ../2_modpdb/proa.pdb PROA

    # Guess the coordinates of missing atoms
    guesscoord

    # Generate PDB and PSF files
    writepdb protein.pdb
    writepsf protein.psf
    exit

We execute this script through VMD. Let’s check the obtained PDB file (protein.pdb). Here, we additionally load the PSF file (protein.psf) with the “-psf” option to check whether both files were created correctly. The PSF file contains the information about the atom connectivity in the system. If the PDB and PSF files were not created correctly, strange covalent bonds might be seen in VMD. If the structure looks fine, they should have no problem.

    # Run the script using VMD
    $ vmd -dispdev text -e build.tcl > log
    $ ls
    build.tcl  log  protein.pdb  protein.psf
    # View the structure while reading PSF file
    $ vmd protein.pdb -psf protein.psf

Let’s view the obtained PDB file by using the less command. There are 858 atoms in the system. The segment name “PROA” is written in the 12th or last column.

    # View the processed PDB file
    $ less protein.pdb
    ATOM      1  N   MET A   1      -2.908  11.971   7.927  1.00  0.00      PROA N
    ATOM      2  HT1 MET A   1      -2.554  12.700   8.513  0.00  0.00      PROA
    ATOM      3  HT2 MET A   1      -2.550  11.087   8.229  0.00  0.00      PROA
    ATOM      4  HT3 MET A   1      -3.907  11.963   7.959  0.00  0.00      PROA
    ATOM      5  CA  MET A   1      -2.483  12.211   6.569  1.00  0.00      PROA C

We also check the PSF file with the less command. In the header part, we can see “first NTER” and “last CTER”. Here, “first NTER” indicates that the N-terminus of PROA was capped with NH3+. If you look at the CHARMM topology file, you can find NTER, which is defined as a patch residue (PRES) that creates the NH3 group at the N-terminus. Similarly, the C-terminus was capped with COO− by using the patch residue CTER. We can also see detailed information on all atoms in the system, including the residue name, atom name, atom type, atomic charge, and atomic mass of each atom. In the middle part, the bond connectivities are defined. There are 864 covalent bonds in the system, and the connected atoms are defined by pairs of neighboring atom indexes (e.g., the 1-5, 2-1, and 3-1 pairs correspond to the “N-CA”, “HT1-N”, and “HT2-N” bonds, respectively). The angle and dihedral angle lists are defined in a similar manner. These lists are used for Σbonds, Σangles, and Σdihedrals in the equation in Tutorial 2.1.

    # View the PSF file
    $ less protein.psf
    :
    REMARKS segment PROA { first NTER; last CTER; auto angles dihedrals }
    REMARKS defaultpatch CTER PROA:56
    REMARKS defaultpatch NTER PROA:1
    :
    858 !NATOM
    1 PROA 1 MET N   NH3 -0.300000 14.0070 0
    2 PROA 1 MET HT1 HC   0.330000  1.0080 0
    3 PROA 1 MET HT2 HC   0.330000  1.0080 0
    4 PROA 1 MET HT3 HC   0.330000  1.0080 0
    5 PROA 1 MET CA  CT1  0.210000 12.0110 0
    :
    864 !NBOND: bonds
    1 5 2 1 3 1 4 1
    5 6 7 5 7 8 7 9
    10 7 10 11 10 12 13 10
    :
    1556 !NTHETA: angles
    1 5 6 1 5 18 2 1 5
    2 1 4 2 1 3 3 1 5
    3 1 4 4 1 5 5 18 19
    :

Why do we need to create the PSF file? The PSF file holds the information about the atom connectivity of the “whole” system. In fact, the topology file (top_all36_prot.rtf) does not contain the whole information of the target system, because it is designed to define the topology of proteins in general by treating the 20 amino acids as “fragments”.
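The way the bond list refers back to the atom list can be sketched with a few lines of shell. The example below takes the five atom records and the first line of the !NBOND section from the excerpt above and prints each index pair as a pair of atom names. The function name psf_bond_names and the “@BONDS” separator are assumptions made for this illustration; they are not part of the PSF format or of any VMD or GENESIS tool.

```shell
# Translate PSF bond index pairs into atom-name pairs.
# $1: atom records (index, segment, resid, resname, atom name, ...)
# $2: one line of the !NBOND section (index pairs)
psf_bond_names() {
  printf '%s\n@BONDS\n%s\n' "$1" "$2" | awk '
    $1 == "@BONDS" { in_bonds = 1; next }       # hypothetical separator line
    !in_bonds      { name[$1] = $5; next }      # remember atom name by index
    { for (i = 1; i < NF; i += 2) print name[$i] "-" name[$(i + 1)] }'
}

# Atom records and first bond line copied from the PSF excerpt
atoms='1 PROA 1 MET N   NH3 -0.300000 14.0070 0
2 PROA 1 MET HT1 HC   0.330000  1.0080 0
3 PROA 1 MET HT2 HC   0.330000  1.0080 0
4 PROA 1 MET HT3 HC   0.330000  1.0080 0
5 PROA 1 MET CA  CT1  0.210000 12.0110 0'
bond_line='1 5 2 1 3 1 4 1'

psf_bond_names "$atoms" "$bond_line"
```

This prints N-CA, HT1-N, HT2-N, and HT3-N, one per line, which matches the bonds named in the text. An MD program performs the same kind of lookup when it evaluates the bond sum for every listed pair.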

Step4. Add water

In Step4, we solvate the protein in water. Let’s change directory to 4_solvate. Again, we can find a script. In the script, we load the solvate plugin, set the input PDB and PSF files of the protein, and add water molecules around the protein with a box size of 64 Å × 64 Å × 64 Å. With the “-minmax” option, we specify the minimum and maximum coordinates of the water box. The output file name is defined by the “-o” option.

    # Change directory
    $ cd ../4_solvate
    $ ls
    build.tcl

    # Solvate the protein in 64x64x64 A^3 water box
    package require solvate
    set psffile "../3_psfgen/protein.psf"
    set pdbfile "../3_psfgen/protein.pdb"
    solvate $psffile $pdbfile -minmax {{-32 -32 -32} {32 32 32}} -o wbox
    exit

Let’s execute the script through VMD. We obtain wbox.pdb and wbox.psf. In the PDB file, we can see that the residue name of water is “TIP3”, indicating that the TIP3P water model is used.

    # Run the script using VMD
    $ vmd -dispdev text -e build.tcl > log
    $ ls
    build.tcl  log  wbox.log  wbox.pdb  wbox.psf
    # View the structure by using VMD
    $ vmd wbox.pdb -psf wbox.psf


The box size should be chosen carefully, and it depends on the system. Adequate solvation requires a sufficiently long distance between the protein surface and the edge of the simulation box. One choice is to set this distance longer than the cutoff distance of the non-bonded interaction calculation. However, if your target protein undergoes a large conformational change or unfolding in the simulation, such a choice might not be enough. This is because, under the periodic boundary condition, the protein can interact with itself in the image cells (see Figure (a)). Similarly, if you solvate a protein in a rectangular water box, you must be careful about a related issue: the protein can rotate during the simulation, resulting in interactions between the protein and its own images (see Figure (b)). Such self-interaction artifacts reduce the reliability of the MD simulation.
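The rule of thumb in the paragraph above (keep the distance from the protein surface to the box edge longer than the cutoff) can be turned into a small estimate of the minimum box edge. This is a sketch only: the function name min_box_edge and the 35 Å solute extent are hypothetical, and the result is a lower bound that does not account for rotation or conformational change.

```shell
# Smallest box edge that leaves at least "margin" between the solute and
# each face of the box: edge = extent + 2 * margin
min_box_edge() {
  # $1: largest extent of the solute along any axis (A)
  # $2: required solute-to-edge distance (A), e.g. the non-bonded cutoff
  awk -v extent="$1" -v margin="$2" \
      'BEGIN { printf "%.1f\n", extent + 2 * margin }'
}

# Hypothetical solute of 35 A with a 12 A margin on each side
min_box_edge 35.0 12.0
```

For these assumed numbers the estimate is 59.0 Å, so a 64 Å cubic box would leave some extra room. In practice the extent should be measured from the actual structure, and the longest dimension should be used if the solute can rotate.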


Step5. Add ions

In most cases, we simulate proteins in water with ions, namely, in solution with a certain ionic concentration. If the target protein functions inside the cell, we usually solvate the protein in 150 mM KCl solution [mM = mmol/L (millimoles per liter)], because the concentration of K+ inside the cell is ~150 mM. On the other hand, 150 mM NaCl solution is usually used for proteins that exist outside the cell. To add ions to the system, we use the autoionize plugin of VMD. Let’s change directory to 5_ionize. In the directory, we have a script, which is used to add sodium ions (Na+: SOD) and chloride ions (Cl−: CLA) to the system at random positions. The total number of each ion is automatically adjusted to reproduce the designated ionic concentration (0.15 M). If you want to add potassium ions (K+), specify “-cation POT” in the script. For other cases, see the user manual of the autoionize plugin.

    # Change directory
    $ cd ../5_ionize
    $ ls
    build.tcl

    # Add ions in the system (Salt concentration: 150 mM NaCl)
    package require autoionize
    set psffile "../4_solvate/wbox.psf"
    set pdbfile "../4_solvate/wbox.pdb"
    autoionize -psf $psffile -pdb $pdbfile -sc 0.15 -cation SOD -anion CLA
    exit

Eventually, we obtain ionized.pdb and ionized.psf.

    # Run the script using VMD
    $ vmd -dispdev text -e build.tcl > log
    $ ls
    build.tcl  ionized.pdb  ionized.psf  log
    # View the obtained structure by using VMD
    $ vmd ionized.pdb -psf ionized.psf
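To get a feeling for how many ions a 0.15 M concentration means at this scale, the sketch below converts a concentration and a cubic box edge into a number of ion pairs, N = c × N_A × V. The function name ion_pairs_for_box is hypothetical, and the estimate uses the full volume of the 64 Å box from Step4. The count chosen by the autoionize plugin can differ, for example because it also neutralizes the net charge of the protein and because the protein occupies part of the box.

```shell
# Estimated number of ion pairs for a cubic box at a given salt concentration
ion_pairs_for_box() {
  # $1: concentration (mol/L), $2: box edge (A)
  awk -v conc="$1" -v edge="$2" 'BEGIN {
    avogadro = 6.02214076e23            # particles per mol
    volume_l = edge ^ 3 * 1e-27         # 1 A^3 = 1e-27 L
    printf "%.0f\n", conc * avogadro * volume_l
  }'
}

# 0.15 M salt in the 64 A box used in this tutorial
ion_pairs_for_box 0.15 64
```

The result is about 24 ion pairs, which shows that even a physiological salt concentration corresponds to only a few dozen ions in a system of this size.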


Summary

In the previous tutorial, we downloaded the CHARMM parameter and topology files. Here, we learned how to build the initial structure of the target system by using VMD/PSFGEN to create the PDB and PSF files. Other tools such as CHARMM [2] or CHARMM-GUI [3] can also be used to create PDB and PSF files. We are now ready to start the MD simulation. These four files (PDB, PSF, parameter, and topology) are used as the inputs of GENESIS. In Tutorial 3, we will learn how to execute GENESIS using the system constructed here.

References

1. W. Humphrey et al., J. Mol. Graph., 14, 33-38 (1996).
2. B. R. Brooks et al., J. Comput. Chem., 30, 1545-1614 (2009).
3. S. Jo et al., J. Comput. Chem., 29, 1859-1865 (2008).

3 MD simulation of Protein G in NaCl solution with the CHARMM force field

Preparation

Let’s copy the tutorial file (tutorial-3.1.tar.gz). This tutorial consists of five steps: 1) system setup, 2) energy minimization, 3) equilibration, 4) production run, and 5) trajectory analysis. Control files for GENESIS are already included in the tutorial file. Since we use the CHARMM36m force field parameters [1], we make a symbolic link to the CHARMM toppar directory (see Tutorial 2.1).

    $ cd Tutorials
    $ cp /lustrefs0/workshop/data/prac10genesis/tutorial-3.1.tar.gz ./
    $ tar -xvzf tutorial-3.1.tar.gz
    $ cd tutorial-3.1
    $ ln -s ../../Others/CHARMM/toppar ./
    $ ls
    1_setup     3_equilibrate  5_analysis  toppar
    2_minimize  4_production


1. Setup

We simulate Protein G with the all-atom model in explicit solvent. In Tutorial 2.2, we obtained ionized.pdb and ionized.psf as the input files for GENESIS. The total number of atoms in the system is 24,552.

    # Change directory for the system setup
    $ cd 1_setup
    # Link to the files from Tutorial 2.2
    $ ln -s ../../tutorial-2.2/* .
    $ ls
    1_oripdb  2_modpdb  3_psfgen  4_solvate  5_ionize  toppar

2. Minimization

First, we carry out a 2,000-step energy minimization with the steepest descent (SD) algorithm. In most cases we need energy minimization before starting the MD simulation in order to remove atomic clashes in the initial structure. We use the CHARMM C36m force field parameters. The particle mesh Ewald method (PME) [2] is employed for the long-range interaction calculation with a maximum grid spacing of 1.2 Å. We use a cut-off distance of 12.0 Å for the non-bonded interactions, and a switching distance of 10.0 Å for the van der Waals interactions. Let’s change directory to perform the energy minimization, and view the control file of GENESIS.

    # Change directory for the energy minimization
    $ cd ../2_minimize
    # View the control file
    $ less INP

In the control file, we specify the topology and parameter files of the CHARMM C36m force field in the [INPUT] section. In addition, the PSF and PDB files that were created in Step 1 are specified. In the [OUTPUT] section, the file name for the coordinate trajectory data that will be obtained from the calculation is set by “dcdfile”.
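As a rough illustration of what a maximum grid spacing implies, the sketch below computes the smallest number of PME grid points along one axis for which the spacing does not exceed the limit. The function name min_pme_grid is hypothetical, and the box length of 50.2 Å is the value set in the [BOUNDARY] section of this control file. The number that GENESIS actually selects may be larger (for example to suit its FFT routines), so the result should be read as a lower bound only.

```shell
# Smallest grid count n such that box / n <= spacing
min_pme_grid() {
  # $1: box length along one axis (A), $2: maximum grid spacing (A)
  awk -v box="$1" -v spacing="$2" 'BEGIN {
    n = int(box / spacing)         # grid count rounded down
    if (n * spacing < box) n++     # round up when the box is not yet covered
    print n
  }'
}

# Box length and pme_max_spacing used in this tutorial
min_pme_grid 50.2 1.2
```

With 42 grid points the spacing is about 1.195 Å, just under the 1.2 Å limit, whereas 41 points would give about 1.224 Å and violate it.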

We also see the [MINIMIZE] section, in which we specify the method to be used in the energy minimization. We carry out a 2,000-step minimization with the steepest descent (SD) algorithm. Since the target system has a periodic boundary condition (PBC), we set type = PBC in the [BOUNDARY] section, and in addition, the initial simulation box size is set to 50.2 Å in each direction by box_size_x, box_size_y, and box_size_z. We employ the particle mesh Ewald (PME) method for the long-range interaction calculation [2], which is specified by electrostatic = PME in the [ENERGY] section. The grid size used in PME is automatically determined by setting “pme_max_spacing”. Here, pme_max_spacing = 1.2 indicates that the grid spacing does not exceed 1.2 Å in such an automatic scheme. We use a cut-off distance of 12.0 Å for the non-bonded interactions, and a switching distance of 10.0 Å for the van der Waals interactions. This combination of 10-12 Å for switching and cutoff has been recently recommended in the case of the CHARMM C36 force field with explicit solvent.

Since top_all36_prot.rtf and par_all36m_prot.prm contain force field parameters for proteins only, we additionally give a stream file (toppar_water_ions.str) in the [INPUT] section, which contains both topology and parameters for water and ions. In the [OUTPUT] section, we specify not only dcdfile but also rstfile. The rstfile is called a “restart file”, which contains the coordinates of the atoms at the last step of the energy minimization. This file will be used as the input file of the subsequent MD simulation.

[INPUT]
topfile = ../toppar/top_all36_prot.rtf       # topology file
parfile = ../toppar/par_all36m_prot.prm      # parameter file
strfile = ../toppar/toppar_water_ions.str    # stream file
psffile = ../1_setup/5_ionize/ionized.psf    # protein structure file
pdbfile = ../1_setup/5_ionize/ionized.pdb    # PDB file

[OUTPUT]
dcdfile = min.dcd           # DCD trajectory file
rstfile = min.rst           # restart file

[ENERGY]
forcefield       = CHARMM   # [CHARMM]
electrostatic    = PME      # [PME]
switchdist       = 10.0     # switch distance
cutoffdist       = 12.0     # cutoff distance
pairlistdist     = 13.5     # pair-list distance
vdw_force_switch = YES      # force switch option for van der Waals
pme_nspline      = 4        # order of B-spline in [PME]
pme_max_spacing  = 1.2      # Max grid spacing allowed

[MINIMIZE]
method          = SD        # [SD]
nsteps          = 2000      # number of minimization steps
eneout_period   = 50        # energy output period
crdout_period   = 50        # coordinates output period
rstout_period   = 2000      # restart output period
nbupdate_period = 10        # nonbond update period

[BOUNDARY]
type       = PBC            # [PBC]
box_size_x = 50.2           # box size (x) in [PBC]
box_size_y = 50.2           # box size (y) in [PBC]
box_size_z = 50.2           # box size (z) in [PBC]

Let’s carry out the energy minimization using spdyn.
In the following commands, we use 4 MPI processes and 3 OpenMP threads, namely, 12 CPU cores in total. The appropriate numbers depend on the user’s computer environment; see the Usage page (https://www.r-ccs.riken.jp/labs/cbrt/usage/) for details. In this case, it may take ~30 seconds to finish the calculation. After the calculation, check the trajectory by using VMD. We can see that the atoms move only slightly, but the atomic clashes are removed.

# Run energy minimization
$ export OMP_NUM_THREADS=3
$ mpirun -np 4 ../../../GENESIS/bin/spdyn INP > log
# Check the trajectory
$ vmd ../1_setup/5_ionize/ionized.pdb -psf ../1_setup/5_ionize/ionized.psf -dcd min.dcd

Since a DCD file is a binary file rather than a text file, its contents cannot be read directly. For the details of the DCD file format, please refer to the GENESIS homepage, https://www.r-ccs.riken.jp/labs/cbrt/tutorials2019/tutorial-4-1/.

Let’s check the potential energy, which is written in the 3rd column of the “INFO:” lines in the log file. The following are example commands to plot the potential energy by using gnuplot. First, we pick up the lines that contain “INFO:” from the log file by using the grep command, and write them to a new file “energy.log”. In the resulting plot, we can see that the potential energy decreases during the minimization.

# Select lines that contain "INFO:" using the "grep" command
$ grep "INFO:" log > energy.log
$ less energy.log

In the middle of the log, we can see the energy of the system at each step, in the lines that start with “INFO:”. Here, STEP is the number of minimization steps, POTENTIAL_ENE is the potential energy, and RMSG is the root-mean-square gradient (the force averaged over all atoms). The subsequent values are the components of the potential energy of the CHARMM force field (bond, angle, Urey-Bradley, dihedral angle, improper torsion angle, CMAP, van der Waals, and Coulomb). In the log files from MD simulations, different parameters will be shown.
:
INFO:  STEP  POTENTIAL_ENE        RMSG       BOND      ANGLE  UREY-BRADLEY  DIHEDRAL  IMPROPER      CMAP      VDWAALS        ELECT
 --------------- --------------- --------------- --------------- ---------------
INFO:     0    254306.9272  13348.6017  7432.8708  1871.1490       24.1341  526.9444    3.6933  -33.3861  328716.6661  -84235.1446
INFO:    50    -38521.4679    761.3856  7370.8290  1907.8206       24.0121  526.8929    3.6344  -33.4396   36083.2027  -84404.4201
INFO:   100    -63267.2047     21.9616  5886.9670  1916.7500       22.6066  526.2761    3.0337  -34.0503   13046.5664  -84635.3543
INFO:   150    -69134.2996     11.6891  3812.3576  1809.3366       20.0396  525.0869    2.4381  -35.0795   10265.3326  -85533.8116
:


# Check the potential energy change
$ gnuplot
gnuplot> set key autotitle columnhead
gnuplot> set xlabel "Steps"
gnuplot> set ylabel "Potential energy (kcal/mol)"
gnuplot> plot 'energy.log' using 3 with lines

The root-mean-square gradient (RMSG), which is written in the 4th column of the log file, is another criterion to validate the energy minimization. Let’s plot the RMSG as well.
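As a complement to the gnuplot check, the same “INFO:” lines can be inspected with a short script. The following is a minimal Python sketch, not part of GENESIS; the function name parse_info_lines is hypothetical, and the sample text is the log excerpt shown above. It assumes that the header line is the one whose second field is STEP.

```python
def parse_info_lines(lines):
    """Parse "INFO:" lines into a list of dicts keyed by the header names."""
    header = None
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != "INFO:":
            continue  # skip separator lines and everything that is not an INFO line
        if fields[1] == "STEP":
            header = fields[1:]  # STEP, POTENTIAL_ENE, RMSG, ...
            continue
        if header is None:
            continue  # data found before any header cannot be labelled
        values = [float(value) for value in fields[1:]]
        rows.append(dict(zip(header, values)))
    return rows


# Excerpt of the minimization log shown above
sample_log = """\
INFO: STEP POTENTIAL_ENE RMSG BOND ANGLE UREY-BRADLEY DIHEDRAL IMPROPER CMAP VDWAALS ELECT
 --------------- --------------- --------------- --------------- ---------------
INFO: 0 254306.9272 13348.6017 7432.8708 1871.1490 24.1341 526.9444 3.6933 -33.3861 328716.6661 -84235.1446
INFO: 50 -38521.4679 761.3856 7370.8290 1907.8206 24.0121 526.8929 3.6344 -33.4396 36083.2027 -84404.4201
INFO: 100 -63267.2047 21.9616 5886.9670 1916.7500 22.6066 526.2761 3.0337 -34.0503 13046.5664 -84635.3543
INFO: 150 -69134.2996 11.6891 3812.3576 1809.3366 20.0396 525.0869 2.4381 -35.0795 10265.3326 -85533.8116
"""

rows = parse_info_lines(sample_log.splitlines())
energies = [row["POTENTIAL_ENE"] for row in rows]
rmsg = [row["RMSG"] for row in rows]

# Both quantities should go down as the minimization proceeds
energy_decreases = all(later < earlier for earlier, later in zip(energies, energies[1:]))
rmsg_decreases = all(later < earlier for earlier, later in zip(rmsg, rmsg[1:]))
print(energy_decreases, rmsg_decreases)
```

To use it on a real run, the lines of the log file can be passed instead of the excerpt, for example parse_info_lines(open("log")).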

3. Equilibration

Our final goal in this tutorial is to perform an MD simulation of the protein at T = 298.15 K and P = 1 atm in the NPT (isothermal-isobaric) ensemble. It is possible to start the MD simulation directly in the NPT ensemble. However, this is generally not recommended, because the initial structure was artificially constructed (in 1_setup), and such artifacts can affect the accuracy or stability of the early stage of the MD simulation. Specifically, the density of water around the protein in the initial structure may deviate from the experimental or ideal value at T = 298.15 K and P = 1 atm. If we carry out an NPT-MD simulation for such a system just after the energy minimization, strong forces may be generated on the atoms to control the temperature or pressure, which makes the simulation unstable. The target protein structure may also be broken by such strong forces. Therefore, gradual equilibration of the system is usually required. In this tutorial, we show a general scheme for the equilibration, which is applicable to various systems.

Let’s change directory for equilibration. In the directory, there are three control files, which will be executed one by one from INP1 to INP3.

# Change directory for the equilibration
$ cd ../3_equilibrate
$ ls
INP1 INP2 INP3

Step1. NVT-MD with positional restraint


In Step1, we use INP1 and carry out a short MD simulation in the NVT ensemble, where the position of the protein is fixed by using a positional restraint. Namely, only the water molecules are equilibrated. Using the NVT ensemble allows us to equilibrate the system more gently than with the NPT ensemble.

The following is an important part of the control file. We carry out a 50-ps MD simulation at T = 298.15 K. The equations of motion are integrated by the velocity Verlet algorithm with a time step of 2 fs, where the SHAKE/RATTLE [3,4] and SETTLE [5] algorithms are employed for bond constraints. The temperature is controlled with the Bussi thermostat [6]. The [ENERGY] section is the same as in the previous energy minimization. Note that the box size is NOT specified in the [BOUNDARY] section, because it is read from the restart file (min.rst).

The positional restraint is specified by using the [SELECTION] and [RESTRAINTS] sections. We select the heavy atoms of the protein by “sid:PROA and heavy” in the [SELECTION] section, where “sid” means segment index. As seen in ionized.pdb, the protein has the segment name “PROA”. In the [RESTRAINTS] section, we turn on the positional restraint (POSI), and set the force constant of the restraint function to 1 kcal/mol/Å^2. The reference coordinates of the positional restraint are given by reffile in the [INPUT] section.
[INPUT]
:
pdbfile = ../1_setup/5_ionize/ionized.pdb    # PDB file
reffile = ../1_setup/5_ionize/ionized.pdb    # reference PDB file for restraint
rstfile = ../2_minimize/min.rst              # restart file

[DYNAMICS]
integrator      = VVER      # [LEAP,VVER]
nsteps          = 25000     # number of MD steps
timestep        = 0.002     # timestep (ps)
eneout_period   = 500       # energy output period
crdout_period   = 500       # coordinates output period
rstout_period   = 25000     # restart output period
nbupdate_period = 10        # nonbond update period

[ENSEMBLE]
ensemble    = NVT           # [NVE,NVT,NPT,NPAT,NPgT]
tpcontrol   = BUSSI         # [NO,BERENDSEN,BUSSI,LANGEVIN]
temperature = 298.15        # initial and target temperature (K)

[BOUNDARY]
type = PBC                  # [PBC]

[SELECTION]
group1 = sid:PROA and heavy # restraint group 1

[RESTRAINTS]
nfunctions    = 1           # number of restraint functions
function1     = POSI        # POSI: Positional restraint is employed
direction1    = ALL         # ALL: x,y,z-coordinates are restrained
constant1     = 1.0         # force constant (kcal/mol/A^2)
select_index1 = 1           # restrained groups

Let’s execute spdyn for INP1, and view the obtained DCD file by using VMD. We can see that the protein is almost fixed at the center of the system due to the positional restraint. Water molecules spread out from the simulation box. This is not a problem, because the periodic boundary condition is applied in the MD simulation. Water molecules outside the box are located in the “image” cells.

In this course, these calculations take too long (~10 minutes), so let’s run only Step 1 (to check that you know how to run it):

# Run the equilibration MD step1
$ export OMP_NUM_THREADS=3
$ mpirun -np 4 ../../../GENESIS/bin/spdyn INP1 > log1
# View the trajectory using VMD
$ vmd ../1_setup/5_ionize/ionized.pdb -dcd eq1.dcd

For Step 2 and Step 3, please copy the results of previously calculated MD simulations:

# Copy files of the MD results
$ tar zxvf /lustrefs0/workshop/data/prac10genesis/tutorial-3.3-equi-results.tar.gz

If you want to wrap all molecules into the unit cell, the following VMD command is useful:

# View the trajectory using VMD (all molecules are wrapped into the unit cell)
$ vmd ../1_setup/5_ionize/ionized.pdb -dcd eq1.dcd
vmd > pbc wrap -compound fragment -center origin -all

Let’s check the time course of the temperature (17th column in the log file) by using gnuplot. In our results, the initial temperature became very high (~450 K), presumably because unexpectedly fast velocities were generated on many water molecules due to the strong forces described above. The temperature then suddenly decreased to ~250 K, and gradually increased up to 300 K during 50 ps, implying that the water molecules were almost equilibrated.

If you want to equilibrate the system more moderately, you can gradually heat up the system from 0 to 298.15 K with the time step of 1 fs (see https://www.r-ccs.riken.jp/labs/cbrt/tutorials2019/tutorial-3-2/#Appendix for details).

Step2. NPT-MD with positional restraint

In Step2, we switch the ensemble to NPT in order to regulate the box size. We specify “ensemble = NPT” in the [ENSEMBLE] section. The target pressure P = 1 atm is also set in the same section. The other sections are common to Step1. We carry out a 50-ps MD simulation at T = 298.15 K and P = 1 atm using the Bussi thermostat and barostat [6,7]. Again, the equations of motion are integrated by the velocity Verlet algorithm with a time step of 2 fs, where the SHAKE/RATTLE and SETTLE algorithms are employed for bond constraints. The positional restraint is still applied to the heavy atoms of the protein. Note that the restart file obtained in Step1 is specified in the [INPUT] section in order to continue the MD using the coordinates and velocities from Step1.

[ENSEMBLE]
ensemble    = NPT           # [NVE,NVT,NPT,NPAT,NPgT]
tpcontrol   = BUSSI         # [NO,BERENDSEN,BUSSI,LANGEVIN]
temperature = 298.15        # initial and target temperature (K)
pressure    = 1.0           # target pressure (atm)

Let’s execute spdyn for INP2, and view the obtained trajectory using VMD.

# Run the equilibration MD step2 (do not run this; copy the results instead)
$ mpirun -np 4 ../../../GENESIS/bin/spdyn INP2 > log2
# View the trajectory using VMD
$ vmd ../1_setup/5_ionize/ionized.pdb -dcd eq2.dcd
vmd > pbc wrap -compound fragment -center origin -all

Let’s check the time course of the box size (19th column in the log file), because it should change in the NPT ensemble. You can see that the simulation box shrinks quickly within 5 ps, and oscillates after 10 ps. In this situation, the density of water should be well adjusted to a certain value at T = 298.15 K and P = 1 atm.

Let’s plot the pressure as well.

Step3. Pre-production run with restraint

In Steps 1 and 2, we used the velocity Verlet integrator (VVER) with a time step of 2 fs, namely, the “traditional scheme”. This is because we want to equilibrate the system carefully. However, in the subsequent production run, we are going to use the RESPA integrator (VRES) [8] with a time step of 2.5 fs, namely, the “advanced scheme”, to speed up the calculation or to extend the simulation time. In Step3, we equilibrate the system with the advanced scheme as a “pre-production” run. We carry out a 50-ps MD simulation at T = 298.15 K and P = 1 atm using the Bussi thermostat and barostat. The positional restraint is still applied to the heavy atoms of the protein. In the RESPA integrator, the PME calculation is performed every 2 steps, and the thermostat and barostat momenta are updated every 10 steps.

[DYNAMICS]
integrator        = VRES    # [LEAP,VVER,VRES]
nsteps            = 20000   # number of MD steps (50ps)
timestep          = 0.0025  # timestep (2.5fs)
eneout_period     = 400     # energy output period (1ps)
crdout_period     = 400     # coordinates output period (1ps)
rstout_period     = 20000   # restart output period
nbupdate_period   = 10      # nonbond update period
elec_long_period  = 2       # period of reciprocal space calculation
thermostat_period = 10      # period of thermostat update
barostat_period   = 10      # period of barostat update

Let’s execute spdyn for INP3 using the same number of CPU cores as in Step 2.

# Run the equilibration MD step3
$ mpirun -np 4 ../../../GENESIS/bin/spdyn INP3 > log3

We can easily recognize that the elapsed time in Step 3 is significantly reduced (~1.5 times faster) compared to Step2, although we carried out the same 50-ps MD simulation. The obtained restart file (eq3.rst) is used in the subsequent production run.
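The simulated time and the number of saved snapshots follow directly from nsteps, timestep, and crdout_period. The following is a small Python sketch of this arithmetic for the Step1 and Step3 settings quoted in this section; the helper name run_summary is hypothetical and is not a GENESIS tool.

```python
def run_summary(nsteps, timestep_ps, crdout_period):
    """Return (total time in ps, number of saved frames, frame interval in ps)."""
    total_ps = round(nsteps * timestep_ps, 6)            # simulated time
    n_frames = nsteps // crdout_period                   # snapshots written to the DCD file
    frame_interval_ps = round(crdout_period * timestep_ps, 6)
    return total_ps, n_frames, frame_interval_ps


# Step1: velocity Verlet (VVER) with a 2-fs time step
step1 = run_summary(nsteps=25000, timestep_ps=0.002, crdout_period=500)
# Step3: RESPA (VRES) with a 2.5-fs time step
step3 = run_summary(nsteps=20000, timestep_ps=0.0025, crdout_period=400)

print(step1)
print(step3)
```

Both settings give the same 50 ps of simulated time with one snapshot per ps; Step3 simply reaches it in fewer steps.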

[Figure: box size versus time. x-axis: time (ps), 0 to 50; y-axis: “boxsize(A3)”, about 61.8 to 63.6; plotted series: boxx.]


# Compare the computational time between Step2 and Step3
$ grep "dynamics " log2 log3
log2: dynamics = 343.291
log3: dynamics = 227.322

In general, it is difficult to decide when the equilibration should be stopped. In this tutorial, the equilibration was stopped at 50 ps in each step, because the temperature reached the target in Step1, and the box size showed sufficient oscillation in Step2. The appropriate time depends on the system. Long equilibration (e.g., 10 ns or longer) is usually required for large or complicated systems such as membrane proteins.
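The “~1.5 times faster” estimate for Step 3 can be reproduced from the two timing lines above. The following is a minimal Python sketch under the assumption that the grep output has exactly the form shown; the function name dynamics_times is hypothetical.

```python
import re


def dynamics_times(grep_output):
    """Map each log name to the value on its "dynamics =" line."""
    times = {}
    for line in grep_output.splitlines():
        match = re.match(r"\s*(\S+):\s*dynamics\s*=\s*([0-9.]+)", line)
        if match:
            times[match.group(1)] = float(match.group(2))
    return times


# Output of: grep "dynamics " log2 log3 (values from the run shown above)
grep_output = """\
log2: dynamics = 343.291
log3: dynamics = 227.322
"""

times = dynamics_times(grep_output)
speedup = times["log2"] / times["log3"]
print(f"{speedup:.2f}")  # prints 1.51
```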

4. Production

We carry out the production run. The simulation conditions are the same as those in Step 3 of the previous equilibration, but the positional restraint is now turned off. We execute spdyn for INP1 to INP5 sequentially, each of which corresponds to a 100-ps MD run. We will obtain 500 ps of MD trajectories in total. Of course, you can use just one control file for the 500-ps MD, where the number of MD steps is set to 200,000. However, please keep in mind that “short sequential runs” are practically more convenient than a “long single run” in most cases. The main purpose is quick recovery from computational failures or other accidents, or finishing each job within a limited time.

In the example below, we use 8 MPI processes with 2 OpenMP threads to run spdyn. However, these numbers need to be adjusted based on the available computer resources and the system size by performing a preliminary benchmark test (see https://www.r-ccs.riken.jp/labs/cbrt/tutorials2019/tutorial-3-3/#Benchmark_test for examples).

# Change directory for production run
$ cd ../4_production
$ ls
INP1 INP2 INP3 INP4 INP5

In this course, these calculations take too long, so the commands for the production runs are shown only for reference:

# Production run for 0-100 ps (restart from eq3.rst)
$ export OMP_NUM_THREADS=2
$ mpirun -np 8 ../../../GENESIS/bin/spdyn INP1 > log1
# Production run for 100-200 ps (restart from md1.rst)
$ mpirun -np 8 ../../../GENESIS/bin/spdyn INP2 > log2
# Production run for 200-300 ps (restart from md2.rst)
$ mpirun -np 8 ../../../GENESIS/bin/spdyn INP3 > log3
# Production run for 300-400 ps (restart from md3.rst)
$ mpirun -np 8 ../../../GENESIS/bin/spdyn INP4 > log4
# Production run for 400-500 ps (restart from md4.rst)
$ mpirun -np 8 ../../../GENESIS/bin/spdyn INP5 > log5

Instead, please copy the results of previously calculated MD simulations:

# Copy files of the MD results
$ tar zxvf /lustrefs0/workshop/data/prac10genesis/tutorial-3.4-production-results.tar.gz
# Check the output files
$ ls
INP1 INP3 INP5 log2 log4 md1.dcd md2.dcd md3.dcd md4.dcd md5.dcd
INP2 INP4 log1 log3 log5 md1.rst md2.rst md3.rst md4.rst md5.rst

We obtain log1, md1.dcd, and md1.rst from INP1 for the first 100 ps, …, and log5, md5.dcd, and md5.rst from INP5 for the last 100 ps. Let’s view the trajectories by using VMD. All trajectory files can be loaded at the same time with the following command. We can see that the protein is fluctuating.

# View the MD trajectories using VMD
$ vmd ../1_setup/5_ionize/ionized.pdb -psf ../1_setup/5_ionize/ionized.psf -dcd md{1..5}.dcd

The protein is quite stable, and does not show a large conformational change in 500 ps.
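The “short sequential runs” scheme described in this section lends itself to simple scripting: the run to resume from is the first one whose restart file is missing. The following Python sketch only builds the command strings and does not execute anything; the helper names production_commands and next_run are hypothetical, and the file naming (INPi, logi, mdi.rst) follows the listing above.

```python
def production_commands(n_runs, n_mpi=8, spdyn="../../../GENESIS/bin/spdyn"):
    """Build one spdyn command line per control file INP1..INPn."""
    return [f"mpirun -np {n_mpi} {spdyn} INP{i} > log{i}" for i in range(1, n_runs + 1)]


def next_run(existing_files, n_runs):
    """Return the index of the first run without a restart file, or None if all are done."""
    for i in range(1, n_runs + 1):
        if f"md{i}.rst" not in existing_files:
            return i
    return None


commands = production_commands(5)

# Example: runs 1 and 2 finished before an interruption, so we resume at run 3
resume_at = next_run({"md1.rst", "md2.rst"}, 5)
for command in commands[resume_at - 1:]:
    print(command)
```

In practice the set of existing files could come from os.listdir("."), and each command could be submitted to the queue instead of printed.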

5. Analysis

Now, we have 5 log files for the energy trajectories, and 5 DCD files for the coordinate trajectories. In this section, we learn how to analyze these multiple log files and DCD files. We mainly learn how to process the coordinate trajectory files for effective analysis. We also analyze the root-mean-square deviation (RMSD) of the protein using the processed DCD file.

# Change directory for analysis
$ cd ../5_analysis
$ ls
1_energy 2_crd_convert_wrap 3_crd_convert_pro 4_rmsd


5.1 Energy

We analyze the time course of the temperature over 500 ps. The temperature is found in the 16th column of the log1–log5 files. First, we pick up the “INFO:” lines from log1 by using the grep command, but exclude the first line (header information) by using “tail -n +2”. The results are written to “energy.log”. Second, we execute the same commands for log2, and the results are appended to “energy.log” by using “>>”. This is repeated until log5. Eventually, we obtain the combined log data.

# Change directory for the energy analysis
$ cd 1_energy
# Combine all log data
$ grep "INFO:" ../../4_production/log1 | tail -n +2 > energy.log
$ grep "INFO:" ../../4_production/log2 | tail -n +2 >> energy.log
$ grep "INFO:" ../../4_production/log3 | tail -n +2 >> energy.log
$ grep "INFO:" ../../4_production/log4 | tail -n +2 >> energy.log
$ grep "INFO:" ../../4_production/log5 | tail -n +2 >> energy.log

Let’s plot the time course of the temperature by using gnuplot. In the previous tutorial, we used the 3rd column for the X-axis, which simply corresponds to the simulation time. However, in the combined log file, we cannot use the 3rd column, because the time was reset to 0 ps after each restart (see the first step in log2). Accordingly, we use the “0th” column for plotting, which corresponds to the line number. In energy.log, there are 1,000 lines for 500 ps. If the 0th column is divided by two, the line number is converted to time in ps, which is specified by “($0/2)” in the gnuplot command:

# Plot the temperature change
$ gnuplot
gnuplot> unset key
gnuplot> set xlabel "Time (ps)"
gnuplot> set ylabel "Temperature (K)"
gnuplot> plot 'energy.log' u ($0/2):16 with lines


We can see that the temperature fluctuates around a certain value. The averaged temperature is indeed close to the target temperature.

# Compute the averaged temperature over 500 ps
$ awk '{sum+=$16} END {print sum/NR}' energy.log
297.781
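The grep, tail, gnuplot, and awk steps of this subsection can also be written as one short script. The following Python sketch mirrors them on toy data; the function names are hypothetical, and the three-column toy logs are NOT the real GENESIS log layout (in the real production logs the temperature is the 16th column, and each line corresponds to 0.5 ps).

```python
def combine_info_lines(logs):
    """Concatenate the INFO: data lines of several logs, dropping each header line."""
    combined = []
    for lines in logs:
        info = [line for line in lines if "INFO:" in line]
        combined.extend(info[1:])  # same effect as: grep "INFO:" log | tail -n +2
    return combined


def column(lines, index):
    """Return a 1-based whitespace-separated column as floats, like awk's $index."""
    return [float(line.split()[index - 1]) for line in lines]


def time_axis_ps(n_lines, ps_per_line=0.5):
    """Continuous time axis from the line number, like ($0/2) in gnuplot."""
    return [i * ps_per_line for i in range(n_lines)]


# Toy logs (hypothetical layout): the TIME column restarts in the second log
log_a = ["INFO: STEP TIME TEMP", "INFO: 200 0.5 300.0", "INFO: 400 1.0 298.0"]
log_b = ["INFO: STEP TIME TEMP", "INFO: 200 0.5 297.0", "INFO: 400 1.0 299.0"]

combined = combine_info_lines([log_a, log_b])
temperature = column(combined, 4)   # use 16 for the real production logs
times = time_axis_ps(len(combined))
mean_temperature = sum(temperature) / len(temperature)
print(times, mean_temperature)
```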

5.2 Make a trajectory file with PBC wrapping If we load the DCD trajectories in VMD, we can see that water molecules are spread out from the simulation box, and Protein G also shows translation and rotation, which makes it inconvenient to analyze the trajectories. By using the “crd_convert” tool in the GENESIS program package, we can convert such trajectories to those in which all molecules are wrapped into the unit cell, and the target protein is re-centered.

Let’s change directory to “2_crd_convert_wrap” for this process. The control file is already provided in the directory. The following shows the most important part of the control file. When the trajectory is converted, the center of mass of the protein is shifted to the origin (centering, centering_atom, and center_coord options), and all molecules are wrapped into the unit cell (pbc_correct option). Here, psffile should be specified in the [INPUT] section to wrap molecules in the case of the CHARMM force field. The coordinates of all atoms in the system are output to a new DCD file “output.dcd” (trjfile in the [OUTPUT] section and the trjout options). In the original 5 DCD files, there are 1,000 snapshots in total (200 snapshots in each). To reduce the file size, we set the trajectory analysis period (ana_period1) to 800 instead of using the original crdout_period of 200. Accordingly, the new DCD file contains 250 snapshots in total.

[INPUT]
psffile = ../../1_setup/5_ionize/ionized.psf   # protein structure file
reffile = ../../1_setup/5_ionize/ionized.pdb   # PDB file

[OUTPUT]
pdbfile = output.pdb                    # PDB file
trjfile = output.dcd                    # trajectory file

[TRAJECTORY]
trjfile1 = ../../4_production/md1.dcd   # trajectory file
:
trjfile5 = ../../4_production/md5.dcd   # trajectory file
md_step1      = 40000                   # number of MD steps
mdout_period1 = 200                     # MD output period (crdout_period)
ana_period1   = 800                     # analysis period
repeat1       = 5

[SELECTION]
group1 = sid:PROA                       # selection group 1
group2 = all                            # selection group 2

[OPTION]
centering      = YES                    # shift center of mass
centering_atom = 1                      # atom group
center_coord   = 0.0 0.0 0.0            # target center coordinates
pbc_correct    = MOLECULE               # (NO/MOLECULE)
trjout_format  = DCD                    # (PDB/DCD)
trjout_type    = COOR+BOX               # (COOR/COOR+BOX)
trjout_atom    = 2                      # atom group

Let’s execute crd_convert, and check the obtained DCD file with VMD. This file is useful for analyzing protein-water interactions, because in the original DCD files water molecules that interact with the protein may be located in image cells (and thus appear far from the protein).

# Convert the trajectory (wrap molecules)
$ ../../../GENESIS/bin/crd_convert INP > log
$ ls
INP log output.pdb output.dcd
# Check the trajectory using VMD
$ vmd output.pdb -dcd output.dcd
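The snapshot counts quoted for this conversion follow from the [TRAJECTORY] settings. The following Python sketch spells out the arithmetic; the helper name snapshot_count is hypothetical and only illustrates how md_step, the output or analysis period, and repeat combine.

```python
def snapshot_count(md_step, period, repeat):
    """Number of snapshots obtained from `repeat` files of `md_step` steps each."""
    if md_step % period != 0:
        raise ValueError("md_step must be a multiple of the period")
    return repeat * (md_step // period)


# Snapshots stored in the five original DCD files (mdout_period1 = 200)
original = snapshot_count(md_step=40000, period=200, repeat=5)
# Snapshots kept after conversion (ana_period1 = 800)
converted = snapshot_count(md_step=40000, period=800, repeat=5)
# Every 4th stored snapshot is kept
thinning = 800 // 200

print(original, converted, thinning)  # prints 1000 250 4
```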

5.3 Make a trajectory file including protein only

By using crd_convert, we can also make a new DCD file in which only protein atoms are contained while the Cα atoms are fitted to the initial structure.


Let’s change directory to “3_crd_convert_pro”. The control file is already provided in the directory. The following shows the most important part of the control file. We select the protein atoms (group2 and trjout_atom), and write them into a new DCD file (output.dcd), where the Cα atoms are fitted to the initial coordinates (ionized.pdb) by rigid-body translation and rotation (TR+ROT) (group1 and the [FITTING] options). We analyze all snapshots in the original 5 DCD files, each of which contains 200 snapshots, resulting in 1,000 frames in the new DCD file.

[SELECTION]
group1 = an:CA              # selection group 1
group2 = sid:PROA           # selection group 2

[FITTING]
fitting_method = TR+ROT     # NO/TR+ROT/TR/TR+ZROT/XYTR/XYTR+ZROT
fitting_atom   = 1          # atom group

[OPTION]
trjout_format = DCD         # (PDB/DCD)
trjout_type   = COOR+BOX    # (COOR/COOR+BOX)
trjout_atom   = 2           # atom group

Let’s execute crd_convert, and check the obtained DCD file with VMD. This file is useful for analyzing the protein structure itself, because the coordinates of the water molecules are not needed for that purpose. The original system contains a large number of water molecules, and reading their coordinates from the original DCD files during trajectory analysis would waste a lot of time. In the next section, we analyze the RMSD by using the obtained DCD file instead of the original DCD files.

5.4 Root-mean-square deviation (RMSD)

To examine the structural stability of the protein, we analyze the RMSD of the Cα atoms with respect to the initial structure (../3_crd_convert_pro/output.pdb). We use the DCD file obtained in the previous sub-section. In the analysis, each snapshot is fitted to the initial structure with the rigid-body translation and rotation (TR+ROT).

$ less INP

[INPUT]
reffile = ../3_crd_convert_pro/output.pdb    # PDB file

[TRAJECTORY]
trjfile1 = ../3_crd_convert_pro/output.dcd   # trajectory file
md_step1      = 1000        # number of MD steps
mdout_period1 = 1           # MD output period
ana_period1   = 1           # analysis period
repeat1       = 1
:

[SELECTION]
group1 = an:CA              # selection group 1

[FITTING]
fitting_method = TR+ROT     # NO/TR+ROT/TR/TR+ZROT/XYTR/XYTR+ZROT
fitting_atom   = 1          # atom group

[OPTION]
analysis_atom = 1           # atom group

# Run RMSD analysis
$ rmsd_analysis INP > log

We can see that the RMSD is less than 1 Å over 500 ps, suggesting that the protein structure is quite stable. However, the RMSD is still gradually increasing, and a longer simulation might be necessary to reach convergence.
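For reference, the quantity reported by this analysis is the root of the mean squared displacement of the selected atoms from the reference. The following Python sketch computes it for coordinates that are assumed to be already superimposed on the reference; it does not perform the TR+ROT fitting that rmsd_analysis applies, and the two-atom coordinates are toy values, not taken from the Protein G trajectory.

```python
import math


def rmsd(coords, ref):
    """RMSD between two equally long lists of (x, y, z) tuples, without any fitting."""
    if len(coords) != len(ref):
        raise ValueError("both structures must contain the same number of atoms")
    squared = sum(
        (x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2
        for (x, y, z), (x0, y0, z0) in zip(coords, ref)
    )
    return math.sqrt(squared / len(ref))


# Toy example: two atoms, both displaced by 1 Å along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
snapshot = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

print(rmsd(snapshot, reference))   # prints 1.0
print(rmsd(reference, reference))  # prints 0.0
```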

1. J. Huang et al., Nat. Methods, 14, 71-73 (2017).
2. T. Darden et al., J. Chem. Phys., 98, 10089-10092 (1993).
3. J. P. Ryckaert et al., J. Comput. Phys., 23, 327-341 (1977).
4. H. C. Andersen, J. Comput. Phys., 52, 24-34 (1983).
5. S. Miyamoto and P. A. Kollman, J. Comput. Chem., 13, 952-962 (1992).
6. G. Bussi et al., J. Chem. Phys., 126, 014101 (2007).
7. G. Bussi et al., J. Chem. Phys., 130, 074101 (2009).
8. M. Tuckerman et al., J. Chem. Phys., 97, 1990-2001 (1992).

4.1 Coarse-grained MD simulation with the All-Atom Go-model

In this tutorial, we have covered the usage of the CHARMM force field with GENESIS. However, CHARMM is not the only option; AMBER is another commonly used force field parameter set [1]. In order to use these parameters for flexible fitting, which is a variation of MD simulation, we need to go through the setup procedure covered in this tutorial. This is often not trivial for complex biological systems due to, for example, missing residues or model inaccuracy. The MD simulation itself, even without cryo-EM data, can be unstable.

One approach to simplify flexible fitting is the use of structure-based potentials, in which the force field parameters are set up to stabilize the given protein structure. Note that this is not a “generic” parameter set; for each protein structure, a set of parameters is determined based on the structure itself. Therefore, some properties, for example the energetic stability of different conformations, cannot be discussed with this model. One such model that we have previously used is the All-Atom Go model [2], and the SMOG webserver (http://smog-server.org) [3] can be conveniently used to prepare the parameter files from a PDB file.

Please copy the tutorial file and extract the tar-ball into your working directory:

# Copy the tutorial file
$ cd /path/to/Tutorials
$ cp /lustrefs0/workshop/data/prac10genesis/tutorial-4.1.tar.gz .
$ tar -xzvf tutorial-4.1.tar.gz
$ cd tutorial-4.1
$ ls
1_construction 2_production 3_analysis

This tutorial consists of three steps:

1. system setup, 2. production run, 3. trajectory analysis.

1. Setup the system

To build a simulation system, we start by downloading a PDB file of protein G from the RCSB Protein Data Bank. You can download the PDB file via a web browser, or use the following commands:

# Change to the construction directory
$ cd /path/to/1_construction
# Download the PDB file (PDB code 1PGB)
$ wget https://files.rcsb.org/download/1PGB.pdb

The SMOG server requires that the input PDB file contains only the atom information. The downloaded PDB file has to be “cleaned” by removing everything except the lines starting with ATOM, TER, and END. If there is a TER line just before the END line, that TER line should be removed (i.e., keep only END).

Access the SMOG webserver to generate the all-atom Go parameter files.

http://smog-server.org/cgi-bin/GenTopGro.pl

Most of the parameters can be left at their defaults, except for a few.

1. Upload the PDB file
   1pgb.pdb
5. If you don't plan to use periodic boundary conditions, then you can check this box and your system will not be placed in a box (this will override the above spacing definitions)
   This should be checked: in the all-atom Go model we do not simulate water molecules, and thus a periodic boundary condition is not necessary.
12. What nickname (20 character max) would you like to give this system?
   This can be any text label, for example, “ff”.

Press “Submit”. If successful, you will see the message “Generation of files completed.”. Download the generated file from “You can download the files HERE”. If not successful, you will see error messages. Most of the time, this is due to some records in the PDB file that cannot be processed by the SMOG server, and there is a note regarding the problems in the PDB file. Usually, keeping only the ATOM records with appropriate TER and END records solves the problem. (If unsuccessful, please copy an example from /lustrefs0/workshop/data/prac10genesis/.)

Extract the tar-ball into the working directory and you will find the coarse-grained coordinate, topology, and parameter files.

# Extract the tar-ball file (the number will be different)
$ tar xvf /path/to/ff.25164.pdb.tar.gz

The main outputs from SMOG are the .gro and .top files. These are file formats used by the GROMACS MD package. GENESIS can read these files with suitable options.
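The PDB “cleaning” rule described in this step (keep only ATOM, TER, and END records, and drop a TER that sits directly before END) can be sketched in a few lines of Python. This is an illustration only, not a SMOG or GENESIS tool; the function names are hypothetical and the input lines are toy placeholders, not the real 1PGB coordinates.

```python
def record_name(line):
    """Return the PDB record name, which occupies the first six columns."""
    return line[:6].strip()


def clean_pdb_lines(lines):
    """Keep ATOM, TER and END records; drop a TER placed directly before END."""
    kept = [line.rstrip("\n") for line in lines
            if record_name(line) in ("ATOM", "TER", "END")]
    if len(kept) >= 2 and record_name(kept[-1]) == "END" and record_name(kept[-2]) == "TER":
        del kept[-2]
    return kept


# Toy input (placeholder coordinates)
toy_pdb = [
    "HEADER    TOY EXAMPLE\n",
    "ATOM      1  N   MET A   1       0.000   0.000   0.000\n",
    "ATOM      2  CA  MET A   1       1.000   0.000   0.000\n",
    "HETATM    3  O   HOH A 101       5.000   5.000   5.000\n",
    "TER\n",
    "END\n",
]

cleaned = clean_pdb_lines(toy_pdb)
for line in cleaned:
    print(line)
```

For a real file, the lines could be read with open("1PGB.pdb").readlines() and the result written to a new file before uploading it to the server.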

2. Production simulation

Coarse-grained simulations are usually not very sensitive to the initial configuration. Therefore, here we perform a production simulation without energy minimization or equilibration. We run just 10^6 time steps of MD simulation, which should take about 5 minutes. (WARNING! This simulation is set up to show some results quickly. Much more sampling is necessary to obtain reliable results.)

# change to the production directory
$ cd /path/to/2_production/


The control file (run.inp) contains several sections, such as [INPUT], [OUTPUT], and [ENERGY], where we specify the control parameters for the simulation.

In the [INPUT] section, we set the file names for the grotopfile (topology file) and the grocrdfile (the initial structure) (see section 4.1 of the GENESIS manual for an explanation of each input file).

In the [OUTPUT] section, output filenames are set. atdyn does not create any output file unless we explicitly specify its name. In our example, the rstfile (restart file) and the dcdfile (binary trajectory file) are set (see section 4.2 of the GENESIS manual for an explanation of each output file).

In the [ENERGY] section, we specify the parameters related to the energy and force evaluation. AAGO is the name of the all-atom Go model in GENESIS. We set the cutoff values (switchdist=15, cutoffdist=15, pairlistdist=25). Different values can be considered to balance computational efficiency and accuracy.

The [DYNAMICS] section sets up the parameters for the MD engine of atdyn. For the all-atom Go model, the time step can be set to 0.5 fs (not really fs; this is in “reduced units”).

In the [ENSEMBLE] section, the LANGEVIN thermostat is chosen for an isothermal simulation, with the friction constant given by gamma_t (0.1 ps^-1 in the control file below).

Finally, in the [BOUNDARY] section, we set the boundary condition for the system, which is no boundary condition (NOBC) here.

[INPUT]
grotopfile = ../1_construction/ff.25164.pdb.sb/ff.25164.pdb.top # topology file
grocrdfile = ../1_construction/ff.25164.pdb.sb/ff.25164.pdb.gro # initial coordinate

[OUTPUT]
dcdfile = run.dcd # DCD trajectory file
rstfile = run.rst # restart file

[ENERGY]
forcefield = AAGO
electrostatic = CUTOFF
switchdist = 15.0 # in KBGO, this is ignored
cutoffdist = 15.0 # cutoff distance
pairlistdist = 25.0 # pair-list cutoff distance

[DYNAMICS]
integrator = LEAP # [LEAP,VVER]
nsteps = 1000000 # number of MD steps
timestep = 0.0005 # timestep (ps)
eneout_period = 10000 # energy output period
rstout_period = 10000 # restart output period
crdout_period = 10000 # coordinates output period
nbupdate_period = 10000 # nonbond update period

[CONSTRAINTS]
rigid_bond = NO

[ENSEMBLE]
ensemble = NVT # [NVE,NVT,NPT]
tpcontrol = LANGEVIN # thermostat
temperature = 115 # initial and target temperature (K)
gamma_t = 0.1 # thermostat friction (ps^-1) in [LANGEVIN]

[BOUNDARY]
type = NOBC # [PBC, NOBC]

We now perform the CG MD simulation with the all-atom Go model.

# set the number of OpenMP threads
$ export OMP_NUM_THREADS=3
# perform production simulation with ATDYN by using 4 MPI processes
$ mpirun -np 4 atdyn run.inp | tee run.out

The resulting MD trajectory can be viewed with VMD:

$ vmd ../1_construction/ff.25164.pdb.sb/ff.25164.pdb.gro run.dcd
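The relation between nsteps, crdout_period, and the number of coordinate outputs can be checked directly from a control file. The following is a minimal sketch, assuming the simple "key = value # comment" layout used in run.inp; ctrl_value and trajectory_frames are hypothetical helper names and are not part of GENESIS.

```shell
# Hypothetical helpers (not part of GENESIS).
# ctrl_value FILE KEY: print the value assigned to KEY in a control file.
ctrl_value() {
    awk -F'=' -v key="$2" '
        { sub(/#.*/, "") }                  # drop trailing comments
        NF >= 2 {
            k = $1; gsub(/[ \t]/, "", k)    # key name without blanks
            if (k == key) { v = $2; gsub(/[ \t]/, "", v); print v; exit }
        }
    ' "$1"
}

# trajectory_frames FILE: number of coordinate outputs, nsteps / crdout_period.
trajectory_frames() {
    echo $(( $(ctrl_value "$1" nsteps) / $(ctrl_value "$1" crdout_period) ))
}
```

With the values in run.inp (nsteps = 1000000, crdout_period = 10000), `trajectory_frames run.inp` gives 100. The same numbers are needed later when filling in md_step1 and mdout_period1 for the analysis tools.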

3. Analysis

3.1. RMSD calculation

As an example of analysis, we use crd_convert, which is one of the post-processing programs in GENESIS. crd_convert is a utility to calculate various quantities from a trajectory. The following commands calculate the RMSD of the simulation trajectory.

# change to the analysis directory
$ cd /path/to/3_analysis/
# set the number of OpenMP threads
$ export OMP_NUM_THREADS=1
# perform analysis with crd_convert
$ crd_convert run.inp | tee run.out

The control file for the RMSD calculation is shown below:

[INPUT]
grotopfile = ../1_construction/ff.25164.pdb.sb/ff.25164.pdb.top # topology file
grocrdfile = ../1_construction/ff.25164.pdb.sb/ff.25164.pdb.gro # coordinate

[OUTPUT]
rmsfile = run.rms # RMSD file

[TRAJECTORY]
trjfile1 = ../2_production/run.dcd # trajectory file
md_step1 = 1000000 # number of MD steps
mdout_period1 = 1000 # MD output period
ana_period1 = 1 # analysis period
trj_format = DCD # (PDB/DCD)
trj_type = COOR # (COOR/COOR+BOX)

[SELECTION]
group1 = an:CA # selection group 1

[FITTING]
fitting_method = TR+ROT # method
fitting_atom = 1 # atom group

[OPTION]
check_only = NO # (YES/NO)

In the [TRAJECTORY] section, we set the input trajectory file as trjfile1=../2_production/run.dcd. In the [SELECTION] section, we define a group (group1) of all Cα atoms in the model, which are used for the RMSD calculation. In the [FITTING] section, the fitting method is specified. In this case, translations (TR) and rotations (ROT) are used for the fitting. Finally, the RMSD output file (rmsfile=run.rms) is set in the [OUTPUT] section.

By running crd_convert, we get an RMSD file (run.rms). The first column of this file is the time step, and the second column is the RMSD value in Angstroms. One can visualize it with programs such as gnuplot:

$ gnuplot
gnuplot> set xlabel "Step"
gnuplot> set ylabel "RMSD [Angstrom]"
gnuplot> plot "run.rms" w lp

The RMSD value will probably increase; viewing the trajectory in VMD shows that the protein unfolds.
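Besides plotting, a quick numerical summary of the RMSD file can help when comparing runs. The sketch below assumes the two-column layout described above (step, RMSD in Angstroms); rms_summary is a hypothetical helper name, not a GENESIS tool.

```shell
# Hypothetical helper (not part of GENESIS).
# rms_summary FILE: report frame count, mean RMSD, and maximum RMSD
# from a two-column file (step, RMSD) such as run.rms.
rms_summary() {
    awk '
        NF >= 2 && $1 !~ /^#/ {
            n++
            sum += $2
            # Track the largest RMSD and the step at which it occurs.
            if (n == 1 || $2 + 0 > max) { max = $2 + 0; at = $1 }
        }
        END {
            if (n > 0)
                printf "frames=%d mean=%.3f max=%.3f at_step=%s\n", n, sum / n, max, at
        }
    ' "$1"
}
```

A possible use is `rms_summary run.rms`. A maximum far above the mean, reached late in the run, would be consistent with the unfolding seen in VMD.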

5.2 Distance

Second, we analyze the distance between selected atoms in the protein. Here, we select the “CA” atoms at the N-terminus and the C-terminus. The control file for the analysis is already contained in the directory. We use the “trj_analysis” tool, as in Subsection 3.2 of Tutorial 3.1.

$ cd ../5_distance
$ ls
INP
# View the control file
$ less INP

In the [TRAJECTORY] section, we set md1.dcd, md2.dcd, …, and md5.dcd (only trjfile1 is shown below). In each production run, we carried out a 1,000,000-step MD with crdout_period = 1000; these are specified as md_step1 and mdout_period1, respectively. Since we want to analyze all snapshots, ana_period1 is set equal to mdout_period1. In the [OPTION] section, we select the atoms to be analyzed.

[INPUT]
psffile = ../../1_setup/5_ionize/ionized.psf # protein structure file
reffile = ../../1_setup/5_ionize/ionized.pdb # PDB file

[OUTPUT]
disfile = output.dis # distance file

[TRAJECTORY]
trjfile1 = ../../4_production/md1.dcd # trajectory file
md_step1 = 1000000 # number of MD steps
mdout_period1 = 1000 # MD output period
ana_period1 = 1000 # analysis period
repeat1 = 1
trj_format = DCD # (PDB/DCD)
trj_type = COOR+BOX # (COOR/COOR+BOX)
trj_natom = 0 # (0:uses reference PDB atom count)

[OPTION]
distance1 = Marc:1:MET:CA Marc:56:GLU:CA

Let’s execute trj_analysis with this control file. We obtain “output.dis”, in which the result of the distance calculation is written in the 2nd column.

# Analyze the time course of the distance
$ /home/user/GENESIS/bin/trj_analysis INP > log
$ ls
INP log output.dis
# Plot the time course of the distance
$ gnuplot
gnuplot> set encoding iso
gnuplot> set xlabel 'Time (ps)'
gnuplot> set ylabel 'Distance (\305)'
gnuplot> plot 'output.dis' u ($1/2):2 t "N-term CA to C-term CA distance" with lines

We can see that the distance fluctuates around 10 Å. This is a simple example of the trj_analysis tool; other measurements, such as dihedral angles, can also be performed. If the available measurements are not sufficient, you need to prepare your own program.

References
1. W. D. Cornell et al., J. Am. Chem. Soc., 117, 5179–5197 (1995).
2. P. C. Whitford et al., Proteins, 75, 430–441 (2009).
3. J. K. Noel et al., Nucleic Acids Res., 38, W657–W661 (2010).

5. Cryo-EM flexible fitting using a simulated map

Cryo-electron microscopy (cryo-EM) is a powerful tool to determine three-dimensional structures of biomolecules at near-atomic resolution. Flexible fitting has been widely used to model the atomic structure from the experimental density map [1]. One of the commonly used methods is MD-based flexible fitting [2]. In GENESIS, flexible fitting based on the cross-correlation coefficient (c.c.) is available in ATDYN and SPDYN [3]. In addition to the conventional methods, flexible fitting with replica-exchange schemes (REUSfit [4]) or with the all-atom Go model (MDfit [5]) is available in GENESIS. Here, we introduce a basic procedure to perform MD-based flexible fitting using a simulated density map with the all-atom Go model.

1. Preparation for flexible fitting MD for demonstration

Obtain the tutorial file (tutorial-5.1.tar.gz).

$ cd Tutorials
$ cp /lustrefs0/workshop/data/prac10genesis/tutorial-5.1.tar.gz ./
$ tar -xvzf tutorial-5.1.tar.gz
$ cd tutorial-5.1
$ ls
1_emmap 2_build 3_minimize 4_fitting 5_analysis

Making a simulated density map of the target state for demonstration

Before dealing with a “real experimental” density map, let’s practice flexible fitting using a “simulated” density map. Here, we employ the glucose/galactose binding protein (GGBP). Recently, the open and closed states of GGBP have been solved by X-ray crystallography. We create a target density map from the crystal structure of the open state (PDB: 2FW0) and carry out the flexible fitting starting from the closed state (PDB: 2FVY). Therefore, we know the “answer” of the fitting.

First, we download the PDB file of the open state (PDB: 2FW0) and create a processed PDB file that contains only protein heavy atoms by using VMD.

# Download PDB file of the open state of GGBP
$ cd 1_emmap
$ wget https://files.rcsb.org/download/2FW0.pdb
# Generate a new PDB file that contains protein heavy atoms
$ vmd -dispdev text 2FW0.pdb
vmd > set sel [atomselect top "protein and not hydrogen and not altloc B"]
vmd > $sel writepdb open.pdb
vmd > exit


To generate a synthetic density map from open.pdb, we use emmap_generator from the GENESIS analysis tools. We generate a sample control file INP using the “-h ctrl” option and then edit the file.

# Generate a control file of emmap_generator
$ ../../GENESIS/bin/emmap_generator -h ctrl > INP
$ ls
2FW0.pdb INP open.pdb

Let’s edit INP. We specify the following parameters in the control file. We create a map at 5 Å resolution (sigma = 2.5 Å), with the file format set to SITUS. (Currently, GENESIS accepts only the Situs format. The map2map utility in Situs can be used to convert from other common formats.) The auto_margin option adds a margin between the map edge and the minimum or maximum coordinates of the protein atoms according to the specified margin_size. Note that the original PDB file (2FW0.pdb) cannot be used as input for this tool.

# Control file of emmap_generator
#
[INPUT]
pdbfile = open.pdb

[OUTPUT]
mapfile = open.sit

[OPTION]
check_only = NO # (YES/NO)
allow_backup = NO # (YES/NO)
map_format = SITUS # (SITUS)
auto_margin = YES # (YES/NO)
margin_size_x = 20.0 # margin size
margin_size_y = 20.0 # margin size
margin_size_z = 20.0 # margin size
voxel_size = 1.0 # voxel size
sigma = 2.5 # resolution parameter
tolerance = 0.001 # tolerance (0.001 = 0.1%)

After running emmap_generator with INP, let’s view the generated density map and open.pdb by using VMD. The following is an example of loading the map file from the command line. This can also be done with the mouse as follows: [File > New Molecule > Browse > Select “open.sit” > Load].

# Generate synthetic EM density map (open.sit)
$ ../../GENESIS/bin/emmap_generator INP > log
$ ls
2FW0.pdb INP log open.pdb open.sit
# Visualization by using VMD
$ vmd -situs ../1_emmap/open.sit
vmd > mol load pdb open.pdb


Comparison between the PDB and synthetic density map.

Let’s change the representation of the density map by selecting “Wireframe” or by moving the “Isovalue” slider, which shows the absolute value of the density data.

2. Building the initial (unfitted) structure

Second, we prepare the initial structure of the closed state (PDB: 2FVY). Again, we download the PDB file and create a processed PDB file that contains only protein heavy atoms:

# Download PDB file of the closed state of GGBP
$ cd ../2_build/
$ wget https://files.rcsb.org/download/2FVY.pdb
$ ls
2FVY.pdb
# Generate PDB file containing protein heavy atoms
$ vmd -dispdev text 2FVY.pdb
vmd > set sel [atomselect top "protein and not hydrogen and not altloc B"]
vmd > $sel writepdb proa.pdb
vmd > exit
$ ls
2FVY.pdb proa.pdb

Let’s visualize proa.pdb and open.sit by using VMD. We can see that the position of the initial structure is significantly shifted from the target density. Before the flexible fitting, we have to fit the initial structure to the map as well as possible with a rigid docking protocol. For this purpose, we use the colores program in SITUS (see the Situs documentation for details). We run the following command, where a resolution of 5 Å and 4 CPU cores are specified. This calculation may take a few minutes.

# Perform rigid docking using SITUS
$ /opt/ohpc/pub/apps/Situs_3.1/bin/colores ../1_emmap/open.sit proa.pdb -res 5 -nprocs 4

We obtain several PDB files, named col_best_00X.pdb, and should select the best model among them. The cross-correlation coefficient (c.c.) between the target and simulated densities is written in the header information of each PDB file. We see that col_best_001.pdb shows the highest c.c., which means that this PDB is the best-fitted model.

# Check the c.c. value in the obtained PDB files
$ ls col_*.pdb
col_best_001.pdb col_best_003.pdb col_best_005.pdb col_best_007.pdb
col_best_002.pdb col_best_004.pdb col_best_006.pdb
$ grep "Unnormalized correlation coefficient:" col_*.pdb
col_best_001.pdb:REMARK Unnormalized correlation coefficient: 0.876832
col_best_002.pdb:REMARK Unnormalized correlation coefficient: 0.798847
col_best_003.pdb:REMARK Unnormalized correlation coefficient: 0.798821
col_best_004.pdb:REMARK Unnormalized correlation coefficient: 0.764940
col_best_005.pdb:REMARK Unnormalized correlation coefficient: 0.751088
col_best_006.pdb:REMARK Unnormalized correlation coefficient: 0.750453
col_best_007.pdb:REMARK Unnormalized correlation coefficient: 0.405601
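Picking the model with the highest c.c. by eye works for seven files, but the selection can also be scripted. The following is a sketch only: best_colores_model is a hypothetical helper name, and it assumes each PDB file carries a REMARK line of the form shown in the grep output above, with the c.c. as the last field.

```shell
# Hypothetical helper (not part of Situs or GENESIS).
# best_colores_model FILE...: print the file with the highest
# "Unnormalized correlation coefficient" and that coefficient.
best_colores_model() {
    awk '
        /Unnormalized correlation coefficient:/ {
            # The coefficient is assumed to be the last field of the REMARK line.
            if (best == "" || $NF + 0 > max) {
                max = $NF + 0; value = $NF; best = FILENAME
            }
        }
        END { if (best != "") print best, value }
    ' "$@"
}
```

For the values listed above, `best_colores_model col_best_*.pdb` would report col_best_001.pdb with 0.876832. A visual check in VMD is still advisable, since a high c.c. alone does not guarantee a sensible placement.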

Let’s compare col_best_001.pdb and open.sit by using VMD to check whether the protein is actually superimposed onto the map. We use col_best_001.pdb as the initial coordinates for the flexible fitting.

We will use the all-atom Go model to perform the flexible fitting. Create the parameter files using the SMOG server, as was done in the previous tutorial.

3. Flexible fitting

Then, we carry out the flexible fitting. The sample control file is already included in the directory.

# Change directory to run the simulation
$ cd ../3_fitting
$ ls
INP
# Take a look at the control file
$ less INP

The following shows the most important parts of the control file. You can see that flexible fitting is a kind of “restrained MD simulation”, and most parameters are shared with conventional MD simulations. We apply the biasing potential to the protein heavy atoms, where the force constant of the bias is set to 5,000 kcal/mol. emfit_sigma is a resolution parameter, which is usually set to half of the resolution of the target density map. With this force constant, 5000 MD steps are sufficient, but some complex conformational transitions require many more steps. Here we set the temperature to 20 (reduced units) so that the protein does not unfold during the fitting process.

[INPUT]
grotopfile = ../2_build/ff.25165.pdb.sb/ff.25165.pdb.top # topology file
grocrdfile = ../2_build/ff.25165.pdb.sb/ff.25165.pdb.gro # initial coordinate file

[DYNAMICS]
integrator = LEAP # [LEAP,VVER]
nsteps = 5000 # number of MD steps
timestep = 0.0005 # timestep (ps)
eneout_period = 100 # energy output period
rstout_period = 100 # restart output period
crdout_period = 100 # coordinates output period
nbupdate_period = 100 # nonbond update period

[ENSEMBLE]
ensemble = NVT # constant temperature
tpcontrol = LANGEVIN # Langevin thermostat
temperature = 20
gamma_t = 1 # friction coefficient (ps^-1)

[SELECTION]
group1 = all and not hydrogen

[RESTRAINTS]
nfunctions = 1
function1 = EM # apply restraints from EM density map
constant1 = 5000 # force constant in Ebias = k*(1 - c.c.)
select_index1 = 1 # apply restraint force on protein heavy atoms

[EXPERIMENTS]
emfit = YES # perform EM flexible fitting
emfit_target = ../1_emmap/open.sit # target EM density map
emfit_sigma = 2.5 # half of the map resolution (5 A)
emfit_tolerance = 0.001 # Tolerance for error (0.1%)
emfit_period = 1 # emfit force update period

We execute ATDYN with INP. The following example runs mpirun on 12 CPU cores (4 MPI processes × 3 OpenMP threads):

# Perform flexible fitting
$ export OMP_NUM_THREADS=3
$ mpirun -np 4 /home/user/GENESIS/bin/atdyn INP > log
$ ls
INP log run.dcd run.pdb run.rst

After the calculation, we obtain run.dcd, run.pdb, and run.rst, where the PDB file is the structure at the end of the run (5000 steps). The log file contains the time courses of the energy and the c.c. (the RESTR_CVS001 column). Let’s view the trajectory by using VMD. You can see that the structure is well fitted to the target density.

$ vmd -situs ../1_emmap/open.sit
vmd > mol load gro ../2_build/ff.25165.pdb.sb/ff.25165.pdb.gro
vmd > mol addfile run.dcd

Comparison between the Initial (green) and fitted (red) structures.

4. Trajectory analysis

Cross-correlation coefficient and RMSD

First, we analyze the time course of the c.c., which can be easily obtained with the following command. We select the value in column 16 (RESTR_CVS001) of the log file. You can see that the c.c. increases during the fitting.

# Change directory for analysis
$ cd ../4_analysis
# Analyze time courses of c.c.
$ grep "INFO:" ../3_fitting/log | awk '{print $2, $16}' | grep -v "RESTR" > cc.log
$ less cc.log
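The column position of RESTR_CVS001 depends on which energy terms the log contains, so looking the column up by name is a more robust alternative to a fixed column number. The sketch below assumes that the header line of the log also starts with "INFO:" and lists the column names (as the `grep -v "RESTR"` in the command above implies); log_column is a hypothetical helper name, not a GENESIS tool.

```shell
# Hypothetical helper (not part of GENESIS).
# log_column LOGFILE NAME: print the step (2nd field) and the column labelled
# NAME for every "INFO:" data line of a log file.
log_column() {
    awk -v name="$2" '
        $1 != "INFO:" { next }
        col == 0 {                    # still searching for the header line
            for (i = 2; i <= NF; i++) if ($i == name) col = i
            next
        }
        $col == name { next }         # skip a repeated header line
        { print $2, $col }
    ' "$1"
}
```

A possible use is `log_column ../3_fitting/log RESTR_CVS001 > cc.log`, which would produce a two-column file like the one generated by the grep and awk pipeline, provided the log follows the assumed layout.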


Second, we analyze the RMSD with respect to the target structure, because we know the correct “answer” in this flexible fitting. Here, we use rmsd_analysis from the GENESIS analysis tool set. We use target.pdb as the reference coordinates. Please confirm that the number of atoms as well as the order of atoms in target.pdb and closed.pdb are identical; otherwise the atom selection or the RMSD calculation cannot be done correctly.

# Analysis of the RMSD
$ ../../../GENESIS/bin/rmsd_analysis INP
$ ls
INP cc.log output.rms

We obtain output.rms. Let us plot the RMSD and c.c. by using gnuplot. We can see that the RMSD with respect to the target state decreases quickly as the c.c. increases. The c.c. eventually converges within 5000 steps.

# Plot the c.c. and RMSD data
$ gnuplot
gnuplot> set encoding iso
gnuplot> set y2tics
gnuplot> set ytics nomirror
gnuplot> set xlabel "Steps"
gnuplot> set ylabel "c.c."
gnuplot> set y2label "RMSD (\305)"
gnuplot> unset key
gnuplot> plot 'cc.log' u ($0*100):2 w l,'output.rms' u ($0*100):2 w l axis x1y2

Time courses of c.c. (purple line) and Cα RMSD with respect to the target structure (green line).

MolProbity score

Finally, we evaluate the quality of the obtained structure. One commonly used criterion is the MolProbity score [7], which represents how good the structure is as a protein. It can be computed with the GUI server. Let’s access the MolProbity server, http://molprobity.biochem.duke.edu.

[Figure: c.c. (left axis, 0.86 to 1) and RMSD in Å (right axis, 0 to 4) plotted against Steps (0 to 5000).]


1. Upload run.pdb obtained from the flexible fitting.
2. Run “Add hydrogens”.
3. After this process, click “Analyze geometry without all-atom contacts”.
4. We found that the MolProbity score of run.pdb was 1.69.

There is still room to improve the structural quality. If we use a simulated annealing protocol in the flexible fitting, or carry out an energy minimization after the flexible fitting, the score should improve.

5. Survey of biasing force constants

In the previous tutorial, we performed flexible fitting with the all-atom Go model, using a biasing force toward the cryo-EM volume map. One critical parameter is the biasing force constant (along with sampling length and efficiency) [4]. The force needs to be strong enough to obtain a fitted model, but forces that are too strong can induce structural distortions. Therefore, we select an appropriate force constant by performing flexible fitting with a range of force constants. Here we demonstrate a simple force constant survey procedure.


We prepare input files with multiple force constants and repeat the calculation for each (a script is provided).

$ cd ../5_force_survey
# testing 6 force constants
$ for k in 1000 3000 5000 7000 9000 11000
$ do
$ sed /constant1/s/5000/$k/ ../3_fitting/INP > INP.$k
$ export OMP_NUM_THREADS=3
$ mpirun -np 4 ../../../GENESIS/bin/atdyn INP.$k >& log.$k
$ mv run.dcd run.$k.dcd
$ mv run.rst run.$k.rst
$ mv run.pdb run.$k.pdb
$ done

Next, we extract the c.c. and RMSD for each run and assemble the data for gnuplot (a script is provided).

$ echo "" > last_frame.dat
$ for k in 1000 3000 5000 7000 9000 11000
$ do
$ grep INFO log.$k | tail -n +3 >& energy.$k
$ sed -e s%../3_fitting/run.dcd%run.$k.dcd% -e \
s/output.rms/output.$k.rms/ ../4_analysis/INP > rmsd.$k.inp
$ ../../../GENESIS/bin/rmsd_analysis rmsd.$k.inp > rmsd.$k.log
# combine c.c. and rmsd data
$ paste energy.$k output.$k.rms > comb.$k.dat
# assemble the data for the last frame
$ echo $k > a
$ tail -n 1 comb.$k.dat | sed s/INFO:// > b
$ paste a b >> last_frame.dat
$ done

We now examine the trajectories of the c.c. and RMSD for each run. The gnuplot scripts cc_traj.gnuplot and rmsd_traj.gnuplot are provided.

$ gnuplot
gnuplot> load 'cc_traj.gnuplot'
gnuplot> load 'rmsd_traj.gnuplot'


We can see that the convergence of the c.c. is strongly affected by the force constant, whereas the RMSD is less affected. Now we focus on the last frame of the fitting and plot how the force constant affects the final c.c. and RMSD values (a gnuplot script is provided).

$ gnuplot
gnuplot> load 'k_vs_cc_rmsd.gnuplot'

From this plot of the final c.c. versus force constant, we would select about 5000, since the c.c. values do not improve much beyond that force constant. Indeed, the RMSD value is sufficiently low at that point. (Note that in actual applications, without a known final model, we cannot calculate the RMSD.)

Lastly, we examine the MolProbity scores of a few resulting models. In the MolProbity statistics, several properties are worse for the model with k = 11000 than for the model with k = 5000, which clearly demonstrates that unnecessarily strong force constants can cause artificial distortion in the model.
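The "diminishing returns" reading of the plot can be turned into a simple heuristic: take the smallest force constant after which the final c.c. gains less than some tolerance. This is not the procedure used in the tutorial, only an illustrative sketch; pick_force_constant is a hypothetical helper name, the tolerance is a free choice, and the input is assumed to be a two-column file (force constant, final c.c.) that you extract yourself from last_frame.dat.

```shell
# Hypothetical helper (an illustrative heuristic, not the tutorial's method).
# pick_force_constant FILE TOL: FILE holds "k final_cc" pairs; print the
# smallest k whose successor improves the final c.c. by less than TOL.
pick_force_constant() {
    sort -n "$1" | awk -v tol="$2" '
        NF < 2 { next }                       # skip blank or incomplete lines
        seen && ($2 - prev_cc) < tol {        # gain over previous k is small
            print prev_k; found = 1; exit
        }
        { prev_k = $1; prev_cc = $2; seen = 1 }
        END { if (!found && seen) print prev_k }   # no plateau: largest k
    '
}
```

A possible use is `pick_force_constant k_cc.dat 0.001`, where k_cc.dat is the assumed two-column file. The result should be cross-checked against the MolProbity statistics, since the c.c. alone does not reveal structural distortion.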

[Figure: c.c. versus steps for k = 1000, 3000, 5000, 7000, 9000, and 11000.]

[Figure: RMSD (Å) versus steps for k = 1000, 3000, 5000, 7000, 9000, and 11000.]

[Figure: final c.c. and RMSD (Å) versus biasing force constant k.]


MolProbity score of the fitted model using k=11000

MolProbity score of the fitted model using k=1000

A few last remarks:

• An additional factor we need to consider is the sampling efficiency. Previously, we found that for complex conformational transitions, a large number of fitting trials needs to be performed to obtain reliable and better-fitted models.

• In this fitting tutorial, we used the all-atom Go model. However, it is also possible to use the standard CHARMM or AMBER potentials, in explicit solvent or with the Generalized Born implicit solvent model. See https://www.r-ccs.riken.jp/labs/cbrt/tutorials2019/tutorial-17-1/


References
1. F. Tama et al., J. Mol. Biol., 337, 985–999 (2004).
2. M. Orzechowski and F. Tama, Biophys. J., 95, 5692–5705 (2008).
3. T. Mori et al., Structure, 27, 161–174.e3 (2019).
4. O. Miyashita et al., J. Comput. Chem., 38, 1447–1461 (2017).
5. P. Whitford et al., PNAS, 108, 18943–18948 (2011).