Target selection and current status of structural genomics for the completed microbial genomes
3.1 Introduction
  3.1.1 Target Selection
  3.1.2 Expected Results
  3.1.3 Limitations
3.2 Structural status of completed microbial genomes in the PDB
3.3 Metabolic pathways as targets for structural genomics
  3.3.1 Glycolytic pathway
  3.3.2 Glyoxylate pathway
  3.3.3 The aromatic amino acid biosynthetic pathway
3.4 Conclusion
3.5 References
3.1 Introduction
Structural biology helps researchers understand biological macromolecules in far greater detail than the primary sequence alone allows. With the genomes of many microbes and other organisms now completed, large-scale efforts have been undertaken to solve the crystal structures of the proteins encoded by each organism. The aim of structural genomics is thus to start from the gene sequence, produce the protein and determine its three-dimensional structure, for each of the completed microbial genomes.
Once a structure is determined, it yields a wealth of information, such as the residues that form the active site and the residues that are important for stability. This information can then be used to design novel drugs and to engineer proteins for the benefit of mankind. The academic interest in structural genomics includes the identification of new folds and the classification of proteins into existing folds. Uncovering a new fold has always been exciting for structural biologists, because it expands our knowledge of protein fold space.
The principle of structural genomics is to determine the structure first and then explore the mechanism by which the protein acts. This differs from classical structural biology, where a protein of interest is chosen and its structure is determined specifically to elucidate the molecular mechanism by which it acts. Some of the key steps involved are described in the following subsections.
3.1.1 Target Selection
An important question that must be addressed is the selection of the target, i.e. which protein from the organism is to be studied and why. Large-scale structural genomics efforts give relatively little weight to this question, whereas small-scale initiatives pay a lot of attention to it, because the large-scale initiatives are well funded and their immediate goals are different.
A protein is usually selected for study for one of three reasons: it is an important virulence factor that can serve as a potential drug target, it is an industrially important protein that catalyses a novel reaction, or its molecular mechanism is of interest in its own right.
Another approach to target selection is to obtain at least one representative structure for each enzyme of the known metabolic pathways. In this study, we have looked into three metabolic pathways in Mycobacterium tuberculosis and Mycobacterium leprae: the glycolytic pathway, the glyoxylate shunt and the biosynthesis of aromatic amino acids.
Considering these aspects, the major organisms studied so far have been yeast, which serves as a model organism for eukaryotic biology; Mycobacterium tuberculosis, whose proteins can be used as targets for structure-based drug design; and Methanococcus jannaschii, a thermophile that can provide information on the residues and interactions that contribute to thermostability.
Fig. 1: Criteria/reasons for target identification
3.1.2 Expected Results
The expected results depend on the goals set by each structural genomics consortium. These goals include solving a given number of crystal structures within a set period of time, standardising the procedure for solving crystal structures, analysing the structures to gain a better understanding of the biology of the organism, and using the structures for structure-based drug design or as subjects for protein engineering.
Other results of interest are the number of unique folds identified in the process, the classification of those folds, the elucidation of the fundamental structural geometry and principles that determine structure, and information that can be used as a constraint when annotating protein function from gene sequences.
For example, as the number of structures increases, analyses of crystal structures such as the CH..O hydrogen bonding analysis explained in Chapter 2 of this report can be carried out with a higher level of confidence. A larger dataset lends statistical support to the observations obtained from such analyses. One also readily comes across cases where the same protein from different sources has different folds. In such cases, strong stereochemical constraints can be used in establishing the function of the protein.
The case of chorismate mutase, a key enzyme in aromatic amino acid biosynthesis, is described in Chapter 4 of this report.
Other information that can be obtained includes the overall structure and surface of the protein, its fold, the cavities present in the structure, the regions exposed to solvent, and the biological unit (whether the protein is a monomer, a dimer and so on).
Careful analysis of shape complementarity can also indicate whether proteins interact with each other. Such predictions can be verified by yeast two-hybrid studies. When the resulting interaction networks are superposed on standard metabolic pathways, they provide valuable information on cross-talk between metabolic pathways. This information would be very important for the metabolic engineering of organisms. It can also validate, and help to interpret, previously studied gene expression profiles under varying conditions.
Fig. 2: Information on metabolic cross-talk, like that shown in the figure, can be obtained by overlaying interaction networks (derived from crystal structures) on standard metabolic pathways.
3.1.3 Limitations
Any initiative has its advantages and limitations. In structural genomics the limitations are diverse. Some relate directly to the techniques for solving structures and to the methods of storing structural information and making it available to the other members of the consortium. Other problems include the sharing of credit, the allotment of funds for building the infrastructure needed to start the project, and personal differences among members of the same consortium.
Fig. 3: Common problems and Limitations in structural genomics initiatives
Another major question is how useful the structural genomics initiative will be. There is no doubt about the results and data it can produce, but what time scale is required to convert the available data into useful information? What will be the real contribution of structural genomics to new drug discovery? These are serious questions that can logically be raised.
Another limitation lies in the targets themselves. Some targets are not easy to work with, especially membrane proteins. In such cases, cryo-electron microscopy will be particularly useful. Taking all of the above limitations into account, the first phase of the initiative will focus mainly on monomeric and homodimeric proteins. Once this succeeds, heterodimers, protein-DNA complexes, protein-RNA complexes and other multimeric complexes may be tackled.
3.2 Structural status of completed microbial genomes in the PDB
The Protein Data Bank (PDB; Bernstein, 1977) has been one of the major initiatives for storing and distributing structural information on proteins. According to the latest release of the PDB, more than 14,000 structures have been deposited from more than 350 different organisms. These include structures solved by X-ray crystallography and NMR, as well as theoretical models. Structures solved by X-ray crystallography outnumber those from the other methods.
The number of structures deposited per year in the PDB, the yearly increase in new folds and the total number of folds are shown below (courtesy: PDB website, www.rcsb.org).
Fig. 4: Statistics of PDB
We searched the PDB using the source organism as the criterion: since we were interested in the number of structures available for the completed microbial genomes, we used the name of each organism as the search term. The hits, however, included redundant structures, because structures of mutants of the same protein, fragments of the same protein and so on had also been deposited. The raw count therefore does not represent the number of unique protein crystal structures solved. We removed the redundant structures from this first filter to obtain the number of unique structures.
The analysis showed that, among the completed microbial genomes, E. coli ranks first both in the total number of structures solved (1387) and in the number of non-redundant structures (433). E. coli is followed by Bacillus subtilis, with 92 deposited structures of which 45 are non-redundant, and Pseudomonas aeruginosa, with 75 deposited structures of which 21 are non-redundant.
For most of the other genomes, such as Campylobacter jejuni and Mycoplasma genitalium, not a single crystal structure is known, and for important pathogens such as Vibrio cholerae and Helicobacter pylori only 4 and 1 unique structures are known, respectively. This gives a clear picture of the status of structural genomics for the completed genomes and shows that many important and relevant organisms need to be studied urgently.
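The redundancy filter described above can be sketched as follows. This is only a minimal illustration and not the procedure actually used for the counts in this chapter: it assumes that two entries are redundant when they come from the same organism and carry the same protein name. The records in `entries` (including the placeholder IDs and the names "Protein A" and "Protein B") and the function `count_structures` are hypothetical.

```python
from collections import defaultdict

# Hypothetical search hits: (entry ID, source organism, protein name).
# The IDs are placeholders, not real PDB codes.
entries = [
    ("id01", "Halobacterium", "Bacteriorhodopsin"),
    ("id02", "Halobacterium", "bacteriorhodopsin"),   # e.g. a mutant of the same protein
    ("id03", "Halobacterium", "Bacteriorhodopsin "),  # e.g. a fragment of the same protein
    ("id04", "Organism X", "Protein A"),
    ("id05", "Organism Y", "Protein A"),
    ("id06", "Organism Y", "Protein B"),
    ("id07", "Organism Y", "protein a"),
]


def count_structures(records):
    """Return {organism: (deposited, non_redundant)} for a list of records."""
    deposited = defaultdict(int)
    unique_proteins = defaultdict(set)
    for _entry_id, organism, protein in records:
        deposited[organism] += 1
        # Assumed redundancy key: case- and whitespace-insensitive protein name.
        unique_proteins[organism].add(protein.strip().lower())
    return {org: (deposited[org], len(unique_proteins[org])) for org in deposited}


counts = count_structures(entries)
for organism, (total, unique) in sorted(counts.items()):
    print(f"{organism}: {total} deposited, {unique} non-redundant")
```

A name-based key is only the simplest choice. Clustering entries by sequence identity would be a more robust criterion, at the cost of needing the sequences themselves.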
Fig. 5: Bar graph comparing the number of redundant and non-redundant structures for the completed microbial genomes. The E. coli structures are not included here.
Interestingly, in the case of Halobacterium, 17 structures had been deposited but only one was unique: the well-characterised protein bacteriorhodopsin. Another example is the lysozyme from bacteriophage T4, on which extensive mutational studies have been carried out.
For pathogens relevant to the Indian scenario, such as M. tuberculosis and M. leprae, only 10 and 2 unique structures are known, respectively. This shows that scientific organisations and institutions need to put a concerted and directed effort into rational target identification for structure determination.
In Chapter 4, we suggest that chorismate mutase (CM) from Mycobacterium tuberculosis and Mycobacterium leprae could be a very good target, because the information can be used immediately for structure-based drug design. We also emphasise chorismate mutase because this enzyme is not present in humans and its sequence is divergent enough among organisms to allow the development of organism-specific compounds.
Fig. 6: Percentage of ORFs from each organism for which a crystal structure is known. For example, in E. coli, for which the percentage is highest, structures have been solved for only 10.1 % of the 4289 identified ORFs (i.e. 433 structures).
The table below summarises, for each completed microbial genome, the genome size, the number of ORFs identified, and the number of non-redundant and redundant structures solved.
Organism                Bases (bp)  ORFs  Non-redundant  Redundant
A. pernix               1669695     2694  0              0
A. fulgidus             2178400     2420  0              0
Halobacterium           2014239     2058  1              17
M. jannaschii           1664970     1715  15             16
M. thermoautotrophicum  1751377     1869  11             16
P. abyssi               1765118     1765  0              0
P. horikoshii           1738505     2064  0              0
T. acidophilum          1564906     1478  3              8
A. aeolicus             1551335     1522  2              4
B. subtilis             4214814     4100  45             92
B. halodurans           4202353     4066  0              0
B. burgdorferi          910724      850   3              3
Buchnera                640681      564   0              0
C. jejuni               1641481     1654  0              0
C. muridarum            1069411     909   0              0
C. trachomatis          1042519     894   0              0
C. pneumoniae           1229858     1110  0              0
D. radiodurans          2648638     2580  0              0
E. coli                 4639221     4289  433            1387
H. influenzae           1830138     1709  9              14
H. pylori               1667867     1553  1              1
L. lactis               2365589     2266  3              10
M. tuberculosis         4411529     3918  10             19
M. leprae               3268203     1604  2              2
M. genitalium           580074      481   0              0
M. pneumoniae           816394      688   0              0
N. meningitidis         2272351     2025  5              6
P. multocida            2257487     2014  0              0
P. aeruginosa           6264403     5565  21             75
R. prowazekii           1111523     834   0              0
Synechocystis           3573470     3169  5              9
T. pallidum             1138011     1031  1              1
T. maritima             1860725     1846  24             32
U. urealyticum          751719      611   0              0
V. cholerae             2961149     2736  4              11
X. fastidiosa           2679306     2766  0              0
Table 1: Current status of structural genomics in the PDB for the completed microbial genomes
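The percentages plotted in Fig. 6 follow directly from the ORF and non-redundant structure counts in Table 1. The sketch below recomputes them for a handful of organisms. The function name `coverage_percent` is an illustrative choice, and the calculation assumes that each non-redundant structure corresponds to one distinct ORF.

```python
# (ORFs, non-redundant structures) for a few organisms, copied from Table 1.
table1 = {
    "E. coli": (4289, 433),
    "B. subtilis": (4100, 45),
    "P. aeruginosa": (5565, 21),
    "M. tuberculosis": (3918, 10),
    "M. leprae": (1604, 2),
    "H. pylori": (1553, 1),
}


def coverage_percent(orfs, unique_structures):
    """Percentage of ORFs with a solved structure (one structure per ORF assumed)."""
    return 100.0 * unique_structures / orfs


# Rank organisms from highest to lowest structural coverage.
ranked = sorted(
    ((org, coverage_percent(orfs, unique)) for org, (orfs, unique) in table1.items()),
    key=lambda item: item[1],
    reverse=True,
)
for organism, percent in ranked:
    print(f"{organism:<16} {percent:6.2f} %")
```

For E. coli this gives 433 / 4289, which rounds to the 10.1 % quoted in the caption of Fig. 6. Every other organism in this subset falls well below that figure.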
In the remainder of this chapter, we discuss three pathways in M. tuberculosis (Cole, 1998) and M. leprae (Cole, 2001). The figures below show all the known crystal structures for the two organisms.
Fig. 7: Known crystal structures from M. leprae
Fig. 8: Known Crystal structures from M. tuberculosis
3.3 Metabolic pathways as targets for structural genomics
The introduction to this chapter discussed various criteria that can be used for target selection. Metabolic pathways can also be treated as targets for structural genomics. We chose three metabolic pathways and studied them both in general and specifically in M. tuberculosis and M. leprae: the well-studied glycolytic pathway, the glyoxylate shunt and the aromatic amino acid biosynthetic pathway.
We wanted to find out whether a representative structure has been solved for each of the enzymes in these pathways. We therefore searched the PDB for each enzyme, and also looked specifically for structures from M. tuberculosis and M. leprae.
3.3.1 Glycolytic pathway
This pathway was chosen because it has been studied thoroughly at the biochemical level. There are 10 enzymes in the pathway and, as expected, each of them has at least one representative structure from some organism.
Fig. 9: The glycolytic pathway with representative structures for each enzyme:
When the Mycobacterium tuberculosis and Mycobacterium leprae genomes were searched for these 10 proteins, the enzymes were found to be identified and annotated (with the exception of hexokinase, which is marked "?" in Table 2), but no crystal structure has been solved from either organism. This pathway could therefore be very interesting and worth pursuing in a structural genomics initiative: its key enzymes could serve as very good drug targets, since any drug that interferes with the pathway could prove fatal for the organism.
                                               M. tuberculosis                       M. leprae
No.  Enzyme                                    Sequence                   Structure  Sequence        Structure
1    Hexokinase                                ?                          No         ?               No
2    Phosphoglucoisomerase                     Rv0946c                    No         ML0150          No
3    Phosphofructokinase                       Rv3010c (I), Rv2029c (II)  No         ML1701          No
4    Aldolase                                  Rv0363c                    No         ML0286          No
5    Triosephosphate isomerase                 Rv1438                     No         ML0572          No
6    Glyceraldehyde-3-phosphate dehydrogenase  Rv1436                     No         ML0570          No
7    Phosphoglycerate kinase                   Rv1437                     No         ML0571          No
8    Phosphoglycerate mutase                   Rv0489                     No         ML2441          No
9    Enolase                                   Rv1023                     No         ML0255          No
10   Pyruvate kinase                           Rv1617                     No         ML1277          No
Table 2: Status of the glycolytic pathway in M. tuberculosis and M. leprae with respect to annotation and to the structures solved.
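The contents of Table 2 can be tallied with a short script to list the enzymes that are annotated in both genomes but still lack a structure, i.e. the candidate targets. This is a sketch under stated assumptions: the gene identifiers are transcribed from Table 2, a "?" entry is represented by an empty tuple, a single flag records whether a structure exists from either organism, and the names `glycolysis`, `candidate_targets` and `unannotated` are illustrative.

```python
# Rows of Table 2: (enzyme, M. tuberculosis genes, M. leprae genes, structure solved?).
# An empty tuple stands for a "?" entry in the table.
glycolysis = [
    ("Hexokinase", (), (), False),
    ("Phosphoglucoisomerase", ("Rv0946c",), ("ML0150",), False),
    ("Phosphofructokinase", ("Rv3010c", "Rv2029c"), ("ML1701",), False),
    ("Aldolase", ("Rv0363c",), ("ML0286",), False),
    ("Triosephosphate isomerase", ("Rv1438",), ("ML0572",), False),
    ("Glyceraldehyde-3-phosphate dehydrogenase", ("Rv1436",), ("ML0570",), False),
    ("Phosphoglycerate kinase", ("Rv1437",), ("ML0571",), False),
    ("Phosphoglycerate mutase", ("Rv0489",), ("ML2441",), False),
    ("Enolase", ("Rv1023",), ("ML0255",), False),
    ("Pyruvate kinase", ("Rv1617",), ("ML1277",), False),
]


def candidate_targets(pathway):
    """Enzymes annotated in both genomes for which no structure has been solved."""
    return [name for name, mtb, mlep, solved in pathway if mtb and mlep and not solved]


def unannotated(pathway):
    """Enzymes with no gene listed in at least one of the two genomes."""
    return [name for name, mtb, mlep, _solved in pathway if not (mtb and mlep)]


targets = candidate_targets(glycolysis)
print(f"{len(targets)} of {len(glycolysis)} enzymes are annotated but lack a structure")
print("No annotated sequence listed for:", ", ".join(unannotated(glycolysis)))
```

The same row layout would work for the other two pathways, with the structure flag set to True wherever a PDB entry is listed.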
The table shows that no structures from this pathway have been solved for M. tuberculosis or M. leprae. Since these organisms are relevant to India, this pathway could be a very good starting point for a combined effort. Moreover, since only 2 structures have been solved from the M. leprae genome, any additional structure from this organism will certainly help in understanding its biology.
One interesting piece of work carried out in this lab is the synthesis of "interface peptides". Triosephosphate isomerase (TIM) is known to be active only as a dimer. A synthetic peptide with the sequence of the dimer interface was able to knock out the activity of TIM, which suggests a potential strategy for drug development (though much more study is required before lead compounds can be identified). These are some of the ways in which structural genomics can provide input.
3.3.2 Glyoxylate pathway
The glyoxylate cycle is a modification of the citric acid cycle and constitutes a specialised anabolic pathway in certain organisms, especially oily seed plants. In plants, bacteria and yeast, but not in animals, two-carbon molecules such as ethanol or acetate are converted to four-carbon molecules, and ultimately to glucose, by the glyoxylate cycle. Cells that contain the glyoxylate cycle enzymes can synthesise all their required carbohydrates from any substrate that is a precursor of acetyl CoA.
Since some of the steps form part of the TCA cycle, only the enzymes specific to the shunt can be targeted without disturbing the TCA cycle. These are isocitrate lyase and malate synthase. In fact, the only enzyme structure from this pathway to be solved in M. tuberculosis is that of isocitrate lyase. No other structures are known in either M. tuberculosis or M. leprae.
Fig. 10: The glyoxylate pathway with representative structures for each enzyme
                                               M. tuberculosis                       M. leprae
No.  Enzyme                                    Sequence                   Structure  Sequence        Structure
1    Citrate synthase                          Rv0896                     No         ML2130          No
2    Aconitase                                 Rv1475c                    No         ML1814          No
3    Isocitrate lyase                          Rv0467, Rv1915, Rv1916     1F61       ML1985          No
4    Malate synthase                           Rv1837c                    No         ML2069          No
5    Malate dehydrogenase                      Rv1240                     No         ML1091          No
Table 3: Status of annotation in the M. tuberculosis and M. leprae genomes for the enzymes of the glyoxylate shunt, and the structures solved for them.
3.3.3 The aromatic amino acid biosynthetic pathway
The aromatic amino acid biosynthetic pathway is very interesting because it, too, is absent from humans. Humans cannot synthesise the aromatic amino acids de novo, so these must be obtained from the diet. Microorganisms do have this pathway, which makes it an excellent target for structural genomics, since the structures will inform structure-based drug design against the enzymes in the pathway. Because some of the proteins, such as chorismate mutase (explained in Chapter 4), are diverse among organisms, it may be possible to tailor drugs to specific organisms.
Fig. 11: The aromatic amino acid biosynthetic pathway with representative structures for each enzyme.
When the PDB was searched for the enzymes in the pathway, it was surprising to find that three enzymes in this pathway have no representative structure at all. These are anthranilate phosphoribosyltransferase, prephenate dehydrogenase and prephenate dehydratase. The first is involved in the biosynthesis of tryptophan and the other two in the biosynthesis of phenylalanine and tyrosine. These three enzymes could therefore be very good targets for structure determination.
The same analysis as for the other two pathways was carried out, and the following table summarises the enzymes annotated in the M. tuberculosis and M. leprae genomes. Again, there are no structures from M. tuberculosis or M. leprae for the enzymes in this pathway.
                                               M. tuberculosis                       M. leprae
No.  Enzyme                                    Sequence                   Structure  Sequence        Structure
1    Chorismate mutase                         ?                          No         ?               No
2    Prephenate dehydrogenase                  Rv3754                     No         ML2472          No
3    Prephenate dehydratase                    Rv3838c                    No         ML0078          No
4    Anthranilate synthase                     Rv1609                     No                         No
5    Anthranilate phosphoribosyltransferase    Rv2192c                    No         ML0883          No
6    Phosphoribosyl anthranilate isomerase     ?                          No         ?               No
7    Indole-3-glycerol phosphate synthase      Rv1611                     No         ML1271          No
8    Tryptophan synthase                       Rv1612, Rv1613             No         ML1272, ML1273  No
Table 4: Status of the aromatic amino acid biosynthetic pathway in M. tuberculosis and M. leprae with respect to annotation and to the structures solved.
3.4 Conclusion
The completion of the genome sequencing of pathogens is the starting phase for structural genomics. Structural genomics, and structures in themselves, are interesting because nature has not followed any strict rules. There are examples where the same enzyme has different structures (chorismate mutase, Fig. 12a), cases where different proteins have the same fold (retinol binding protein and biliverdin binding protein, Fig. 12b) and, as the most obvious case, the same protein having the same structure across various organisms (triosephosphate isomerase, Fig. 12c).
It will certainly be exciting to come across new folds. Since this is only the beginning, structural genomics projects will keep many biologists at work for years to come.
Fig. 12a: Chorismate mutase from B. subtilis (top) and E. coli (bottom)
Fig. 12b: Retinol binding protein (top) and biliverdin binding protein (bottom)
Fig. 12c: Triosephosphate isomerase from Plasmodium falciparum (top) and Leishmania mexicana (bottom)
3.5 References
Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F. Jr, Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. and Tasumi, M., The Protein Data Bank: a computer-based archival file for macromolecular structures, J. Mol. Biol., 1977, 112, 535-542.
Cole, S.T., et al., Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence, Nature, 1998, 393(6685), 537-544.
Cole, S.T., et al., Massive gene decay in the leprosy bacillus, Nature, 2001, 409(6823), 1007-1011.