progetto_final


Upload: enrico-bagnoli

Post on 18-Aug-2015


Prediction of the structure of the genome of HIV

Enrico Bagnoli, Giacomo Urbini

Abstract

DNA is made of four bases: adenine, cytosine, guanine and thymine. The sequence of these bases encodes the information needed to build and maintain an organism. The pattern in which the four bases alternate along the sequence is not random; rather, the DNA can be seen as a sequence of segments, each with its own specific distribution of bases. The aim of this work is to model the complete genome of HIV as a set of segments drawn from various types of homogeneous segments. The first part of the project focuses on the search for the best model. The selected model is then used to analyze the complete genome of the HIV virus.

I. INTRODUCTION

The human immunodeficiency virus (HIV) is a lentivirus, a subgroup of retrovirus, that causes the acquired immunodeficiency syndrome (AIDS), a condition in humans in which progressive failure of the immune system allows life-threatening opportunistic infections and cancers to thrive. Without treatment, average survival time after infection with HIV is estimated to be 9 to 11 years, depending on the HIV subtype. Infection with HIV occurs by the transfer of blood, semen, vaginal fluid, pre-ejaculate, or breast milk. Within these bodily fluids, HIV is present as both free virus particles and virus within infected immune cells.

HIV is different in structure from other retroviruses. It is roughly spherical with a diameter of about 120 nm, around 60 times smaller than a red blood cell, yet large for a virus. It is composed of two copies of positive single-stranded RNA that codes for the virus's nine genes, enclosed by a conical capsid composed of 2,000 copies of the viral protein p24. The single-stranded RNA is tightly bound to nucleocapsid proteins, p7, and enzymes needed for the development of the virion such as reverse transcriptase, proteases, ribonuclease and integrase. A matrix composed of the viral protein p17 surrounds the capsid, ensuring the integrity of the virion particle.

The great challenge that the scientific community is now facing is to analyze and understand the enormous amount of data produced in the laboratory. An important aid, especially with regard to biological phenomena, is the use of dynamic models. In genetics, several studies have shown that Hidden Markov Models lead to remarkable results in secondary structure prediction, decoding of proteins and segmentation of nucleotide sequences. The alternation of bases along the sequence is not random, and the pattern determines the synthesis of the proteins necessary for the survival of the virus. Only a small part of the sequence is of biological interest. An example is given by the so-called CpG sites, regions of DNA where a cytosine nucleotide occurs next to a guanine nucleotide in the linear sequence of bases along its length. "CpG" is shorthand for "-C-phosphate-G-", that is, cytosine and guanine separated by only one phosphate. CpG regions have a key role in bioinformatics: many genes in mammalian genomes have CpG islands associated with the start of the gene (promoter regions). Because of this, the presence of a CpG island is used to help in the prediction and annotation of genes.

Figure 1. Diagram of HIV virion.

II. MATERIALS AND METHODS

The project was implemented in Matlab (The MathWorks, Inc., Natick, Massachusetts, United States) using the built-in Bioinformatics Toolbox for visualization and calculation.

A. The Genome

The complete genome of HIV consists of 9718 bases. CpG islands show very different conditional probabilities $P(X_{t+1} \mid X_t)$ from non-CpG-island sequence. This suggests that a Markov chain model can be used to detect them in any sequence. Given a 4 × 4 table of the conditional probabilities $P(X_{t+1} \mid X_t)$ measured in CpG islands, a Markov chain model of CpG island sequences can be constructed with the following state graph structure:

• The nodes (states) of the Markov chain state graph are just the four nucleotides A, C, G, T

• Every node has four outgoing edges, to itself and to the other three nucleotides

• The transition probabilities are just the conditional probabilities $P(X_{t+1} \mid X_t)$ given by the table
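The construction described in the list above can be sketched in a few lines. This is a minimal illustration in Python rather than the Matlab of the original project; the function names, the add-one pseudocount and the two toy training strings are assumptions made here for the example, not part of the paper. It estimates a 4 × 4 transition table from a sequence and scores a test sequence by the log-odds between an island model and a background model.

```python
import math
from collections import Counter

BASES = "ACGT"

def estimate_transitions(seq, pseudocount=1.0):
    """Estimate T[a][b] ~ P(X_{t+1}=b | X_t=a) from adjacent-base counts.

    The pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    counts = Counter(zip(seq, seq[1:]))
    table = {}
    for a in BASES:
        row_total = sum(counts[(a, b)] + pseudocount for b in BASES)
        table[a] = {b: (counts[(a, b)] + pseudocount) / row_total for b in BASES}
    return table

def log_odds(seq, island, background):
    """Sum of log ratios of island to background transition probabilities."""
    return sum(math.log(island[a][b] / background[a][b])
               for a, b in zip(seq, seq[1:]))

# Toy training strings, invented for illustration only (not HIV data).
island = estimate_transitions("CGCGGCGCCGCGACGCGTCGCG")
background = estimate_transitions("ATTATAAATGTTTACATTAAATCA")

score_cg = log_odds("CGCGCG", island, background)  # positive: island-like
score_at = log_odds("ATATAT", island, background)  # negative: background-like
```

A positive score means the sequence is better explained by the island table than by the background table; sliding such a score along a genome is one simple way to flag candidate regions.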

Figure 2. The first 1200 bases of the HIV genome. The CpG islands are highlighted. In the whole genome there are only 95 CpG regions.

Using the Hidden Markov Model we were able to calculate the number of states that best balances the trade-off between accuracy and complexity. Once this number had been obtained, we calculated the probability of finding CpG islands for every state: $P(\mathrm{CpG} \mid \mathrm{state})$.

B. AIC and Likelihood

To determine which model best fits the dataset as the number of hidden states increases (each hidden state representing a type of homogeneous DNA segment), two quantities were used.

The first is the likelihood, a measure of the accuracy of the model. As expected, the likelihood increases with the complexity of the model, that is, with the number of hidden states. A model chosen on accuracy alone would therefore have a large number N of hidden states.

The second is the AIC, an acronym for Akaike Information Criterion: a measure of the relative quality of a statistical model for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to the other models, and hence provides a means for model selection. AIC is calculated according to the following formula:

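$$\mathrm{AIC} = 2k - 2 \ln L \qquad (1)$$

where $k$ is the number of parameters, which is equal to $N^2 + 5N$, where $N$ is the number of states, and $L$ is the likelihood of the model. For simplicity we use the log-likelihood, defined as:

$$\mathcal{L}\left(\theta \mid \{x_i\}_{i=1}^{n}\right) = \ln L\left(\theta \mid \{x_i\}_{i=1}^{n}\right) \qquad (2)$$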

Comparing the behavior of the two functions, we can identify the number of states that can be considered the best according to the trade-off between predictive capacity and computational cost.
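As a small worked illustration of Eq. (1) with $k = N^2 + 5N$, the following Python sketch computes the AIC for a few candidate numbers of states and picks the minimum. The log-likelihood values in `toy_logliks` are invented for the example and are not the values obtained in this work; the function names are likewise assumptions of the sketch.

```python
def num_parameters(n_states):
    """Parameter count used in the text: k = N^2 + 5N."""
    return n_states ** 2 + 5 * n_states

def aic(log_likelihood, n_states):
    """AIC = 2k - 2 ln L, with ln L passed in directly."""
    return 2 * num_parameters(n_states) - 2 * log_likelihood

def select_by_aic(loglik_by_states):
    """Return the number of states with the lowest AIC."""
    return min(loglik_by_states, key=lambda n: aic(loglik_by_states[n], n))

# Hypothetical log-likelihoods (made up for illustration only).
toy_logliks = {2: -13400.0, 4: -13250.0, 8: -13100.0, 12: -13060.0}
best_n = select_by_aic(toy_logliks)
```

With these made-up numbers the log-likelihood keeps improving up to 12 states, but the penalty term $2k$ grows faster than the gain beyond 8 states, which is the kind of trade-off the criterion is meant to expose.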

C. Hidden Markov Model

The Hidden Markov Model is a tool for representing probability distributions over sequences of observations. A Hidden Markov Model has three defining properties:

• The observation $Y_t$ at time $t$ was generated by some process whose state $S_t$ is hidden from the observer

• The hidden state is discrete-valued: $S_t$ can take one of $K$ values $1, \dots, K$

• The state of the hidden process satisfies the Markov property: given the value of $S_{t-1}$, the current state $S_t$ is independent of all the states prior to $t-1$.

Figure 3. Schematic representation of the relationship between the states S and the observations Y.
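To make the three properties concrete, the sketch below evaluates the log-likelihood of an observation sequence under a discrete HMM with the scaled forward recursion. It is written in Python with NumPy rather than the Matlab toolbox used in the project, and the two-state parameters are toy values assumed here for illustration; the paper does not state which routine computed its likelihoods.

```python
import numpy as np

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood ln P(Y_1..Y_T) of a discrete HMM (scaled forward pass).

    obs   : list of symbol indices (here 0=A, 1=C, 2=G, 3=T)
    start : start[k]   = P(S_1 = k)
    trans : trans[i,j] = P(S_t = j | S_{t-1} = i)
    emit  : emit[k,y]  = P(Y_t = y | S_t = k)
    """
    alpha = start * emit[:, obs[0]]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the hidden chain, then weight by the emission.
            alpha = (alpha @ trans) * emit[:, obs[t]]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return loglik

# Toy two-state model (assumed values, not fitted to any genome).
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
emit = np.array([[0.4, 0.1, 0.1, 0.4],   # state 0 favors A and T
                 [0.1, 0.4, 0.4, 0.1]])  # state 1 favors C and G

ll = forward_loglik([0, 3, 1, 2, 2], start, trans, emit)
```

The value returned is the quantity that enters Eq. (1) as $\ln L$ once the model parameters have been fitted.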

D. CpG islands

Finally, the resulting model was used to find the CpG regions in the DNA sequence. We expect there to be states in which CpG sites are more likely to be found. These states represent segments of nucleotides of great biological interest, connected to the start of a gene.

III. RESULTS

The first result, shown in Figure 4, is the probability of the different bases and of the different base pairs along the sequence. As can be seen, adenine is the most frequent base, whereas the most frequent base pair is adenine-thymine. This analysis was used in the subsequent studies.

The number of states that best balances the trade-off between accuracy and complexity corresponds to the lowest value of the AIC function, as shown in Fig. 5. This number turned out to be 8, and so a model with 8 hidden states was created.


Figure 4. Graph showing the density of the different bases (top) and of the A-T and C-G base pairs (bottom).

Figure 5. AIC vs. log-likelihood. The best compromise between complexity and accuracy of the model is marked by the small red cross on the AIC curve.

Another important outcome is presented in Fig. 6, which shows the probability of finding CpG islands in each state. CpG islands appear only in states 3, 4 and 8, whereas no CpG islands occur in the other states.
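One possible way to obtain a per-state breakdown of this kind is sketched below in Python. It assumes that a hidden-state label is already available for every base and attributes each CpG dinucleotide to the state of its cytosine; both the attribution rule and the function name are assumptions of this sketch, since the text does not specify how the occurrences were counted. The sequence and state labels are toy values.

```python
from collections import Counter

def cpg_fraction_by_state(seq, states):
    """Fraction of all CpG dinucleotides whose cytosine lies in each state.

    seq    : string over A, C, G, T
    states : one hidden-state label per base (same length as seq)
    """
    counts = Counter(states[i] for i in range(len(seq) - 1)
                     if seq[i:i + 2] == "CG")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: n / total for state, n in counts.items()}

# Toy example: ten bases with invented state labels.
toy_seq = "ACGTCGCGAT"
toy_states = [1, 3, 3, 1, 3, 3, 8, 8, 1, 1]
fractions = cpg_fraction_by_state(toy_seq, toy_states)
```

States that never host a CpG site are simply absent from the returned dictionary, which corresponds to the empty sectors of a pie chart such as Fig. 6.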

For each nucleotide there is a set of preferential states in which the presence of that base is more likely. These results are shown in Fig. 7 using a 3D histogram. It is clear, for example, that in state 7 it is almost certain to find an adenine and, conversely, that if a nucleotide is an adenine, 51.39% of the time it is in state 7.

We also generated a sequence using the HMM created for prediction of the genome structure. The predicted sequence is very similar to the original one, with an accuracy of over 75%. The precision increases dramatically with the number of states: using, for example, 20 hidden states, the prediction error is less than 12%.

Figure 6. Pie chart showing the occurrences of CpG islands for every state. They appear only in states 3, 4 and 8.

Figure 7. 3D graph showing the different probabilities of nucleotides in the various states.

The results show that the nucleobases do not alternate in a random manner: for each nucleotide and each state, some next bases and next states are more probable than others. This can be seen in the transition probability matrix:

          State1       State2       State3       State4
State1    3.0030e-9    7.7163e-5    2.4315e-10   4.4288e-7
State2    0.2080       0.4484       1.3930e-10   0.1416
State3    0.1374       2.0784e-6    0.1092       9.7823e-14
State4    0.5213       0.0026       0.1899       1.7082e-12
State5    6.3914e-4    1.1263e-5    0.2419       0.2220
State6    5.0015e-5    0.1184       4.9515e-6    5.1514e-8
State7    2.5690e-6    0.2597       0.1023       0.4424
State8    1.4999e-9    2.3341e-9    0.2421       0.0393

          State5       State6       State7       State8
State1    1.2122e-6    0.0578       0.9421       3.2317e-11
State2    5.1007e-10   0.1631       6.8749e-5    0.0389
State3    0.0810       0.2680       0.0027       0.4017
State4    3.7319e-15   3.4583e-9    0.2862       3.9064e-8
State5    0.2923       0.2359       4.0401e-6    0.0072
State6    0.4269       0.1446       0.0073       0.3027
State7    0.1347       0.0478       4.3723e-4    0.0126
State8    0.0220       0.3315       0.2665       0.0987


and the emission probability matrix:

           State1     State2      State3     State4
Adenine    0.1181     0.2266      0.1146     0
Cytosine   0.1309     0           0.2939     0.1055
Guanine    0.0143     4.2194e-4   0          0.3776
Thymine    0.12433    0.1141      0          9.2421e-4

           State5     State6      State7     State8
Adenine    0          0.0267      0.5139     0
Cytosine   0          0           0          0.4698
Guanine    0.6076     0           0          0
Thymine    0          0.6067      0.0018     0.1520

Figure 8. The alternation of the 8 different states along the sequence of nucleobases of the HIV genome. The data are presented with a colormap in which red colors represent higher probability.
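A per-position state probability map like the one in Figure 8 can be produced by posterior decoding. The Python sketch below uses the scaled forward-backward recursions to compute $P(S_t = k \mid Y_1, \dots, Y_T)$ for every position; it is an assumption that the figure was obtained this way (the text does not name the decoding method), and the two-state parameters are toy values chosen for illustration.

```python
import numpy as np

def state_posteriors(obs, start, trans, emit):
    """Posterior P(S_t = k | Y_1..Y_T) via scaled forward-backward.

    trans[i,j] = P(S_t = j | S_{t-1} = i), emit[k,y] = P(Y_t = y | S_t = k).
    Returns an array of shape (T, K) whose rows sum to 1.
    """
    T, K = len(obs), len(start)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)

    # Forward pass with per-step normalization.
    alpha[0] = start * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass reusing the same scaling factors.
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy two-state model (assumed values): state 0 favors A/T, state 1 favors C/G.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
emit = np.array([[0.45, 0.05, 0.05, 0.45],
                 [0.05, 0.45, 0.45, 0.05]])

obs = [0, 3, 0, 1, 2, 1, 2]  # A T A C G C G with 0=A, 1=C, 2=G, 3=T
post = state_posteriors(obs, start, trans, emit)
most_likely = post.argmax(axis=1).tolist()
```

Plotting `post` as an image with positions on one axis and states on the other gives a colormap of the same kind as the figure; `most_likely` collapses it to a single state label per base.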

IV. CONCLUSION

By analyzing the information obtained from the models, we notice that an HMM with a low number of hidden states turns out to be less plausible than models with more hidden states. In fact, as the number of segment types increases, the log-likelihood increases, corresponding to a better fit of the model. Despite the improvement in the prediction of the structure of the HIV genome, the processing time and the complexity of a model with many states grow considerably with the number of hidden variables. The AIC was chosen to account for the variance and because it allows the trade-off between fit to the data and complexity of the model to be evaluated. The number of states that satisfies the compromise between computational cost and accuracy of prediction turns out to be 8. In addition, the analysis has highlighted that the nucleotide sequence of the genome of the HIV virus is not random. The Markov model, despite its strong assumption of the Markov property, is able to reproduce the dynamics involved with an acceptable number of parameters. A further application of Hidden Markov Models could be the study of the occurrences and locations of CpG islands within the genome, because in these regions the concentration of CpG sites is high. Many genes in mammals have these sequences associated with promoter regions. By virtue of this, the presence of CpG islands can help in the prediction of and search for genes.
