Post on 21-Dec-2015
1
Mapping Mutations in HIV RNA
By Nimrod Bar-Yaakov [email protected]
In cooperation with Dr. Zehava Grossman of Israel’s Multi-Center AIDS Study Group, National HIV Reference Laboratory, Tel-Hashomer.
2
Today’s Topics
HIV: what it is and how it operates.
What is so important about the HIV DNA mutations?
Extracting the RNA sequence for analysis.
Naïve view of the HIV RNA sequences.
Locating the RNA mutations.
Analysis of the RNA mutation interactions.
3
Virus Overview
Viruses may be defined as acellular organisms whose genomes consist of nucleic acid and which obligately replicate inside host cells, using host metabolic machinery and ribosomes to form a pool of components. These components assemble into particles called virions, which serve to protect the genome and to transfer it to other cells.
4
Virus Animation
5
Virus Overview
The concept of a virus as an organism challenges the way we define life: viruses do not respire, display irritability, move, or grow. They do, however, most certainly reproduce, and they may adapt to new hosts.
6
What is HIV?
Human immunodeficiency virus (HIV): a type of retrovirus that is responsible for the fatal illness Acquired Immunodeficiency Syndrome (AIDS).
Retrovirus: a virus that carries its genetic material in the form of RNA rather than DNA and has the enzyme reverse transcriptase, which can transcribe that RNA into DNA.
In most organisms, genetic information flows from DNA to RNA; "retro" indicates the opposite direction.
7
How does HIV infect the body's cells?
HIV begins its infection of a susceptible host cell by binding to the CD4 receptor on the host cell.
The genetic material of the virus, which is RNA, is released and undergoes reverse transcription into DNA, which enters the host cell nucleus, where it can be integrated into the genetic material of the cell.
Activation of the host cell results in the transcription of viral DNA into messenger RNA (mRNA), which is then translated into viral proteins.
The viral RNA and viral proteins assemble at the cell membrane into a new virus.
The virus then buds from the cell and is released to infect another cell.
8
Treatment related to the active RNA sites
The HIV DNA encodes proteins that are essential to the virus life cycle. Medical treatments interfere with or block the operation of these proteins.
Reverse transcriptase inhibitors: inhibit the reverse transcription of the HIV RNA into the DNA that is integrated into the cell’s DNA.
Protease inhibitors: the HIV protease protein is required to process other HIV proteins into their functional forms; protease inhibitors act by blocking this critical maturation step.
9
RNA mutations
Environmental and biological processes may cause mutations in the HIV RNA.
The mutated HIV RNA is reverse-transcribed into DNA, which is integrated into the infected cell’s DNA.
The resulting amino-acid sequence is then altered.
A different protein is generated by the cell.
The altered protein may resist the medical treatment!
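The chain from a changed base to a changed protein can be sketched in a few lines of Python. This is only an illustration: the codon table below is a small excerpt of the standard genetic code, and the example codons (`GTC`, `ACA`) and the helper names `mutate` and `effect` are assumptions chosen for the sketch, not codons taken from the HIV sequences in this talk.

```python
# Excerpt of the standard genetic code (DNA alphabet); only the codons used below.
CODON_TABLE = {
    "GTC": "Val", "GCC": "Ala",  # differ at the second base
    "ACA": "Thr", "ACG": "Thr",  # differ at the third base (synonymous)
}

def mutate(codon, offset, base):
    """Return the codon with the base at `offset` (0-2) replaced by `base`."""
    return codon[:offset] + base + codon[offset + 1:]

def effect(codon, offset, base):
    """Report the amino acid before and after a single-base substitution."""
    mutated = mutate(codon, offset, base)
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "silent" if before == after else "amino-acid change"
    return before, after, kind

print(effect("GTC", 1, "C"))  # a substitution that changes the amino acid
print(effect("ACA", 2, "G"))  # a substitution that leaves it unchanged
```

The second call shows why some mutations are silent: the base changes but the encoded amino acid does not.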
10
Mutation families
The HIV RNA has a high mutation rate (1000 times higher than that of a regular cell).
Fast evolutionary processes cause the best-adapted mutated viruses to increase their population in the infected body.
We’ll focus on 3 main mutation families:
Resistance mutations
Clade mutations
Other: noise/random
11
The importance of identifying the resistance mutations
Selecting the best medical treatments.
Understanding the way different medicines interact with HIV.
Understanding the functional interpretation of the RNA sequence.
12
Extracting the RNA Sequence
The RNA sequences are reverse-transcribed into DNA sequences.
The DNA sequences are then amplified (copied several times).
A DNA sequencer ‘reads’ the aligned DNA sequences.
The decision on how to interpret a specific DNA segment is based on image-processing algorithms (which define the segment boundaries and find the best match for the segment pattern) and isn’t deterministic!
13
Sequence Alignment (from Ron Shamir’s Course)
14
15
Sequence Alignment
Before alignment
AtaaagakagggggacagctaaaagaggctctcTTAGACACAGGAGCAGATGATACAACTCTTTGGCAGCGaCCCCGTTGTCACaATAAAAATagGGGGACAGCTAAgGGagGcTAAAAGAGGCTCTCTTAGCACACAGGMGCAGAYGAYACAGTMCTTASCAAGAAATAAACTCTTTGGCAGCGACCCCTTGTcACAATAAAAGTAGAGGGACAGCTAAGGGAKGCTACTCTTTGGCAGCGaCCCCTTGTCACAATAAAAATAGGGGACAGCTAAGGGAGGCTCACTCTTTGGcAGCGACCCCTtGTCACAATAAAAGtAGGGGGaCAGCTAAAgGAGGCTaCTnTTnGRCAGCGaCCCCTTgTCYCARtAAAAATAGGGGGGCAGRTAARGGAGGCt
After alignment
------------------------------ATAAAGAKAGGGGG-ACAG-CTAAAAGAGG
------------C-GACCCC--TTGTCACAATAARAATAGGGGG-ACAG-CTAAAAGAGG
ACTCTTTGGCAAC-GACCCC--TTGTCACAATAAGAGTAGGGGG-ACAG-CTAAAAGAGG
-CTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAAAATAGRAGG-ACAG-CTAAAAGAAG
ACTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAAAATAGGAGG-ACAG-CTAAAAGAAG
ACTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAAAATAGGAGG-ACAG-CTMAAAGAAG
ACTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAGAATAGGAGG-ACAG-CTAAAAGAAG
Degapping
---------------------------ATAAAGAKAGGGGGACAGCTAAAAGAGGC
------------CGACCCCTTGTCACAATAARAATAGGGGGACAGCTAAAAGAGGC
ACTCTTTGGCAACGACCCCTTGTCACAATAAGAGTAGGGGGACAGCTAAAAGAGGC
-CTCTTTGGCAACGACCCCTTGTCACAGTAAAAATAGRAGGACAGCTAAAAGAAGC
ACTCTTTGGCAACGACCCCTTGTCACAGTAAAAATAGGAGGACAGCTAAAAGAAGC
ACTCTTTGGCAACGACCCCTTGTCACAGTAAAAATAGGAGGACAGCTMAAAGAAGC
ACTCTTTGGCAACGACCCCTTGTCACAGTAAGAATAGGAGGACAGCTAAAAGAAGC
ACTCTTTGGCAACGACCCCTTGTCACAGTAAGAATAGGAGGACAGCTAAAAGAAGC
16
Reduction from Bio problem to CS Problem
Generate a consensus RNA sequence.
For each sequence, generate a matching binary sequence: each 1 represents a mismatch between the consensus and the original sequence, and each 0 represents a match.
Now we have a binary feature vector for each sample.
We can now calculate the correlations between the mutations and the treatment, and among the mutations themselves.
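A minimal sketch of this reduction, assuming the sequences are already aligned to equal length. The function names `consensus` and `mutation_vector` and the toy sequences are illustrative assumptions; the sketch also treats ambiguity codes and gaps as ordinary symbols, which the real pipeline may handle differently.

```python
from collections import Counter

def consensus(seqs):
    """Most common symbol in each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def mutation_vector(seq, cons):
    """Binary vector: 1 where the sequence differs from the consensus, else 0."""
    return [int(a != b) for a, b in zip(seq, cons)]

# Toy aligned sequences (hypothetical, not taken from the study data).
aligned = ["ACGTA", "ACGTA", "ATGTA", "ACGCA"]
cons = consensus(aligned)
matrix = [mutation_vector(s, cons) for s in aligned]

print(cons)
for row in matrix:
    print(row)
```

Each row of `matrix` is the binary feature vector of one sample; each column corresponds to one RNA index.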
17
From Sequences to Mutation Matrix
18
So where are the problems?
Curse of dimensionality
Noisy data
Sequenced data are of a stochastic nature
Small number of samples
Clades and sub-clades
Vague definitions of the independent variables' values
Silent mutations
Talk Bio language!
19
Data Overlook
20
Frequencies of Mutations occurrences
21
Filtering the Data
Mutations that occur fewer than 5 times at a specific RNA index cannot be considered significant (we’ll see this later in the Chi-square slides).
We’ll filter out all the mutations that occur fewer than 3 times and replace them with the consensus value, thus filtering out much of the noise.
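The filtering step can be sketched on the binary mutation matrix as follows. Replacing a rare mutation "with the consensus value" corresponds to setting its entry to 0. The function name `filter_rare`, its `min_count` parameter and the toy matrix are assumptions made for illustration.

```python
def filter_rare(matrix, min_count=3):
    """Zero out every column (RNA index) whose mutation count is below min_count."""
    counts = [sum(col) for col in zip(*matrix)]
    keep = [c >= min_count for c in counts]
    return [[v if k else 0 for v, k in zip(row, keep)] for row in matrix]

# Toy matrix: rows are samples, columns are RNA indices (hypothetical data).
toy = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
filtered = filter_rare(toy, min_count=3)
for row in filtered:
    print(row)
```

Column 0 is mutated in three samples and survives; column 1 is mutated only once and is reset to the consensus.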
22
Naïve clustering of Data
[Figure: clustering of 671 RNA samples using centroid linkage. Panels: clade distribution (A, C, B) and treatment distribution (treated, non-treated) per cluster. Cluster sizes: 120, 9, 12, 59, 8, 29, 215, 65, 147, 7; total cases: 671.]
23
Feature Extraction
Better to have a misdetection than a false alarm.
Filter the noisy data.
Work within the clades.
Locate the mutations (features) that are highly correlated with treatment.
Now we have only a few dozen features to work on.
24
Finding mutations and treatment correlation
For each RNA index i, we want to find whether P(Mut_in_i) is significantly different from P(Mut_in_i | Treatment).
We’ll use the Chi-square test at each index to find that.
25
Chi Square Overview
We will use the Chi-square test to check the probability that our observed results came from the same statistical population as the expected (chance) results.
A probability of less than 0.05 means that the results are significant, i.e. the populations are significantly different.
26
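Chi Square Calculations
Calculating the chi-square statistic over k categories, with observed counts $O_i$ and expected counts $E_i$:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
The probability Q that a $\chi^2$ value calculated for an experiment with d degrees of freedom (where d = k - 1) is due to chance is:
$$Q(\chi^2 \mid d) = \left[ 2^{d/2}\, \Gamma\!\left(\tfrac{d}{2}\right) \right]^{-1} \int_{\chi^2}^{\infty} t^{d/2 - 1} e^{-t/2}\, dt$$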
27
Example – Mutation V82A
TreatedMut       30
NonTreatedMut    12
TotalTreated     77
TotalUnTreated   389

Observed   Treated      NonTreated    Total
Mut        30           12            42
NonMut     47           377           424
Total      77           389           466

Expected   Treated      NonTreated    Total
Mut        6.93991416   35.06008584   42
NonMut     70.0600858   353.9399142   424
Total                                 466

Chi Statistic   10.71333
ChiVal          9.751E-24
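The expected counts and the probability for the V82A table can be reproduced with a short standard-library sketch. It assumes a plain Pearson chi-square on the 2x2 table with one degree of freedom and no continuity correction, in which case the upper-tail probability reduces to `erfc(sqrt(chi2 / 2))`. The function name `chi_square_2x2` is an illustrative assumption. The statistic computed here is the plain Pearson sum and need not equal the "Chi Statistic" figure on the slide, whose exact definition is not given.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]
    chi2 = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With one degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p, expected

# Observed counts from the slide: rows Mut / NonMut, columns Treated / NonTreated.
chi2, p, expected = chi_square_2x2(30, 12, 47, 377)
print(expected)
print(chi2, p)
```

The expected counts match the "Expected" table, and the probability comes out on the order of 1e-23, in line with the ChiVal reported for this mutation.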
28
Mutation Table
Clade  SequenceNum  Consensus  Mutation  AA pos  Consensus AA  Mut AA     ChiVal  TreatedMut  UntreatedMut
C      32           A          G         14      lys           arg        <0.05   44          33
C      39           G          A         16      Gly           Gly        <0.003  17          7
C      47           T          C         19      leu           pro        <0.05   45          38
C      50           A          G         20      lys           arg        <0.001  63          37
C      69           A          G         26      thr           thr        <0.05   9           4
C      96           A          C         35      glu           asp        <0.05   50          40
C      151          A          G         54      ileu          met (atg)  <0.001  22          10
C      155          A          G         55      ileu          val        <0.05   7           2
C      162          A          G         57      arg           arg        <0.05   7           2
C      203          C          T         71      ala           val        <0.001  20          8
C      211          A          T/G       74      thr           ser/ala    <0.001  57          39
C      217          C          T         76      leu           ser        <0.05   105         117
C      235          G          A         82      val           ile        <0.05   8           26
C      236          T          C         82      val           ala        <0.001  30          12
C      241          A          G         84      ileu          val        <0.02   12          6
C      259          T          A         90      leu           met        <0.001  41          22
C      276          C          T         95      cys           cys        <0.05   106         117
29
Calculating the mutations Correlations Matrix
Because treatment is a major factor in all the treatment-related mutations, we have to look for the correlations within the treated samples only, comparing
P(mut_A | Treat.) with P(mut_A | mut_B, Treat.)
Our Chi-square table will be (all in treated cases):

             Mut B    Non-Mut B    Total
Mut A
Non-Mut A
Total
30
Example correlation results
Total Samples: 466    Total Treated: 195

Site A  Site B  Chi value  A mutations  B mutations  Shared  P(A)    P(A | B)
19      50      0          44           63           28      0.2256  0.4444
19      259     0          44           41           20      0.2256  0.4878
50      96      0.0001     63           50           27      0.3231  0.54
69      39      0.0001     9            17           4       0.0462  0.2353
151     203     0          22           20           12      0.1128  0.6

Site A / Site B: first and second site index. A mutations / B mutations: number of mutations at each site. Shared: number of samples mutated at both sites. P(A): probability of a mutation in A. P(A | B): probability of a mutation in A when B is also mutated.
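The two probability columns follow directly from the counts, and the same counts fill the 2x2 table used for the chi-square test within the treated samples. A minimal sketch, using the counts reported for sites 19 and 50; the function name `pair_stats` is an illustrative assumption.

```python
def pair_stats(n_treated, n_a, n_b, n_shared):
    """Return P(A), P(A | B) and the 2x2 table for two mutation sites.

    Table rows: Mut A / Non-Mut A; columns: Mut B / Non-Mut B.
    """
    p_a = n_a / n_treated
    p_a_given_b = n_shared / n_b
    table = [
        [n_shared, n_a - n_shared],
        [n_b - n_shared, n_treated - n_a - n_b + n_shared],
    ]
    return p_a, p_a_given_b, table

# Sites 19 (A) and 50 (B): 195 treated samples, 44 and 63 mutations, 28 shared.
p_a, p_a_given_b, table = pair_stats(195, 44, 63, 28)
print(round(p_a, 4), round(p_a_given_b, 4))
print(table)
```

A large gap between P(A) and P(A | B), as here, is what the chi-square test then checks for significance.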
31
Example – Mutation D30N
D30N is an important resistance mutation.
But it appears at a frequency of 0.0258 in clade C, compared with 0.0945 in clade B. What is the explanation for this?
Correlation analysis reveals that in clade B, D30N is highly correlated with other resistance mutations. In clade C it is not.
One hypothesis is that the clade B structure influences the connections between resistance mutations.
32
Using CART to find mutation interactions
A regression tree is a sequence of questions that can be answered yes or no, plus a set of fitted response values. Each question asks whether a predictor satisfies a given condition.
In our research we ask whether a mutation i (a value of 1 at index i) predicts the existence of a mutation j (a value of 1 at index j).
This way we can identify relationships between the mutations.
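The first question a tree would ask about a target mutation j can be sketched as a single split search. This is a simplified stand-in for CART, not the tool used in the study: it scores each candidate mutation i by the reduction in Gini impurity of j, which for a 0/1 response is proportional to the variance reduction a regression tree would use. The function names and the toy matrix are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(matrix, target):
    """Find the mutation index whose yes/no question best separates `target`."""
    y = [row[target] for row in matrix]
    best_index, best_gain = None, 0.0
    for i in range(len(matrix[0])):
        if i == target:
            continue
        yes = [row[target] for row in matrix if row[i] == 1]
        no = [row[target] for row in matrix if row[i] == 0]
        weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / len(y)
        gain = gini(y) - weighted
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain

# Toy data (hypothetical): mutation 0 always co-occurs with mutation 2.
toy = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
index, gain = best_split(toy, target=2)
print(index, gain)
```

A full tree repeats this search inside each branch until a stopping rule is met.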
33
CART results – D95M
34
Using clustering to find mutations patterns
We’ll cluster the mutation sample vectors in order to locate mutation patterns.
Our distance function will be the sum of the differences between two samples, i.e. the number of indices at which their binary vectors differ.
We’ll use the Ward method to cluster nodes.
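For binary mutation vectors, the sum of differences is the Hamming distance. A minimal sketch of the distance and of the pairwise distance matrix that a clustering routine would consume; the function names and sample vectors are illustrative assumptions.

```python
def distance(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_matrix(samples):
    """Symmetric matrix of pairwise distances between samples."""
    return [[distance(u, v) for v in samples] for u in samples]

# Hypothetical mutation vectors for three samples.
samples = [
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
dist = distance_matrix(samples)
for row in dist:
    print(row)
```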
35
Ward Clustering
Centroid linkage uses the distance between the centroids of the two groups:
$$d_{rs} = \lVert \bar{x}_r - \bar{x}_s \rVert_2$$
where $\bar{x}_r = \frac{1}{n_r} \sum_{i=1}^{n_r} x_{ri}$ and $\bar{x}_s$ is defined similarly.
Ward linkage uses the incremental sum of squares; that is, the increase in the total within-group sum of squares as a result of joining groups r and s. It is given by
$$d_{\text{Ward}}(r,s) = \frac{n_r\, n_s}{n_r + n_s}\, d_{rs}^2$$
where $d_{rs}$ is the distance between cluster r and cluster s defined in the centroid linkage. The within-group sum of squares of a cluster is defined as the sum of the squares of the distances between all objects in the cluster and the centroid of the cluster.
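The claim that the Ward distance equals the increase in the within-group sum of squares can be checked numerically. The sketch below computes both quantities on two small, made-up clusters; the helper names are assumptions, and the Ward increment is taken as n_r n_s / (n_r + n_s) times the squared centroid distance.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_increment(r, s):
    """n_r * n_s / (n_r + n_s) times the squared distance between centroids."""
    cr, cs = centroid(r), centroid(s)
    d2 = sum((a - b) ** 2 for a, b in zip(cr, cs))
    return len(r) * len(s) * d2 / (len(r) + len(s))

# Two hypothetical clusters of binary mutation vectors.
r = [[0, 0, 1], [0, 1, 1], [1, 0, 1]]
s = [[1, 1, 0], [1, 1, 1]]

increase = within_ss(r + s) - within_ss(r) - within_ss(s)
print(ward_increment(r, s), increase)
```

The two printed values agree up to floating-point error, which is why Ward linkage merges the pair of clusters that inflates the total within-group sum of squares the least.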
36
Cluster results
37
Using clustering to find mutations patterns
When we filter the mutations down to the significant ones only, a mutation pattern emerges from the clustering:
[Figure: clustered mutation matrix; axes: Mutations, Samples.]
38
What’s next?
Biological interpretation of the findings: locating amino-acid and protein functional changes. This may lead to a better understanding of resistance behavior.
Identifying new resistance mutations and specific treatment/resistance correlations.
Focusing on specific treatments and applying additional research to investigate the efficiency of such treatments.
39
The End!
Thank you for listening