CHAVA
What is CHAVA?
CHAVA (CNV HMM Analysis Visual Application) is a visual tool developed
in Java to help in the calling of Copy Number Variations (CNVs)
from data generated in array-based Comparative Genomic Hybridization
(aCGH) experiments.
In aCGH experiments, DNA from two individuals (we will call them "target" and
"reference" throughout this tutorial) is labeled with two different dyes and
mixed. This DNA mix is hybridized to an array of probes with known genomic
positions and, for each probe, the difference in intensity between the two dyes
is measured. We expect that in zones where both individuals have the same number
of DNA copies, the intensities will be similar. In zones where the two
individuals differ in copy number (for example, one has 2 copies, one on
each chromosome, while the other has a full deletion), there will be differences
in intensities.
As a quality control measure, the experiment is often repeated swapping the dyes
used in each individual (throughout this tutorial we will refer to these two
experiments as "direct" and "swap"). Results from both experiments should be
equivalent. Given that we always measure the intensity of the same dye
compared to the other, regions with high values in one experiment should
correspond to regions with low values in the other and vice versa.
The data resulting from each of these experiments consist of a set of values (a
function of the differences in intensity), one for every probe, together with
the probe's genomic position.
Here we have an example of these data in the intensity file format that CHAVA
uses:
1 8000 -0.333 -0.336
1 8500 -0.175 -0.178
1 9000 0.191 0.188
1 9350 -0.3 -0.303
1 9900 -0.077 -0.08
1 10250 -0.448 -0.451
1 10800 0.32 0.317
1 11250 -0.224 -0.227
1 11650 -0.153 -0.156
Where every column corresponds, from left to right, to:
Chromosome
Position
Raw intensity difference
Intensity difference
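As an illustration, a file in this format could be read with a few lines of
Python. This is only a sketch: it assumes that the four columns are separated
by whitespace, and the names Probe and parse_intensity_lines are hypothetical
and not part of CHAVA.

```python
from typing import Iterable, List, NamedTuple


class Probe(NamedTuple):
    chrom: str        # column 1: chromosome
    position: int     # column 2: genomic position
    raw_diff: float   # column 3: raw intensity difference
    diff: float       # column 4: intensity difference


def parse_intensity_lines(lines: Iterable[str]) -> List[Probe]:
    """Parse whitespace-separated intensity lines (assumed delimiter)."""
    probes = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        chrom, pos, raw, norm = fields
        probes.append(Probe(chrom, int(pos), float(raw), float(norm)))
    return probes


sample = """\
1 8000 -0.333 -0.336
1 8500 -0.175 -0.178
1 9000 0.191 0.188
""".splitlines()

probes = parse_intensity_lines(sample)
print(len(probes), probes[0])
```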
Markers used in a CGH experiment can be distributed regularly along the
genome or they can be clustered around certain zones that are considered
interesting as candidates to harbor CNVs. Some common commercial arrays have
this kind of clustered structure, for example the 385k-probe oligonucleotide
NimbleGen array employed to generate the data used in this tutorial.
Once these experiments have generated the data on intensity differences, a
statistical analysis is required to decide which regions show evidence of CNV
presence. This decision process is known as "calling". There are diverse
procedures and available software packages to help us with the calling. An
approach using Hidden Markov Models (HMM) is frequently used.
An HMM assumes that a string of data has been generated by an advancing
automaton which, at each step, is in a certain state (one from a finite list)
and generates the data for that step with a probability that is a function of
the current state. This function is referred to in the HMM literature as the
"emission probabilities". Also, when advancing a step, the automaton can change
from its current state to a new one with a probability that is also a function
of the current state (called in this case the "transition probabilities").
Given a string of data and the emission and transition probabilities, we can
estimate the string of states most likely to have generated the data. In fact,
HMM is a model of choice in many problem-solving situations because the Viterbi
algorithm finds the exact maximum likelihood sequence of states in short
computational time.
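The following Python sketch shows the Viterbi algorithm on a toy string of
intensity values with three states and normally distributed emissions. It is
an illustration under stated assumptions, not CHAVA's implementation: the
function names, the transition and start probabilities, the emission SD of 0.3
and the data values are all made up for the example.

```python
import math

STATES = ["deletion", "identity", "amplification"]


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution (the emission model)."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(values, means, sds, trans, start):
    """Return the maximum likelihood sequence of state names."""
    n = len(STATES)
    # score[s]: best log-probability of any path ending in state s
    score = [math.log(start[s]) + log_normal_pdf(values[0], means[s], sds[s])
             for s in range(n)]
    back = []  # back[t][s]: best previous state for state s at step t + 1
    for x in values[1:]:
        new_score, pointers = [], []
        for s in range(n):
            candidates = [score[p] + math.log(trans[p][s]) for p in range(n)]
            best_prev = max(range(n), key=candidates.__getitem__)
            pointers.append(best_prev)
            new_score.append(candidates[best_prev]
                             + log_normal_pdf(x, means[s], sds[s]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# Illustrative parameters (assumptions, not CHAVA defaults)
means = [-1.0, 0.0, 1.0]
sds = [0.3, 0.3, 0.3]
trans = [[0.95, 0.049, 0.001],
         [0.025, 0.95, 0.025],
         [0.001, 0.049, 0.95]]
start = [0.05, 0.9, 0.05]
values = [0.05, -0.1, 0.0, -1.1, -0.9, -1.0, -1.2, 0.1, -0.05, 0.0]

path = viterbi(values, means, sds, trans, start)
print(path)
```

Working with log-probabilities avoids numerical underflow on long strings of
markers, and the running time grows linearly with the number of markers.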
In the case of CNV calling, the string of data will be the relative intensity values
resulting from the aCGH experiment and the string of states that we have to
estimate will be a value of "deletion"/"identity"/"amplification" for each marker.
HMM is an efficient tool of direct application here because of the linear nature
of the data we are working with (markers lying consecutively along the
genome). Using HMM, however, can be trickier when markers are clustered in
certain zones or when experiments present a lot of noise because of poor DNA
quality or other reasons. In these situations, determining the right HMM
parameters to use (emission and transition probabilities) becomes a complex and
cumbersome task.
CHAVA intends to help in these situations by providing a graphical
tool that allows visual navigation of CGH results, estimated calls and,
optionally, further added information. Using these tools and the consistency
statistics between experiments, users can rely on their visual intuition to
tune the HMM parameters until they provide a satisfactory HMM calling.
Downloading and Starting CHAVA
After unzipping the compressed file that we can download from the web
(http://bioevo.upf.edu/~cmorcillo/tools.htm), we should find the following
content:
Documents
Icons
cnv.bat
cnv.jar
Data
lib
The "Data" directory includes files with aCGH intensity ratio data from a study
comparing CNV regions in primates (Gazave, E. et al. 2011). Measures in the
example files correspond to DNA from two chimpanzees hybridized against a
customized oligonucleotide NimbleGen array based on the human genome. In
the data we will use as examples, both case and reference samples are
chimpanzees but the array is made of human probes. Using crossed species can
cause added problems to the hybridization process that can result in worse
consistency between experiments and less reliability of the obtained calls.
CHAVA is a Java application that should run on every platform. To help the
graphical processes, the program is started with 1.4 GB of RAM. On
computers with little memory the application may run slowly. 2 GB of RAM
and a screen resolution of 1280x600 or more are needed to run
comfortably.
* Start the program:
java -jar -Xmx1400m cnv.jar
Alternatively, in Windows environments we can simply click the cnv.bat file.
Uploading intensity data
We are now going to upload the intensity differences files for the direct and
swap experiments.
* Upload "direct file":
Menu File / Open direct File
Select file Data/px22_dir.data
We obtain:
Every vertical segment corresponds to a probe and its length is equivalent to the
log2 ratio of intensity differences between case and reference sample. Blue lines
mark the scale for intensity difference values (-2, -1, 0, +1 and +2 standard
deviations).
The experiment we just uploaded contains around 350,000 probes. In the current
window the first 10,000 probes appear. The program, as we will see later, allows
navigation along all the experiment with different levels of resolution.
It is expected that in regions without differences in CNVs between the target and
reference samples, the values of markers will drift around zero showing only
statistical noise. However, in regions with CNV differences, we expect a
consistent bias towards one direction along the whole fragment. In our data,
there is a region that clearly shows copy differences between target and
reference samples (marked in blue):
That can be interpreted as a lack of a certain number of copies in the target
sample that are present in the reference sample.
If we hover with our mouse over this region, we obtain its coordinates in the
Position Info Panel (marked in red in the previous image).
The figure above tells us that we are hovering exactly over marker 2,089,
located on chromosome 1 at position 2,662,663.
If the zones showing intensity differences are real CNVs and not experimental
artifacts, we expect a coincidence between the direct and swap experiments. To
check it, let's upload the data from the swap experiment:
* Upload "swap file":
Menu File / Open swap File
Select file Data/px22_sw.data
In the fragment that we just selected, we can see that the values of the swap
experiment fit perfectly with what we would expect: a deviation of similar
length and intensity but with inverted sign because dyes were swapped between
samples.
HMM Calling
However, not all fragments with CNVs are going to be as conspicuous as the one
we just saw and we need, moreover, a statistical criterion to distinguish the
candidate regions. For example, there is another region in this same window,
which presents smaller intensity differences but, since it is much longer, may
also be indicative of the presence of CNVs:
CHAVA, as explained in the introduction, uses a Hidden Markov Model to
decide the CNV calling.
Our first step will be to call CNVs using the program's default HMM
configuration values without considering whether they are suitable or not. (In
fact they are not, and we are going to get a defective calling that we will
improve along the tutorial.)
* For the direct experiment, press the RUN button placed at the bottom left,
next to the Direct Information Panel (the panel, at the moment, looks quite
empty, but don't worry about this now).
We obtain:
After running, we can see, marked in red and green, the regions that the HMM
estimation has considered candidates to be regions with deletions and
amplifications in the target sample relative to the reference sample.
At this point, a set of statistics summarizing the results of the calling just
performed is written into the Direct Information Panel:
They inform us that the numbers of calls (defined as continuous segments with
the same status) of type deletion (-) and amplification (+) estimated along the
experiment are 377 and 274, respectively. They also show us the amount of
Kbases included in those calls. Recall that these numbers correspond to the
whole experiment and not only to the first 10,000 markers that are shown in the
window at this moment.
We also run the HMM estimation for the swap experiment:
* For the swap experiment, press the RUN button placed at the bottom right,
next to the Swap Information Panel.
We obtain:
We can see that, just as in the direct case, the colored segments and the statistics
corresponding to the swap experiment have appeared in the window.
Since we have now performed the HMM estimation for both experiments, an
additional set of information appears that compares them and gives us a glimpse
of their consistency. For example, if we look at the Swap Information Panel,
we see:
All these figures can be interpreted as follows:
521 calls of the deletion kind (-) have been estimated, of which 146 are
coincident with the direct experiment. A call is considered coincident if,
somewhere in its span, there is a call of opposite sign in the complementary
experiment.
These 521 calls cover 54.3 Kbases, and in 6.6 of these Kbases there is call
coincidence with the complementary experiment.
Finally, we have global figures for all kinds of calls, which tell us that the
coincidence between both experiments is 34% taking calls as the comparison unit
and 23% if the coincidence is measured as DNA length.
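The coincidence criterion described above can be sketched in Python as follows.
This is a hypothetical illustration, not CHAVA's code: the names Call, overlap
and coincidence are made up, calls are assumed to be given as inclusive genomic
intervals, and the calls of one experiment are assumed not to overlap each
other.

```python
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    chrom: str
    start: int
    end: int
    sign: int  # -1 for deletion, +1 for amplification


def overlap(a: Call, b: Call) -> int:
    """Number of bases shared by two calls (inclusive coordinates assumed)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def coincidence(calls: List[Call], other: List[Call]) -> Tuple[int, int]:
    """Count calls, and bases, matched by an opposite-sign call in `other`."""
    matched_calls = 0
    matched_bases = 0
    for call in calls:
        lengths = [overlap(call, o) for o in other if o.sign == -call.sign]
        if any(lengths):
            matched_calls += 1
        matched_bases += sum(lengths)
    return matched_calls, matched_bases


# Made-up calls for the example
direct = [Call("1", 1000, 2000, -1), Call("1", 5000, 6000, +1)]
swap = [Call("1", 1500, 2500, +1), Call("1", 5200, 5800, +1)]

result = coincidence(direct, swap)
print(result)
```

In this toy example the first direct call (a deletion) is matched by a swap
amplification, while the second direct call is not matched because the swap
call over the same region has the same sign.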
Clearly, the CNV calling done up to this point is quite unsatisfactory:
numerically, because the statistics are poor, and also visually, where we can
see that the calling didn't work. We have lots of very short calls which show,
moreover, little consistency between experiments. For example:
The images above are clearly full of false positives coming from statistical noise.
There are also some regions where there seems to be a clear CNV that has been
fragmented by the excessive sensitivity of the calling as performed up
to now:
So the next step is to work with the HMM configuration options to try to
improve the quality of our calling.
HMM Configuration
The Hidden Markov model used in CHAVA for CNV estimation assumes the
existence of 3 states (identity, deletion, amplification). Each state has its
associated transition probabilities to other states and a probability of emission of
a continuous value (the log2 of the ratio of intensities) that follows a normal
distribution defined by its mean and standard deviation.
Direct and swap experiments are processed using two independent models that
can have a different set of parameters. This makes biological sense, since we are
studying two different experiments, which can vary considerably in their
conditions of execution.
We can consult and modify the parameters of our models pressing the Config
Button placed next to the direct and swap Information Panels.
* Press the Config Button of the direct experiment:
A window appears with the values of the parameters of the direct model:
Basically, we can see that emissions for each state are centered at -1, 0 and 1
with a standard deviation of 1. We can also see that the transition
probabilities are set so that state changes are rare and so that direct
transitions between deletion and amplification, without crossing the identity
state, are discouraged.
Using this window we can manually modify whatever parameter we want and
we can also save its content to a file or upload configurations previously saved
into files of the type *.hmc (hmm configuration).
In this case, we are now going to upload a file called "sd3.hmc".
* HMM configuration window MENU: File - Open
Open the file "Data/sd3.hmc":
The HMM configuration should now look as follows,
where the only thing that has changed from the previous configuration is the SD
of all emissions, which now has a value of 3.0 instead of the former 1.0.
Press the OK Button and the HMM configuration will be updated.
We perform the same operation with the HMM configuration of the swap
experiment until we obtain, either by uploading the same file "sd3.hmc" or by
modifying the fields manually, the same configuration, and we press the OK
Button.
Now we can execute the calling again for both experiments:
* Press Run Button of direct experiment:
* Press Run Button of swap experiment:
and we get a calling that looks much cleaner and that is closer to what we
expected.
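One way to see why a larger emission SD gives a cleaner calling is to compare,
for a single marker, how much more likely its value is under the deletion state
than under the identity state. The Python sketch below is an illustration only:
the function names are hypothetical, the state means (-1 and 0) are the default
emission centers mentioned earlier, and the marker value is made up.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def evidence_for_deletion(x, sd, mean_del=-1.0, mean_id=0.0):
    """Log-likelihood of x under 'deletion' minus that under 'identity'."""
    return log_normal_pdf(x, mean_del, sd) - log_normal_pdf(x, mean_id, sd)


# A marker sitting exactly on the deletion mean (made-up value)
x = -1.0
for sd in (1.0, 3.0):
    print(sd, evidence_for_deletion(x, sd))
```

With equal SDs in both states, the log-likelihood difference is
-(2x + 1) / (2 SD^2), so going from SD=1 to SD=3 divides the evidence
contributed by each marker by 9. Short runs of noisy markers are then less
likely to outweigh the cost of a state change.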
Before going on with the different options of data analysis that CHAVA
provides, let's have a look at the navigation capabilities of the program to browse
all the data in the experiments.
Study Navigation
The initial view we have been working with until now presents the first 10,000
markers of both experiments. We can navigate along our study using the
Navigation Arrows.
To the Beginning / Back / Forward / To the end
If we keep pressing the Forward Arrow, we will have a quick landscape view
of the quality of the calling performed a moment ago. Notice what happens with
a few consecutive clicks:
(A red line appears when markers change chromosome)
etc...
The key combinations Alt + Right Arrow and Alt + Left Arrow are equivalent to
the Forward and Back Buttons.
In the upper left corner we can find the Navigation Panel, where the first
marker and the total number of markers of the view can be set.
If we enter new values (with a maximum of 10,000 for the number of markers)
and press "enter", the image will adapt to the new settings.
For example:
* Using the Navigation Panel, we change our view so that the first marker shown
will be 75,000:
(after entering your numbers into the First Marker Box, remember to press
enter and resist the temptation of pressing the Reset Button)
Whenever we want to, we can zoom in with the mouse just by clicking on the
zone of interest.
* Click with the mouse on the red segment of the direct experiment:
We can go on zooming in until we reach the desired resolution level.
If we click again:
For zooming out one level we can use the "zoom out" option of the right mouse
button.
The other two options that can be called with the right mouse button allow us to
center the image on the clicked point and to save the current image into a PNG
format file.
If we want to get back directly to the bigger view (10,000 markers) instead of
zooming out step by step, we can press the Number of Markers Reset Button
in the Navigation Panel.
Tracks
The data generated by the experiment and the current calling visualization can
be combined with further information from other sources. In our case, we have
information about a previous CNV calling performed using a different algorithm
on these same data.
We have this information codified in 2 files: "direct_old_call.trk" and
"swap_old_call.trk" for the direct and swap experiments respectively.
These files describe a set of segments providing their genomic positions together
with the calling value.
CHR START END VALUE
1 703450 753950 0.2816
1 761376 795876 0.2193
1 822700 864000 -0.2277
1 2617163 2666063 -1.0053
1 12973688 12999288 -0.2548
1 12999788 13013288 0.5834
CHAVA will interpret segments with values < 0 as deletions and segments with
values > 0 as amplifications.
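As a sketch of how such a track could be read and classified outside CHAVA, the
Python snippet below parses the header line and labels each segment by the sign
of its value. The function name parse_track is hypothetical, whitespace is an
assumed delimiter, and the label used for a value of exactly 0 is an
assumption, since the behaviour in that case is not described here.

```python
def parse_track(lines):
    """Parse a track with a 'CHR START END VALUE' header into tuples."""
    header = lines[0].split()
    segments = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        row = dict(zip(header, fields))
        value = float(row["VALUE"])
        if value < 0:
            kind = "deletion"
        elif value > 0:
            kind = "amplification"
        else:
            kind = "undefined"  # assumption: value 0 is not covered by the text
        segments.append((row["CHR"], int(row["START"]), int(row["END"]), value, kind))
    return segments


track_text = """\
CHR START END VALUE
1 703450 753950 0.2816
1 822700 864000 -0.2277
1 2617163 2666063 -1.0053
"""

segments = parse_track(track_text.splitlines())
for segment in segments:
    print(segment)
```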
Let's upload these files as visual tracks into our graph:
* Go back to the original view using the First Marker and Number of
Markers Reset Buttons.
* Upload as a Direct Track the file with the previous direct experiment calling
information.
Menu File/ SetTracks / Direct
Select the file "direct_old_call.trk".
Press OK
The Direct Track appears over our direct data plot:
where the deletion calls appear as red segments and the amplifications as green
ones.
* Repeat the operation for the Swap Track with the file "swap_old_call.trk"
using the option:
Menu File/ SetTracks / Swap
and we finally obtain:
Now we can navigate throughout the whole experiment using the movement and
zoom tools to compare the results of our current calling to those obtained with
the former technique. For instance:
The current calls of our experiments can easily be converted into Direct and
Swap Tracks. Let's do it:
* Menu: Report / Direct Calls / To Track
* Menu: Report / Swap Calls / To Track
We can now see that the Direct and Swap Tracks coincide with our current
HMM estimated calls:
Go back to the initial view.
All the previous options will be useful in the process of refining HMM
parameters.
As an example, we are going to modify the direct HMM configuration, run the
HMM estimation again and compare the newly obtained calls with the old
configuration ones which we recorded a moment ago as a track:
* Modify the direct experiment HMM configuration. Set SD=4 for all 3 states.
* Execute HMM estimation for the direct experiment with the new
configuration:
and this is what we obtain:
In the zone marked in red we can see how the new HMM estimation with SD=4
has created a single continuous segment where the old HMM estimation
with SD=3 created two different segments:
Besides the Direct and Swap Tracks associated with the experiments, there are
two other tracks, called the Consensus I and Consensus II Tracks. They were
originally conceived to add information about the final resulting global call,
but they can also be used to add any data that we may consider relevant to our
study.
For example, the file "Data/genes.trk" contains a track with the list of
human genes, downloaded from the Homo sapiens gene database of Ensembl
(GRCh37.p2). When making the track, a "LABEL" column with the HGNC
(HUGO) gene name for each gene was added:
START END CHR LABEL
33772367 33786699 1 A3GALT2
12776118 12788726 1 AADACL3
12704566 12727097 1 AADACL4
94458393 94586688 1 ABCA4
229652329 229694442 1 ABCB10
94883933 94984222 1 ABCD3
179068462 179198819 1 ABL2
76190036 76229364 1 ACADM
...
* Upload the file "Data/genes.trk" as a Consensus I Track
Menu: File/Set Tracks/Consensus I
We get the new track in blue in the center of the image, marking the genes, so
that it can be used as a reference when analyzing our CNV calls. If we hover
with the mouse over a track segment, the corresponding gene label will appear.
To make data visualization clearer, we can select at any given moment which
ones among the uploaded tracks are going to be shown, using the option:
Menu: View/Tracks:
Structure
In many cases, markers employed in arrays designed for aCGH are not
distributed in a uniform way along the genome but are clustered in certain
regions of interest. This raises a number of issues in the process of visualization
and analysis of data.
For instance, the NimbleGen array used to generate the data of this tutorial
clustered probes in regions that presented evidence of harboring CNVs in
previous experiments that used a range of different technologies (Gazave, E. et
al. 2011).
If we slowly run the mouse over the markers of the plot and look at their genome
positions shown in the Position Info Panel, we can observe that sometimes
there are sudden changes that indicate regions without probe coverage. Of
course, if we enter a track with segments located in zones with no markers, these
segments will not be shown in CHAVA's panel. We can tell the program to
inform us when gaps beyond a given size are detected:
* Visualize gaps between markers with the option:
Menu: View/By distance
Now, orange lines appear between those markers that are separated by a distance
greater than the maximum allowed gap (by default 10 Kb).
We can, of course, change that default gap size.
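The gap criterion can be sketched in Python as follows. The function name
find_gaps and the marker positions are made up for the example, and the choice
of not reporting a gap between markers on different chromosomes is an
assumption.

```python
def find_gaps(markers, max_gap=10_000):
    """Return the indices of markers that follow a gap larger than max_gap.

    `markers` is a list of (chromosome, position) pairs sorted along the genome.
    """
    gaps = []
    for i in range(1, len(markers)):
        chrom_a, pos_a = markers[i - 1]
        chrom_b, pos_b = markers[i]
        # Assumption: a chromosome change is not reported as a gap
        if chrom_a == chrom_b and pos_b - pos_a > max_gap:
            gaps.append(i)
    return gaps


markers = [("1", 8000), ("1", 8500), ("1", 30000), ("1", 30400), ("1", 200000)]

print(find_gaps(markers))                   # default maximum gap of 10 Kb
print(find_gaps(markers, max_gap=100_000))  # a larger maximum gap marks fewer gaps
```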
* Set the maximum allowed gap to 100 Kb
Menu: View/Set Max Gap
Now we have, logically, fewer gap marks.
Since we don't need it now, we can hide the Consensus I Track to simplify the
current view:
* Hide Consensus I Track
Menu: View/Tracks/Consensus I
The visualization of probe gaps can give us information that can be useful in
certain circumstances. What is really interesting, however, is the
possibility of using CHAVA with any kind of array.
To that end, CHAVA uses a structure: a description of the distribution of
probes along the genome that groups the markers into segments that can be
treated independently. To be of use, the structure has to be consistent with
the clusters of probes in the array used in our experiment.
A structure can be, simply, a definition of a few segments we want to study in
our experiment. The file "structure_5_segments.str" contains a dummy definition
of 5 segments just to show the concept and see how it works:
CHR START END INIT_MARKER END_MARKER
1 476579 496779 299 337
1 12741839 12806959 3266 3396
1 13621050 13649400 4507 4559
1 63134919 63313697 5769 6145
1 70018759 70060225 6146 6230
* Upload the structure defined in the file "structure_5_segments.str"
Menu: File / Open Structure File
* Change to view mode "By Structure"
Menu: View/By Structure
And what we obtain is this:
Looking closer:
There are now vertical orange lines which indicate the limits of the segments
that we have defined, plus thick horizontal lines (pointed to by the blue
arrows) which show the length of every segment and allow us to distinguish
which zones are inside or outside every segment.
When processes associated with the structure of the array are executed, they
will apply only to those markers included in the defined segments.
The file "Data/structure.str", instead of having only 5 little segments defined, as
was the case of the previous structure, presents a clustering of almost all probes
of our data into segments.
* Upload the file structure "Data/structure.str"
Menu: File / Open Structure File
and we have:
Since the current structure covers almost all probes, the horizontal orange line,
that shows the segment length, seems to be continuous. There are, however, little
gaps that can be located:
* Using the Navigation Panel, let's go to position 312 with a 300-probe view.
We obtain:
The regions marked by the red circles correspond to probes between close
segments that do not belong themselves to any segment. These zones will be
excluded when performing structure associated processes.
We are now going to run the HMM estimation again but, instead of treating the
list of probes as a single continuous sequence of emissions, there is going to
be an independent estimation for each of the segments defined in the structure.
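The idea of an independent estimation per segment can be sketched in Python as
follows. Everything here is hypothetical: run_segmented and toy_caller are
made-up names, the toy caller is a simple threshold that only stands in for the
HMM estimation, and marker indices are assumed to be 0-based and inclusive,
which may not match the INIT_MARKER and END_MARKER convention of the structure
file.

```python
def toy_caller(segment):
    """Stand-in for an HMM estimation: label each value by a fixed threshold."""
    labels = []
    for value in segment:
        if value < -0.5:
            labels.append("deletion")
        elif value > 0.5:
            labels.append("amplification")
        else:
            labels.append("identity")
    return labels


def run_segmented(values, structure, call_segment):
    """Call each segment independently; markers outside segments get None."""
    calls = [None] * len(values)
    for init_marker, end_marker in structure:
        # Each segment is processed on its own, without its neighbours
        segment = values[init_marker:end_marker + 1]
        calls[init_marker:end_marker + 1] = call_segment(segment)
    return calls


values = [0.0, -1.0, -0.9, 0.1, 0.0, 2.0, 0.8, 0.9, 0.0, -1.0]
structure = [(1, 3), (6, 8)]  # made-up (init_marker, end_marker) pairs

calls = run_segmented(values, structure, toy_caller)
print(calls)
```

Markers that do not belong to any segment keep the value None, which mirrors
the fact that such markers are excluded from structure associated processes.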
* Go back to the initial view using both Reset Buttons in the Navigation Panel.
We encode the current calls into the Direct and Swap Tracks. Recall that we
performed our HMM estimation with emission SD=4 for all states in the direct
experiment and emission SD=3 in the swap experiment for all states too. Doing
this we will be able to compare differences between the global calling and the
segmented calling.
* Menu: Report / Direct Calls / To Track
* Menu: Report / Swap Calls / To Track
* Activate segmented mode
Menu: Process / Segmented
We execute again the CNV estimation for the direct and swap experiments.
* Press the Run Button for the direct experiment.
* Press the Run Button for the swap experiment.
The segmented calling generated has clear differences with the previous global
one (which, remember, is recorded in the Direct and Swap Tracks). Those
differences can fall into different categories as can be seen in the following
examples:
Coordinates of the regions shown are added so that readers can examine the zone
by themselves. In this case, there were no calls at all with the global estimation
(no track segments).
Among the new calls that appeared with the segmented estimation, some
(those marked as "1") present high consistency between direct and swap and so
hint at a real CNV there. They are interrupted by the region marked as "2", where
there are no calls because it doesn't belong to any segment (no horizontal orange
line in it) and so its markers have not been used for HMM estimation.
The new call marked as "3" doesn't present consistency between experiments and
so is probably an artifact.
In general, segmenting the HMM calling makes the method more sensitive because
it removes the influence that the preceding and following markers would have in
a global calling. This influence has to be considered an artifact because the
markers outside the segment are generally not physically adjacent and can be
located very far away in the genome.
In this case we see that the structured calling has created a call fragmentation in
previously continuous calls.
Here, in the first call, we have improved the consistency between experiments
because now we have, in the direct experiment, the deletion call complementary
to the call already present in the swap experiment. In the swap experiment,
however, a new inconsistent call has appeared.
Finally, very often, both methods are totally equivalent.
Reporting
We can create reports with the calls obtained in our CNV estimation in the direct
and swap experiments.
* Create a report of the direct experiment calls into a file called: "direct.rpt"
Menu: Report / Direct Calls / To File
The file "direct.rpt" will be created with the following content:
CHR START END VALUE INIT_MARKER END_MARKER
1 703450 721000 1 480 490
1 761376 795876 1 503 571
1 2611463 2666063 -1 2009 2095
1 12813409 12899160 1 3396 3544
1 12999788 13014845 1 3724 3756
1 13071269 13142374 1 3757 3887
1 13147574 13180974 1 3888 3942
1 13352469 13401820 1 4139 4218
1 13408570 13457420 1 4219 4309
1 16639408 16659972 1 4645 4677
...
This file can be uploaded, if we want, into any of the CHAVA Tracks to be
used as graphical information.
A swap experiment calls report can be generated in an analogous way.
Statistics shown in the Direct and Swap Information Panels can also be written
into a file.
* Generate a statistics report called "stats.std"
Menu: Report / Statistics
A file will be created with the following content:
Direct
                   Calls              Kbs
                Total  Match     Total  Match
-                  66     51      4906   4400
+                  86     69      6604   5346
All               152    120     11510   9747
Dye Coherence     0.7894737      0.846788
Swap
                   Calls              Kbs
                Total  Match     Total  Match
-                 115     73      7300   5346
+                 110     51      8706   4400
All               225    124     16007   9747
Dye Coherence     0.5511111      0.6089204
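Judging from the numbers in this example, the "Dye Coherence" figures appear to
be the Match/Total ratios of the "All" rows. This is an inference from the
sample output, not a documented definition. A small Python check, using only
the call counts printed above (the function name coherence is hypothetical):

```python
def coherence(matched, total):
    """Fraction of matched items; NaN when there are no items at all."""
    return matched / total if total else float("nan")


# "All" rows of the Calls columns in the report above
direct_call_coherence = coherence(120, 152)
swap_call_coherence = coherence(124, 225)

print(round(direct_call_coherence, 7), round(swap_call_coherence, 7))
```

The Kbs coherence values agree with the same ratio only approximately,
presumably because the kilobase totals in the report are rounded.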
Finally, we can always save the current image in CHAVA into a PNG file using
the "Save png" option of the right mouse button.
Command Line Options
CHAVA offers the possibility of working in command line mode, so the
execution of CNV estimations can be scripted and automated.
Using the same files from the graphical tutorial above, we can execute the
following:
java -jar cnv.jar -command -direct px22_dir.data -hmmDirect sd3.hmc -swap px22_sw.data -hmmSwap sd3.hmc -structure structure.str -out px22
Where the parameters mean:
-command            CHAVA is executed in command line mode and no
                    graphical window is opened.
-direct <file>      The Intensity File with direct experiment data.
-swap <file>        The Intensity File with swap experiment data.
-hmmDirect <file>   HMM configuration file that will be used in the
                    direct experiment calling.
-hmmSwap <file>     HMM configuration file that will be used in the
                    swap experiment calling.
-structure <file>   Structure File to be used in direct and swap calling.
-out <string>       Prefix of all output files generated during the
                    execution of the program.
Using the former command, a CNV calling for the specified files will be
performed using their respective HMM definitions under the defined structure.
The following files will be generated:
px22_direct.report List of calls generated in the direct experiment
calling. Equivalent to the option:
Menu: Report / Direct Calls / To File
in the visual version of the program.
px22_swap.report List of calls generated in the swap experiment
calling. Equivalent to the option:
Menu: Report / Swap Calls / To File
in the visual version of the program.
px22.stats Contains the calling statistics resulting from the
comparison of both experiments. Equivalent to the
option:
Menu: Report / Statistics
in the visual version of the program.
If the "-structure" parameter is not defined, the calling will be performed
considering all markers as a single segment of HMM emissions.
If the parameter "-out" is not defined, the files that CHAVA creates will have the
string "CHAVA" as prefix.
If only one dye is defined ("-direct" + "-hmmDirect" or alternatively "-swap" +
"-hmmSwap"), the calling will be performed only for that dye and no statistics
file will be produced.
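As a sketch of how this mode could be scripted, the Python function below
builds the command line for one sample using only the parameters listed above.
The function name build_chava_command is hypothetical, and the file naming
pattern (<sample>_dir.data and <sample>_sw.data) is an assumption taken from
the px22 example; other data sets may be named differently.

```python
def build_chava_command(sample, hmm_direct="sd3.hmc", hmm_swap="sd3.hmc",
                        structure=None, jar="cnv.jar"):
    """Build the argument list for a CHAVA command line run of one sample."""
    command = [
        "java", "-jar", jar, "-command",
        "-direct", f"{sample}_dir.data", "-hmmDirect", hmm_direct,
        "-swap", f"{sample}_sw.data", "-hmmSwap", hmm_swap,
    ]
    if structure:
        # Without -structure all markers are treated as a single segment
        command += ["-structure", structure]
    command += ["-out", sample]
    return command


command = build_chava_command("px22", structure="structure.str")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from
the directory that contains cnv.jar and the data files, for example inside a
loop over several sample names.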
Bibliography
Gazave, E. et al. (2011). "Copy number variation analysis in the great apes
reveals species-specific patterns of structural variation." Genome Res.
21(10):1626-39.