CHAVA

What is CHAVA?

CHAVA (CNV HMM Analysis Visual Application) is a visual tool written in Java
that helps in the calling of Copy Number Variations (CNVs) from data generated
in array-based Comparative Genomic Hybridization (aCGH) experiments.

In aCGH experiments, DNA from two individuals (we will call them "target" and
"reference" throughout this tutorial) is labeled with two different dyes and
mixed. This DNA mix is hybridized to an array of probes with known genomic
positions and, for each probe, the difference in intensity between the two dyes
is measured. In zones where both individuals have the same number of DNA
copies, we expect the intensities to be similar. In zones where the two
individuals differ in copy number (for example, one has 2 copies, 1 on each
chromosome, while the other has a full deletion), there will be differences in
intensity.

As a quality control measure, the experiment is often repeated swapping the dyes

used in each individual (throughout this tutorial we will refer to these two

experiments as "direct" and "swap"). Results from both experiments should be

equivalent. Given that we always measure the intensity of the same dye

compared to the other, regions with high values in one experiment should

correspond to regions with low values in the other and vice versa.
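
The expected sign inversion between the direct and swap experiments can be checked numerically. The following is a minimal sketch in Python, not part of CHAVA: the function name and the example values are hypothetical. It computes the Pearson correlation between the per-probe values of both experiments; for a consistent dye swap we would expect a clearly negative value.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Illustrative values only (not taken from the tutorial data):
# three probes inside a deletion followed by three unaffected probes.
direct = [-0.9, -1.1, -1.0, 0.1, 0.0, -0.1]
swap = [1.0, 0.9, 1.1, 0.0, -0.1, 0.1]
r = pearson(direct, swap)
print(f"direct/swap correlation: {r:.3f}")
```

A correlation near zero or positive over a region with apparent calls would suggest noise or an experimental artifact rather than a real CNV.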

The data resulting from each of these experiments consist of a set of values (a
function of the differences in intensity), one for every probe, together with
the genomic position of the probe.

Here we have an example of these data in the intensity file format that CHAVA

uses:

1 8000 -0.333 -0.336

1 8500 -0.175 -0.178

1 9000 0.191 0.188

1 9350 -0.3 -0.303

1 9900 -0.077 -0.08

1 10250 -0.448 -0.451

1 10800 0.32 0.317

1 11250 -0.224 -0.227

1 11650 -0.153 -0.156

Where every column corresponds, from left to right, to:

Chromosome

Position

Raw intensity difference

Intensity difference
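
As an illustration of this format, here is a minimal Python sketch that reads such records. It assumes that the columns are separated by whitespace (the actual delimiter may be a tab) and that there is no header row; the `Probe` type and the function name are hypothetical and not part of CHAVA.

```python
from typing import List, NamedTuple

class Probe(NamedTuple):
    chrom: str        # chromosome, kept as text
    position: int     # genomic position of the probe
    raw_diff: float   # raw intensity difference
    diff: float       # intensity difference

def parse_intensity_lines(lines) -> List[Probe]:
    """Parse whitespace-separated intensity records, skipping blank lines."""
    probes = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        chrom, pos, raw, diff = fields
        probes.append(Probe(chrom, int(pos), float(raw), float(diff)))
    return probes

sample = """\
1 8000 -0.333 -0.336
1 8500 -0.175 -0.178
1 9000 0.191 0.188
"""
probes = parse_intensity_lines(sample.splitlines())
print(len(probes), probes[0])
```

To read a real file, the same function can be given an open file object, for example `parse_intensity_lines(open("Data/px22_dir.data"))`.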

Markers used in a CGH experiment can be distributed in a regular way along the

genome or they can be clustered around certain zones which are considered

interesting and candidates to present CNVs. There are common commercial

arrays, for example the oligonucleotide NimbleGen array formed by 385k probes

and employed to generate the data used in this tutorial, which have this kind of

clustered structure.

Once these experiments have generated the data on intensity differences, a

statistical analysis is required to decide which regions show evidence of CNV

presence. This decision process is known as "calling". There are diverse

procedures and available software packages to help us with the calling. An

approach using Hidden Markov Models (HMM) is frequently used.

An HMM assumes that a string of data has been generated by an automaton that
advances step by step. At each step the automaton is in one state from a finite
list and generates the data for that step with a probability that depends on
the current state. In the HMM literature this function is referred to as the
"emission probabilities". When advancing a step, the automaton can also change
from its current state to a new one, again with a probability that depends on
the current state (the "transition probabilities").

Given a string of data and the emission and transition probabilities, we can
estimate the string of states most likely to have generated the data. In fact,
HMMs are a model of choice in many problem-solving situations because the
Viterbi algorithm finds the exact maximum-likelihood string of states in a
short computational time.
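
To make the idea concrete, the following is a minimal sketch of the Viterbi algorithm in Python, working in log space. It is not CHAVA's implementation: the three states, the normal emissions and every numeric parameter below are assumptions chosen only for illustration.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for obs (log-space Viterbi)."""
    if not obs:
        return []
    # best[s] = log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit(s, obs[0]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in obs[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit(s, x)
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Illustrative parameters (assumptions, not CHAVA's actual defaults)
states = ["deletion", "identity", "amplification"]
means = {"deletion": -1.0, "identity": 0.0, "amplification": 1.0}
sd = 0.3
stay, move = math.log(0.98), math.log(0.01)
log_trans = {p: {s: (stay if p == s else move) for s in states} for p in states}
log_start = {s: math.log(1 / 3) for s in states}
emit = lambda s, x: log_normal_pdf(x, means[s], sd)

values = [0.1, -0.1, 0.0, -1.1, -0.9, -1.0, -1.2, 0.05, -0.05]
path = viterbi(values, states, log_start, log_trans, emit)
print(path)
```

The running time grows linearly with the number of observations and quadratically with the number of states, which is why an exact solution stays cheap even for hundreds of thousands of probes.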

In the case of CNV calling, the string of data is the set of relative intensity
values resulting from the aCGH experiment, and the string of states that we
have to estimate is a value of "deletion"/"identity"/"amplification" for each
marker. An HMM applies directly and efficiently here because of the linear
nature of the data we are working with (markers lying consecutively along the
genome). Using an HMM can be trickier, however, when markers are clustered in
certain zones or when experiments are noisy because of poor DNA quality or
other reasons. In these situations, determining the right HMM parameters
(emission and transition probabilities) becomes a complex and cumbersome task.

CHAVA intends to help in these situations by providing a graphical tool that
allows visual navigation of CGH results, estimated calls and, optionally,
further added information. With these tools and the help of consistency
statistics between experiments, users can rely on their visual intuition to
find the HMM parameters that give a satisfactory calling.

Downloading and Starting CHAVA

After unzipping the compressed file that we can download from the web

(http://bioevo.upf.edu/~cmorcillo/tools.htm), we should find the following

content:

Documents

Icons

cnv.bat

cnv.jar

Data

lib

The "Data" directory includes files with aCGH intensity ratio data from a study
comparing CNV regions in primates (Gazave, E. et al. 2011). The measurements in
the example files correspond to DNA from two chimpanzees hybridized against a
customized oligonucleotide NimbleGen array based on the human genome. In the
data we will use as examples, both case and reference samples are chimpanzees,
but the array is made of human probes. Hybridizing across species can cause
additional problems that result in worse consistency between experiments and
less reliable calls.

CHAVA is a Java application that should run on every platform. To help the
graphical processes, the program is started with 1.4 GB of RAM. On computers
with little memory the application may run slowly. 2 GB of RAM and a screen
resolution of 1280x600 or more are needed to run comfortably.

* We start the program

java -jar -Xmx1400m cnv.jar

Alternatively, in Windows environments we can simply click the cnv.bat file.

Uploading intensity data

We are now going to upload the intensity differences files for the direct and

swap experiments.

* Upload "direct file":

Menu File / Open direct File

Select file Data/px22_dir.data

We obtain:

Every vertical segment corresponds to a probe, and its length represents the
log2 intensity ratio between the case and reference samples. Blue lines mark
the scale for intensity difference values (-2, -1, 0, +1 and +2 standard
deviations).

The experiment we just uploaded contains around 350,000 probes. The current
window shows the first 10,000 probes. As we will see later, the program allows
navigation across the whole experiment at different levels of resolution.

It is expected that in regions without differences in CNVs between the target and

reference samples, the values of markers will drift around zero showing only

statistical noise. However, in regions with CNV differences, we expect a

consistent bias towards one direction along the whole fragment. In our data,

there is a region that clearly shows copy differences between target and

reference samples (marked in blue):

That can be interpreted as a lack of a certain number of copies in the target

sample that are present in the reference sample.

If we hover with our mouse over this region, we obtain its coordinates in the

Position Info Panel (marked in red in the previous image).

The figure above tells us that we are hovering exactly over marker 2,089,
located on chromosome 1 at position 2,662,663.

If the zones showing intensity differences are real CNVs and not experimental
artifacts, we expect the direct and swap experiments to agree. To check this,
let's upload the data from the swap experiment:

* Upload "swap file":

Menu File / Open swap File

Select file Data/px22_sw.data

In the fragment that we just selected, we can see that the values of the swap

experiment fit perfectly with what we would expect: a deviation of similar

length and intensity but with inverted sign because dyes were swapped between

samples.

HMM Calling

However, not all fragments with CNVs are going to be as conspicuous as the one

we just saw and we need, moreover, a statistical criterion to distinguish the

candidate regions. For example, there is another region in this same window,

which presents smaller intensity differences but, since it is much longer, may be

also indicative of the presence of CNVs:

CHAVA, as explained in the introduction, uses a Hidden Markov Model to

decide the CNV calling.

Our first step will be to call CNVs using the program's default HMM
configuration values without considering whether they are suitable or not. (In
fact they are not, and we are going to get a defective calling that we will
improve throughout the tutorial.)

* For the direct experiment, press the RUN button at the bottom left, next to
the Direct Information Panel (the panel looks quite empty at the moment, but
don't worry about this now).

We obtain:

After running, we can see, marked in red and green, the regions that the HMM
estimation has considered candidates for deletions and amplifications in the
target sample relative to the reference sample.

At this point, a set of statistics summarizing the results of the calling just

performed are written into the Direct Information Panel:

They inform us that the number of calls (defined as continuous segments with

the same status) of type deletion (-) and amplification (+) estimated along the

experiment are 377 and 274, respectively. They also show us the amount of

Kbases included in those calls. Recall that these numbers correspond to the

whole experiment and not only to the 10,000 first markers that are shown in the

window at this moment.
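
The definition of a call as a continuous segment with the same status can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, marker states are encoded as -1, 0 and +1, and the length of a call is taken as the distance between its first and last marker, which may differ from how CHAVA counts Kbases.

```python
def states_to_calls(chroms, positions, states):
    """Collapse per-marker states (-1, 0, +1) into calls.

    A call is a maximal run of consecutive markers on the same chromosome
    with the same non-zero state. Returns a list of tuples
    (chrom, start, end, value, first_marker, last_marker) with 0-based
    marker indices.
    """
    calls = []
    start = None  # index of the first marker of the open run, if any
    n = len(states)
    for i in range(n + 1):
        run_ends = (
            start is not None
            and (i == n or states[i] != states[start] or chroms[i] != chroms[start])
        )
        if run_ends:
            calls.append((chroms[start], positions[start], positions[i - 1],
                          states[start], start, i - 1))
            start = None
        if i < n and start is None and states[i] != 0:
            start = i
    return calls

def summarize(calls):
    """Count calls and covered kilobases per sign."""
    summary = {}
    for _, begin, end, value, _, _ in calls:
        count, kb = summary.get(value, (0, 0.0))
        summary[value] = (count + 1, kb + (end - begin) / 1000.0)
    return summary

# Illustrative data only
chroms = ["1"] * 8
positions = [8000, 8500, 9000, 9350, 9900, 10250, 10800, 11250]
states = [0, -1, -1, -1, 0, 1, 1, 0]
calls = states_to_calls(chroms, positions, states)
summary = summarize(calls)
print(calls)
print(summary)
```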

We also run the HMM estimation for the swap experiment.

* For the swap experiment, press the RUN button at the bottom right, next to
the Swap Information Panel.

We obtain:

We can see that, just as in the direct case, the colored segments and the statistics

corresponding to the swap experiment have appeared in the window.

Since we have now performed the HMM estimation for both experiments, an

additional set of information appears that compares them and gives us a glimpse

of their consistency. For example, if we look at the Swap Information Panel,

we see:

All these figures can be interpreted as follows:

521 calls of the deletion kind (-) have been estimated, 146 of which coincide
with the direct experiment. A call is considered coincident if, somewhere in
its span, there is a call of opposite sign in the complementary experiment.

These 521 calls cover 54.3 Kbases, and in 6.6 of these Kbases there is call
coincidence with the complementary experiment.

Finally, we have global figures for all kinds of calls, which tell us that the
coincidence between both experiments is 34% taking calls as the unit of
comparison and 23% if the coincidence is measured as DNA length.
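
The coincidence criterion described above can be written down as a short Python sketch. The function names and the example calls are hypothetical, and the way overlapping lengths are added up is an assumption (it presumes that calls of the complementary experiment do not overlap each other), so the numbers CHAVA reports may be computed differently.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def coincidence(calls, other_calls):
    """Compare calls of one experiment against the complementary one.

    Each call is a tuple (chrom, start, end, value) with value -1 or +1.
    Returns (matching_calls, total_calls, matching_bases, total_bases).
    """
    matching_calls = matching_bases = total_bases = 0
    for chrom, start, end, value in calls:
        total_bases += end - start
        shared = 0
        matched = False
        for o_chrom, o_start, o_end, o_value in other_calls:
            # An opposite sign is expected because the dyes are swapped
            if o_chrom != chrom or o_value != -value:
                continue
            if o_start <= end and start <= o_end:
                matched = True
                shared += overlap(start, end, o_start, o_end)
        matching_calls += matched
        matching_bases += shared
    return matching_calls, len(calls), matching_bases, total_bases

# Illustrative calls only
swap_calls = [("1", 1000, 3000, -1), ("1", 5000, 6000, -1), ("1", 8000, 9000, 1)]
direct_calls = [("1", 1500, 2500, 1), ("1", 8200, 9500, 1)]
result = coincidence(swap_calls, direct_calls)
print(result)
```

In this example only the first swap call has a direct call of opposite sign inside its span; the third one overlaps a direct call, but with the same sign, so it does not count as coincident.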

Clearly, the CNV calling done up to this point is quite unsatisfactory, both
numerically, because the statistics are poor, and visually, where we can see
that the calling didn't work. We have lots of very short calls which, moreover,
show little consistency between experiments. For example:

The images above are clearly full of false positives coming from statistical
noise. There are also some regions where there seems to be a clear CNV that has
been fragmented by the excessive sensitivity of the calling as performed up to
now:

So the next step is working with the HMM configuration options to try to

improve the quality of our calling.

HMM Configuration

The Hidden Markov Model used in CHAVA for CNV estimation assumes the existence
of 3 states (identity, deletion, amplification). Each state has its associated
transition probabilities to the other states and an emission probability for a
continuous value (the log2 of the ratio of intensities) that follows a normal
distribution defined by its mean and standard deviation.

Direct and swap experiments are processed using two independent models that

can have a different set of parameters. This makes biological sense, since we are

studying two different experiments, which can vary considerably in their

conditions of execution.

We can consult and modify the parameters of our models by pressing the Config
Button placed next to the direct and swap Information Panels.

* Press the Config Button of the direct experiment:

A window appears with the values of the parameters of the direct model:

Basically, we can see that the emissions for each state are centered at -1, 0
and 1 with a standard deviation of 1. We can also see that the transition
probabilities favor rare state changes and prevent direct transitions between
deletion and amplification without passing through the identity state.

Using this window we can manually modify whatever parameter we want and

we can also save its content to a file or upload configurations previously saved

into files of the type *.hmc (hmm configuration).

In this case, we are going to upload now a file called "sd3.hmc".

* HMM configuration window MENU: File - Open

Open the file "Data/sd3.hmc":

HMM configuration should now look as follows

where the only change from the previous configuration is the SD of all
emissions, which is now 3.0 instead of the former 1.0.

Press the OK Button and the HMM configuration will be updated.

We perform the same operation with the HMM configuration of the swap
experiment until, after uploading the same file ("sd3.hmc") or modifying the
fields manually, we obtain the same configuration (SD of 3.0 for all
emissions), and we press the OK Button.

Now we can execute again the calling for both experiments:

* Press Run Button of direct experiment:

* Press Run Button of swap experiment:

and we get a calling that looks much cleaner and is closer to what we
expected.
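
One way to see why a larger emission SD gives fewer and longer calls is to look at the evidence that a single marker contributes. With normal emissions of equal SD, the log-likelihood ratio between "deletion" (mean -1) and "identity" (mean 0) for a value x is ((x - 0)^2 - (x + 1)^2) / (2 SD^2), so it shrinks with the square of the SD. The Python sketch below counts how many consecutive markers of value -1 are needed to outweigh a fixed transition penalty. The penalty used here is a hypothetical value, not CHAVA's, and the calculation is a simplification of what the full HMM does.

```python
import math

def log_ratio(x, mean_a, mean_b, sd):
    """log N(x; mean_a, sd) - log N(x; mean_b, sd) for a shared sd."""
    return ((x - mean_b) ** 2 - (x - mean_a) ** 2) / (2 * sd * sd)

def markers_needed(x, sd, penalty):
    """Consecutive markers of value x needed for 'deletion' (mean -1) to
    outweigh 'identity' (mean 0), given a total transition penalty."""
    per_marker = log_ratio(x, -1.0, 0.0, sd)
    if per_marker <= 0:
        return None  # this value never favours deletion
    return math.floor(penalty / per_marker) + 1

# Hypothetical cost of entering and leaving a state (not CHAVA's value)
penalty = 2 * (math.log(0.98) - math.log(0.01))
needed = {sd: markers_needed(-1.0, sd, penalty) for sd in (1.0, 3.0, 4.0)}
for sd, count in needed.items():
    print(f"SD={sd}: at least {count} markers")
```

Under these assumptions, short noisy runs that were enough to trigger a call with SD=1 no longer pay for the two state changes with SD=3, which is consistent with the cleaner calling we just obtained.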

Before going on with the different options of data analysis that CHAVA

provides, let's have a look at the navigation capabilities of the program to browse

all the data in the experiments.

Study Navigation

The initial view we have been working with until now presents the first 10,000

markers of both experiments. We can navigate along our study using the

Navigation Arrows.

To the Beginning / Back / Forward / To the end

If we keep pressing the Forward Arrow, we will have a quick landscape view

of the quality of the calling performed a moment ago. Notice what happens with

a few consecutive clicks:

(A red line appears when markers change chromosome)

etc...

Key combinations alt + right arrow and alt + left arrow are equivalent to

Forward and Back Buttons.

In the upper left corner we can find the Navigation Panel, where the first
marker and the total number of markers in the view can be set.

If we enter new values (with a maximum of 10,000 for the number of markers)
and press "enter", the image will adapt to the new settings.

For example

* Using the Navigation Panel we change our view so that first marker shown

will be 75,000:

(after entering your numbers into the First Marker Box, remember to press
enter and resist the temptation to press the Reset Button)

Whenever we want to, we can zoom in with the mouse just by clicking on the

zone of interest.

* Click with the mouse on the red segment of the direct experiment:

We can go on zooming in until we reach the desired resolution level.

If we click again:

To zoom out one level we can use the "zoom out" option of the right mouse
button.

The two other options available from the right mouse button allow us to center
the image on the clicked point and to save the current image to a PNG file.

If we want to get back directly to the bigger view (10,000 markers) instead of

zooming out step by step, we can press the Number of Markers Reset Button

in the Navigation Panel.

Tracks

The data generated by the experiment and the current calling visualization can

be combined with further information from other sources. In our case, we have

information about a previous CNV calling performed using a different algorithm

on these same data.

We have this information codified in 2 files: "direct_old_call.trk" and

"swap_old_call.trk" for the direct and swap experiments respectively.

These files describe a set of segments providing their genomic positions together

with the calling value.

CHR START END VALUE

1 703450 753950 0.2816

1 761376 795876 0.2193

1 822700 864000 -0.2277

1 2617163 2666063 -1.0053

1 12973688 12999288 -0.2548

1 12999788 13013288 0.5834

CHAVA will interpret the segments with values < 0 as deletions and those with
values > 0 as amplifications.
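
The sign rule can be illustrated with a small Python sketch that reads a track in this layout. It assumes whitespace-separated columns and a header row as shown above; the function name is hypothetical, and the handling of a value of exactly 0, which the tutorial does not describe, is an arbitrary choice.

```python
def parse_track(lines):
    """Parse a track with a 'CHR START END VALUE' header row.

    Returns a list of dicts with the call type derived from the sign of
    VALUE: negative -> 'deletion', positive -> 'amplification'.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    col = {name: i for i, name in enumerate(header)}
    segments = []
    for fields in records:
        value = float(fields[col["VALUE"]])
        if value == 0:
            continue  # behaviour for exactly 0 is not described; skipped here
        segments.append({
            "chrom": fields[col["CHR"]],
            "start": int(fields[col["START"]]),
            "end": int(fields[col["END"]]),
            "kind": "deletion" if value < 0 else "amplification",
        })
    return segments

track_text = """\
CHR START END VALUE
1 703450 753950 0.2816
1 822700 864000 -0.2277
1 2617163 2666063 -1.0053
"""
segments = parse_track(track_text.splitlines())
for seg in segments:
    print(seg)
```

Looking columns up by header name rather than by position keeps the sketch independent of the column order.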

Let's upload these files as visual tracks into our graph:

* Go back to the original view using the First Marker and Number of

Markers Reset Buttons.

* Upload as a Direct Track the file with the previous direct experiment calling

information.

Menu File/ SetTracks / Direct

Select the file "direct_old_call.trk".

Press OK

The Direct Track appears over our direct data plot:

where the deletion calls appear as red segments and the amplifications as green

ones.

* Repeat the operation for the Swap Track with the file "swap_old_call.trk"
using the option:

Menu File/ SetTracks / Swap

and we finally obtain:

Now we can navigate throughout the whole experiment using the movement and

zoom tools to compare the results of our current calling to those obtained with

the former technique. For instance:

The current calls of our experiments can easily be converted into the Direct
and Swap Tracks. Let's do it:

* Menu: Report / Direct Calls / To Track

* Menu: Report / Swap Calls / To Track

We can now see that the Direct and Swap Tracks coincide with our current HMM
estimated calls:

Go back to the initial view

All the previous options will be useful in the process of refining HMM

parameters.

As an example, we are going to modify the direct HMM configuration, run the

HMM estimation again and compare the newly obtained calls with the old

configuration ones which we recorded a moment ago as a track:

* Modify the direct experiment HMM configuration. Set SD=4 for all 3 states.

* Execute HMM estimation for the direct experiment with the new

configuration:

and that is what we obtain:

In the zone marked in red we can see how the new HMM estimation with SD=4 has
created a single continuous segment where the old HMM estimation with SD=3
created two different segments:

Besides the Direct and Swap Tracks associated with the experiments, there are
two other tracks, called Consensus I and Consensus II. They were originally
conceived to add information about the final global call, but they can also be
used to add any data that we consider relevant to our study.

For example, the file "Data/genes.trk" contains a track with the list of human
genes, downloaded from the Homo sapiens gene database of Ensembl (GRCh37.p2).
When the track was built, a "LABEL" column with the HGNC (HUGO) name of each
gene was added:

START END CHR LABEL

33772367 33786699 1 A3GALT2

12776118 12788726 1 AADACL3

12704566 12727097 1 AADACL4

94458393 94586688 1 ABCA4

229652329 229694442 1 ABCB10

94883933 94984222 1 ABCD3

179068462 179198819 1 ABL2

76190036 76229364 1 ACADM

...

* Upload the file "Data/genes.trk" as a Consensus I Track

Menu: File/Set Tracks/Consensus I

We get the new track in blue in the center of the image, marking the genes, so

that it can be used as a reference when analyzing our CNV calls. If we hover

with the mouse over a track segment, the corresponding gene label will appear.

To make data visualization clearer, we can select at any given moment which

ones among the uploaded tracks are going to be shown, using the option:

Menu: View/Tracks:

Structure

In many cases, markers employed in arrays designed for aCGH are not

distributed in a uniform way along the genome but are clustered in certain

regions of interest. This raises a number of issues in the process of visualization

and analysis of data.

For instance, the NimbleGen array used to generate the data of this tutorial
has probes clustered in regions that showed evidence of harboring CNVs in
previous experiments using a range of different technologies (Gazave, E. et
al. 2011).

If we slowly run the mouse over the markers of the plot and look at their genome

positions shown in the Position Info Panel, we can observe that sometimes

there are sudden changes that indicate regions without probe coverage. Of

course, if we enter a track with segments located in zones with no markers, these

segments will not be shown in CHAVA's panel. We can tell the program to

inform us when gaps beyond a given size are detected.

* Visualize gaps between markers with the option:

Menu: View/By distance

Now, orange lines appear between those markers that are separated by a distance
larger than the maximum allowed gap (10 Kb by default).
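
The gap criterion can be sketched in Python as follows. The function name is hypothetical, 1 Kb is taken as 1,000 base pairs, and the sketch simply ignores the boundary between chromosomes, which may not match what CHAVA does there.

```python
def find_gaps(chroms, positions, max_gap=10_000):
    """Return the indices i where the distance between marker i and
    marker i + 1 (on the same chromosome) exceeds max_gap base pairs."""
    gaps = []
    for i in range(len(positions) - 1):
        same_chrom = chroms[i] == chroms[i + 1]
        if same_chrom and positions[i + 1] - positions[i] > max_gap:
            gaps.append(i)
    return gaps

# Illustrative positions only
chroms = ["1"] * 6
positions = [8000, 8500, 9000, 25000, 25400, 140000]
default_gaps = find_gaps(chroms, positions)         # 10 Kb threshold
wide_gaps = find_gaps(chroms, positions, 100_000)   # 100 Kb threshold
print(default_gaps, wide_gaps)
```

Raising the threshold can only remove gap marks, never add them, because every gap wider than the larger threshold is also wider than the smaller one.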

We can, of course, change that default gap size.

* Set the maximum allowed gap to 100Kb

menu "View/Set Max Gap"

now we have, logically, fewer gap marks.

Since we don't need it now, we can hide the Consensus I Track to simplify the

current view:

* Hide Consensus I Track

Menu: View/Tracks/Consensus I

The visualization of probe gaps gives us information that can be useful in
certain circumstances. What is really interesting, however, is being able to
use CHAVA with any kind of array.

For this purpose CHAVA uses a structure: a description of the distribution of
probes along the genome that groups the markers into segments that can be
treated independently. To be of use, the structure has to be consistent with
the clusters of probes in our experiment.

A structure can be, simply, a definition of a few segments we want to study in

our experiment. The file "structure_5_segments.str" contains a dummy definition

of 5 segments just to show the concept and see how it works:

CHR START END INIT_MARKER END_MARKER

1 476579 496779 299 337

1 12741839 12806959 3266 3396

1 13621050 13649400 4507 4559

1 63134919 63313697 5769 6145

1 70018759 70060225 6146 6230

* Upload the structure defined in the file "structure_5_segments.str"

Menu: File / Open Structure File

* Change to view mode "By Structure"

Menu: View/By Structure

And what we obtain is this:

Looking closer:

There are now vertical orange lines, which indicate the limits of the segments
that we have defined, plus thick horizontal lines (indicated by the blue
arrows), which show the length of every segment and allow us to distinguish
which zones are inside or outside every segment.

When processes associated with the structure of the array are executed, they
apply only to those markers included in the defined segments.

The file "Data/structure.str", instead of having only 5 little segments defined, as

was the case of the previous structure, presents a clustering of almost all probes

of our data into segments.

* Upload the file structure "Data/structure.str"

Menu: File / Open Structure File

and we have:

Since the current structure covers almost all probes, the horizontal orange line,

that shows the segment length, seems to be continuous. There are, however, little

gaps that can be located:

* Using the Navigation Panel, let's go to position 312 with a 300-probe view

we obtain:

The regions marked by the red circles correspond to probes between close

segments that do not belong themselves to any segment. These zones will be

excluded when performing structure associated processes.

We are going to run now again the HMM estimation but, instead of taking the

list of probes as a continuous emission, there is going to be an independent

estimation for each of the segments defined in the structure.
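
The idea of an independent estimation per segment can be sketched in Python as follows. All names are hypothetical, the marker indices of the structure are treated as 1-based and inclusive (an assumption about the structure file), and a simple threshold stands in for the HMM so that the example stays short.

```python
def split_by_structure(values, segments):
    """Slice a list of per-marker values into independent chunks.

    `segments` is a list of (init_marker, end_marker) pairs, treated here
    as 1-based and inclusive. Markers outside every segment are left out.
    """
    chunks = []
    for init_marker, end_marker in segments:
        if init_marker < 1 or end_marker < init_marker:
            raise ValueError(f"bad segment: {init_marker}-{end_marker}")
        chunks.append(values[init_marker - 1:end_marker])
    return chunks

def call_segmented(values, segments, caller):
    """Run `caller` separately on each segment; unassigned markers get None."""
    states = [None] * len(values)
    chunks = split_by_structure(values, segments)
    for (init_marker, _), chunk in zip(segments, chunks):
        for offset, state in enumerate(caller(chunk)):
            states[init_marker - 1 + offset] = state
    return states

def threshold_caller(chunk):
    """Stand-in caller for the demonstration (a threshold, not an HMM)."""
    return [-1 if v < -0.5 else 1 if v > 0.5 else 0 for v in chunk]

# Illustrative data only
values = [0.1, -0.9, -1.0, 0.0, 0.8, 0.9, 0.1, -0.7]
segments = [(2, 4), (6, 8)]
states = call_segmented(values, segments, threshold_caller)
print(states)
```

Markers 1 and 5 belong to no segment, so they receive no state at all, in the same way that probes outside the structure are excluded from structure-associated processes.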

* Go back to the initial view using both Reset Buttons in the Navigation Panel.

We encode the current calls into the Direct and Swap Tracks. Recall that we
performed our HMM estimation with emission SD=4 for all states in the direct
experiment and emission SD=3 for all states in the swap experiment. This way
we will be able to compare the global calling with the segmented calling.

* Menu: Report / Direct Calls / To Track

* Menu: Report / Swap Calls / To Track

* Activate segmented mode

Menu: Process / Segmented

We execute again the CNV estimation for the direct and swap experiments.

* Press the Run Button for the direct experiment.

* Press the Run Button for the swap experiment.

The segmented calling generated has clear differences with the previous global

one (which, remember, is recorded in the Direct and Swap Tracks). Those

differences can fall into different categories as can be seen in the following

examples:

Coordinates of the regions shown are added so that readers can examine the zone

by themselves. In this case, there were no calls at all with the global estimation

(no track segments).

Among the new calls that appeared with the segmented estimation, some (those
marked as "1") are highly consistent between direct and swap and so hint at a
real CNV. They are interrupted by the region marked as "2", where there are no
calls because it doesn't belong to any segment (no horizontal orange line in
it), so its markers have not been used for HMM estimation.

The new call marked as "3" isn't consistent between experiments, so it is
probably an artifact.

In general, segmenting the HMM calling makes the method more sensitive because
it removes the influence that the preceding and following markers would have in
a global calling. This influence has to be considered an artifact because the
markers outside the segment are generally not physically adjacent and can be
located very far away in the genome.

In this case we see that the structured calling has created a call fragmentation in

previously continuous calls.

Here, in the first call, we have improved the consistency between experiments
because the direct experiment now has the deletion call complementary to the
call already present in the swap experiment. In the swap experiment, however,
a new inconsistent call has appeared.

Finally, very often, both methods are totally equivalent.

Reporting

We can create reports with the calls obtained in our CNV estimation in the direct

and swap experiments.

* Create a report of the direct experiment calls into a file called: "direct.rpt"

Menu: Report / Direct Calls / To File

The file "direct.rpt" will be created with the following content:

CHR START END VALUE INIT_MARKER END_MARKER

1 703450 721000 1 480 490

1 761376 795876 1 503 571

1 2611463 2666063 -1 2009 2095

1 12813409 12899160 1 3396 3544

1 12999788 13014845 1 3724 3756

1 13071269 13142374 1 3757 3887

1 13147574 13180974 1 3888 3942

1 13352469 13401820 1 4139 4218

1 13408570 13457420 1 4219 4309

1 16639408 16659972 1 4645 4677

...

This file can be uploaded, if we want, into any of the CHAVA Tracks to be

used as graphical information.

A swap experiment calls report can be generated in an analogous way.

Statistics shown in the Direct and Swap Information Panels can also be written

into a file.

* Generate a statistics report called "stats.std"

Menu: Report / Statistics

a file will be created with the following content:

Direct
                 Calls              Kbs
                 Total    Match     Total    Match
-                66       51        4906     4400
+                86       69        6604     5346
All              152      120       11510    9747
Dye Coherence    0.7894737          0.846788

Swap
                 Calls              Kbs
                 Total    Match     Total    Match
-                115      73        7300     5346
+                110      51        8706     4400
All              225      124       16007    9747
Dye Coherence    0.5511111          0.6089204
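
In this example, the "Dye Coherence" figures appear to be the ratio Match / Total of the "All" row, for calls and for Kbs respectively. This is an inference from the numbers shown and not a documented definition. The short Python sketch below, with a hypothetical function name, reproduces them.

```python
def dye_coherence(total, match):
    """Fraction of the total that matches the complementary experiment."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return match / total

# "All" rows from the statistics example above
direct_calls = dye_coherence(152, 120)
swap_calls = dye_coherence(225, 124)
direct_kb = dye_coherence(11510, 9747)
swap_kb = dye_coherence(16007, 9747)
print(direct_calls, swap_calls, direct_kb, swap_kb)
```

The ratios for calls agree with the file to the printed precision. The ratios for Kbs agree only to about three or four decimals, presumably because the Kbs columns in the file are rounded.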

Finally, we can always save the current image in CHAVA into a PNG file using

the "Save png" option of the mouse right button.

Command Line Options

CHAVA offers the possibility of working in command line mode, so the

execution of CNV estimations can be scripted and automated.

Using the same files from the graphical tutorial above, we can execute the

following:

java -jar cnv.jar -command -direct px22_dir.data -hmmDirect sd3.hmc -swap px22_sw.data -hmmSwap sd3.hmc -structure structure.str -out px22

Where the different parameters mean:

-command             CHAVA is executed in command line mode and no graphical
                     window is opened.

-direct <file>       The Intensity File with direct experiment data.

-swap <file>         The Intensity File with swap experiment data.

-hmmDirect <file>    HMM configuration file that will be used in the direct
                     experiment calling.

-hmmSwap <file>      HMM configuration file that will be used in the swap
                     experiment calling.

-structure <file>    Structure File to be used in direct and swap calling.

-out <string>        Prefix of all output files generated during the
                     execution of the program.

Using the former command, a CNV calling for the specified files will be

performed using their respective HMM definitions under the defined structure.

The following files will be generated:

px22_direct.report   List of calls generated in the direct experiment
                     calling. Equivalent to the option
                     Menu: Report / Direct Calls / To File
                     in the visual version of the program.

px22_swap.report     List of calls generated in the swap experiment
                     calling. Equivalent to the option
                     Menu: Report / Swap Calls / To File

px22.stats           Contains the calling statistics resulting from the
                     comparison of both experiments. Equivalent to the option
                     Menu: Report / Statistics


If the “-structure” parameter is not defined, the calling will be performed

considering all markers as a single segment of HMM emissions.

If the parameter "-out" is not defined, the files that CHAVA creates will have the

string "CHAVA" as prefix.

If only one dye is defined ("-direct" + "-hmmDirect" or alternatively "-swap" +

"-hmmSwap"), the calling will be performed only for that dye and no statistics

file will be produced.
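
As an example of how these options could be scripted, the following Python sketch builds the command shown above for a list of samples. The helper names are hypothetical, and the file naming pattern (`<prefix>_dir.data` and `<prefix>_sw.data`) is an assumption generalized from the px22 example. By default the script only prints the commands; setting `dry_run=False` would actually launch Java.

```python
import subprocess

def build_command(direct, swap, hmm_direct, hmm_swap, out, structure=None,
                  jar="cnv.jar"):
    """Build the argument list for one command-line CHAVA run."""
    cmd = ["java", "-jar", jar, "-command",
           "-direct", direct, "-hmmDirect", hmm_direct,
           "-swap", swap, "-hmmSwap", hmm_swap]
    if structure is not None:
        cmd += ["-structure", structure]
    cmd += ["-out", out]
    return cmd

def run_samples(samples, hmm="sd3.hmc", structure="structure.str", dry_run=True):
    """Run (or just print, when dry_run is True) one calling per sample prefix."""
    commands = []
    for prefix in samples:
        cmd = build_command(f"{prefix}_dir.data", f"{prefix}_sw.data",
                            hmm, hmm, prefix, structure)
        commands.append(cmd)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands

commands = run_samples(["px22"])
```

Passing the arguments as a list avoids shell quoting problems with file names, and `check=True` stops the batch as soon as one run fails.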

Bibliography

Gazave, E. et al. (2011). "Copy number variation analysis in the great apes reveals

species-specific patterns of structural variation." Genome Res. 2011

Oct;21(10):1626-39.