

Performance optimizations and improved sampling techniques for Dirichlet Process Mixture models – Application to deep

sequencing of a genetically heterogeneous sample

THESIS

Submitted in partial fulfillment of the requirements of BITS C421T/422T Thesis

by

Arnab Bhattacharya

2006A8TS171G

Under the supervision of

Dr. Osvaldo Zagordi


BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI – GOA CAMPUS

30th April, 2010


Acknowledgements

I am grateful to Niko Beerenwinkel, Osvaldo Zagordi (D-BSSE, ETH Zurich, Switzerland) and Volker Roth (Department of Computer Science, University of Basel, Switzerland) for giving me the opportunity to work on their ongoing research project. I would like to thank Gabriel Dissard for providing the deep sequencing data.


CERTIFICATE

This is to certify that the Thesis entitled "Performance optimizations and improved sampling techniques for Dirichlet Process Mixture models – Application to deep sequencing of a genetically heterogeneous sample", submitted by Arnab Bhattacharya, ID No. 2006A8TS171G, in partial fulfillment of the requirements of BITS C421T/422T Thesis, embodies the work done by him under my supervision.

Signature of the Supervisor

30th April, 2010


Thesis Abstract

The problem of local haplotype reconstruction and read error correction for a set of reads obtained from a deep sequencing experiment on a genetically diverse sample can be solved in a Bayesian manner using the Dirichlet process mixture. The current model used to formulate and solve the clustering problem (Zagordi et al., 2009) utilizes a standard Gibbs sampler to sample from the joint posterior distribution of haplotype sequences, assignments of reads to haplotypes and the sequencing error rate, in order to determine the local haplotypes present in the population and to estimate the frequency of each. The model, while effective, tends to be computationally intensive. A new model is proposed that introduces a pre-clustering of identical reads before the sampling process, a prior on the error parameter and other performance optimizations. The performance of both models is compared on simulated data and on experimental deep sequencing data obtained from HIV samples.


Table of Contents


1. Introduction

The advent of next generation sequencing (NGS) technologies, also known as deep sequencing, has changed the way we think about scientific approaches in basic, applied and clinical research. The major advantage of NGS over traditional Sanger sequencing is the ability to produce enormous volumes of data, in some cases of the order of one billion short reads per instrument run. The drawback of NGS technologies is that the sequence reads are typically shorter and more prone to error. The sheer volume of deep sequencing data coupled with the significant error rate presents many statistical and computational challenges when analyzing the data.

Traditional Sanger sequencing of a genetically diverse sample results in a consensus sequence of the population, which makes it difficult to detect variants present at low frequencies. This is overcome by deep sequencing, which can potentially be used to accurately quantify the genetic diversity in a mixed sample. The computational method employed by Zagordi et al. (2009) uses a generative probabilistic model for clustering reads based on the Dirichlet process mixture (DPM). The DPM defines a prior distribution that captures the uncertainty in the number of haplotypes, which is characteristic of genetically diverse populations and is eventually determined by the data. The unknown model complexity is estimated by the DPM, which is governed by a single hyperparameter that controls the probability of creating new clusters. The authors devise a Gibbs sampler for sampling from the posterior distribution of the parameters given the observed reads. The approach seeks to separate signals of biological variation from technical noise in the data and also to estimate the frequency of each haplotype in the sample, indicated by the number of reads in each cluster. This algorithm is implemented in the software package ShoRAH, which is freely available at

http://www.cbg.ethz.ch/software/shorah

Sequencing technologies encompass a number of methods that are grouped broadly as template preparation, sequencing and imaging, and data analysis. The type of data produced by each platform depends on the unique combination of specific protocols that distinguishes one technology from another. These differences in data output present additional challenges in data analysis, as different approaches are required for different technologies. For example, reads obtained from Roche/454 technology are typically longer (~400 bp) than those obtained from Illumina/Solexa technology (~75 or 100 bp) but are more error-prone.

The statistical approach to the problem of quantifying genetic diversity in a sample (Zagordi et al., 2009) by analyzing deep sequencing data is computationally expensive. The approach can be applied with good results to data obtained from Roche/454 technology but is impractical for data obtained from Illumina/Solexa technology due to computational constraints. Genetic diversity has significant clinical and public health consequences. In particular, the genetic diversity exhibited by HIV in a single infected patient presents complications in vaccine design. During an HIV infection, the virus replicates rapidly and has a high mutation rate, creating several highly diverse quasispecies of the virus. This phenomenon has been correlated with disease progression as these quasispecies adapt in response to selective pressures exerted by host virus-specific immune responses. The presence of a population of diverse, evolutionarily related viral strains hampers efforts to develop effective treatments, as vaccines must overcome the complex evolutionary dynamics.

A modified model is proposed that aims at reducing the number of reads entering the Gibbs sampler in order to reduce the computational complexity. The model exploits the fact that in Illumina/Solexa datasets a significant number of reads are identical to each other in local windows, and pre-clusters the identical reads before they enter the Gibbs sampler. The number of unique reads is typically much smaller than the total number of reads, and the resulting reduction in computational complexity makes the approach applicable to data obtained from Illumina/Solexa technology.

The modified model introduces a prior on the error parameter to ensure that it is treated in a Bayesian fashion and to establish a level of control through hyperparameters that are estimates of the mean and variance of the error rate. The model also incorporates a method to deal with missing data in the reads. These changes, as well as other computational optimizations, have been implemented in a modified version of the ShoRAH software package.

In the Methods section the model is formally defined and the equations for the Gibbs sampler are derived. In the Results section the performance of both algorithms is compared on simulated and real data to quantify the time saved by using the modified model. The results indicate that a significant amount of time is saved on Illumina/Solexa datasets with little or no loss in the accuracy of the reconstructed haplotype structure.

2. Methods

The original model, presented in Zagordi et al. (2009), relies on the assumption that, in sequence space, reads tend to cluster around the true haplotypes, with a distribution determined by the error process, while haplotypes are separated by their true evolutionary distance. The major changes made to the model are described below:


The original model treated each read as a unique object comprising the sequence of the read. The new model proposes that each set of identical reads, i.e. reads having the same sequence of bases, be replaced by an object which comprises the read sequence and a weight indicating the number of reads contained in the object. The Gibbs sampler is then run using the objects representing each unique read sequence rather than the individual reads.

Thus the set of reads r = {r1, r2, …, rn} is replaced by a set of read objects

r* = { (r*1, w1), (r*2, w2), …, (r*q, wq) } for some q ≤ n
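The pre-clustering step itself is straightforward. The following is a minimal C++ sketch (not the actual ShoRAH implementation) that collapses identical reads in a window into weighted read objects using a hash map; the type and function names are illustrative only.

#include <string>
#include <unordered_map>
#include <vector>

struct ReadObject {
    std::string seq; // the unique read sequence r*_i
    int weight;      // w_i: how many identical reads this object represents
};

std::vector<ReadObject> precluster(const std::vector<std::string>& reads) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& r : reads)
        ++counts[r]; // identical reads hash to the same key
    std::vector<ReadObject> objects;
    objects.reserve(counts.size());
    for (const auto& kv : counts)
        objects.push_back({kv.first, kv.second});
    return objects; // q = objects.size() <= n = reads.size()
}

The Gibbs sampler then operates on the q read objects instead of the n individual reads.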

Dirichlet Process Mixture

The advantage of using the DPM model is that it overcomes the problem of having to choose the correct number of clusters, which is uncertain for the problem at hand as there is no a priori knowledge of the phylogenetic structure of the sample. The DPM provides a flexible Bayesian framework that captures the uncertainty in the number of clusters and the reads assigned to them. This flexibility comes at a cost, however, as inference in DP mixture models is computationally expensive. This can be alleviated to an extent by the pre-clustering of identical reads, as it reduces the number of observations to be clustered. The DPM places a prior on the mixing proportions that leads to a few dominating classes and is controlled by a single hyperparameter, α. The cluster assignment of a read object r*i is thus sampled according to the following probabilities:

(1)

where ci is the class observation i is assigned to, n\i,c denotes the number of observations currently in class c excluding observation i itself, and n is the total number of observations. When the identical reads are pre-clustered, these quantities can be written as:

(2a)

(2b)

The above Equations state that the number of observations currently in class c is the sum of the weights of all read objects currently in class c, and that the total number of observations can likewise be written as the sum of the weights of all the read objects. The probability of creating a new class is controlled by the hyperparameter α and is also a function of the weight of the read object. This implies that read objects with more observations are more likely to form new classes than ones with fewer observations.
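For reference, the standard form of these DPM (Chinese restaurant process) probabilities is sketched below in LaTeX. This is a plausible reconstruction of Equations (1)-(2) from the surrounding text rather than a transcription of the original displays; in particular, the exact way the weight wi enters the new-class term in the modified model may differ.

p(c_i = c \mid c_{\setminus i}) = \frac{n_{\setminus i,c}}{n - 1 + \alpha} \quad \text{(existing class } c\text{)}, \qquad p(c_i = c_{\mathrm{new}} \mid c_{\setminus i}) = \frac{\alpha}{n - 1 + \alpha}

n_{\setminus i,c} = \sum_{i' \neq i:\; c_{i'} = c} w_{i'}, \qquad n = \sum_{i'} w_{i'}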

The errors in the sequencing process are represented by an error parameter θ, with which reads are sampled from the different haplotypes. If the read object r*i comes from haplotype hk, the assignment variable ci takes the value ci = k. The j-th base r*i,j of object i is sampled from the corresponding base hk,j of the haplotype according to the parameter θ, the probability that the base is drawn without error. The model assumes that bases at different positions are independent and subject to the same error rate.
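A minimal sketch of the implied per-base emission probability, assuming (as is standard for such models, though not shown explicitly here) that errors are spread uniformly over the remaining |B| − 1 symbols:

p(r^*_{i,j} = b \mid h_{k,j}, \theta) = \begin{cases} \theta & \text{if } b = h_{k,j} \\ \frac{1-\theta}{|B|-1} & \text{otherwise} \end{cases}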

Under this model, the probability of a set of read objects r* = { (r*1, w1), (r*2, w2), …, (r*q, wq) } given their assignments c = {c1, c2, …, cq} to the haplotypes h = {h1, h2, …, hK} is given by

(3)

with

(4a)

(4b)

where J is the window length (i.e. the length of the reads and of the haplotypes), I is the indicator function and B is the alphabet of the bases.
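Under the same uniform-error assumption, a consistent reconstruction of Equations (3)-(4) is the following, with the weight wi raising each read object's contribution to the power of the number of reads it represents:

p(r^* \mid c, h, \theta) = \prod_{i=1}^{q} \left[ \theta^{\,m_i} \left( \frac{1-\theta}{|B|-1} \right)^{m'_i} \right]^{w_i}

m_i = \sum_{j=1}^{J} \mathbb{I}\left( r^*_{i,j} = h_{c_i,j} \right), \qquad m'_i = \sum_{j=1}^{J} \mathbb{I}\left( r^*_{i,j} \neq h_{c_i,j},\; r^*_{i,j} \in B \right)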

The model assumes that all haplotypes originate from a reference genome through a mutation process similar to the error process that generates reads from haplotypes but with a different parameter γ. The reference genome is taken as the consensus sequence of all reads and is considered a prior as it remains fixed during the sampling process.

Gibbs Sampler

A Markov chain Monte Carlo algorithm is implemented to perform Gibbs sampling from the posterior DPM distribution under the model. The Gibbs sampler samples from a multivariate distribution by drawing a value for each single random variable from its conditional distribution given the values of all other variables, iterating over all of them. The variables sampled in each iteration are the assignment variables ci, the haplotype bases hk,j, the error parameter θ and the mutation parameter γ.
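The overall sweep structure can be outlined as follows. This is an illustrative C++ sketch with a placeholder State type and stubbed sampler routines; the names do not correspond to the actual ShoRAH functions.

struct State {
    int num_objects;  // q: number of read objects
    int num_classes;  // K: current number of haplotype classes
    int J;            // window length
    // assignments c, haplotypes h, parameters theta and gamma, ... (elided)
};

void sample_assignment(State& s, int i)            { /* draw c_i via Equation (6) */ }
void sample_haplotype_base(State& s, int k, int j) { /* draw h_k,j via Equation (12) */ }
void sample_theta(State& s)                        { /* draw theta from its beta conditional */ }
void sample_gamma(State& s)                        { /* estimate gamma via Equation (18) */ }

void gibbs_sweep(State& s) {
    for (int i = 0; i < s.num_objects; ++i)
        sample_assignment(s, i);
    for (int k = 0; k < s.num_classes; ++k)
        for (int j = 0; j < s.J; ++j)
            sample_haplotype_base(s, k, j);
    sample_theta(s);
    sample_gamma(s);
}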


The conditional probabilities of the assignment variables ci take into account the Dirichlet prior,

(5)

The factor b ensures the normalization of the probabilities. Substituting the expression for the probability of an object given its assignment to a haplotype, these Equations become:

(6)

where mi,k and m′i,k are defined analogously to Equation (4) as

(7a)

(7b)

The quantity p(r*i|h0) is the likelihood that the read object comes from any haplotype generated by the reference genome h0 and results from the choice of the prior p(h′) for the haplotypes. With the assumption that all haplotypes originate from a reference genome h0, the probability that a read object originates from any haplotype becomes

(8)
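Although Equation (8) is not reproduced above, one way to write this marginal likelihood consistently with the model description (a sketch, assuming independent positions and the same uniform error and mutation processes as above) is

p(r^*_i \mid h_0) = \prod_{j=1}^{J} \sum_{b \in B} p(r^*_{i,j} \mid b, \theta)\, p(b \mid h_{0,j}, \gamma)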

The haplotype probability p(hk | {r*i : ci = k}), i.e. the probability of haplotype k given all the read objects assigned to it, can be written position by position, as the sequence positions are considered to be independent:

(9)


By means of Bayes’ theorem this probability can be rewritten as

(10)

with

(11a)

(11b)

The normalization constant Z must be chosen such that the sum over all possible alleles is one. Thus

(12)

where mj,k(l) and m′j,k(l) are the same quantities defined in Equation (11) with hk,j = l.
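A plausible reconstruction of Equations (10)-(12) from the definitions above, with weighted match and mismatch counts (the wi factors follow from the pre-clustering; the uniform spreading of errors and mutations over |B| − 1 symbols is again an assumption):

p(h_{k,j} = l \mid \cdot) = \frac{1}{Z}\, \theta^{\,m_{j,k}(l)} \left( \frac{1-\theta}{|B|-1} \right)^{m'_{j,k}(l)} \gamma^{\,\mathbb{I}(l = h_{0,j})} \left( \frac{1-\gamma}{|B|-1} \right)^{\mathbb{I}(l \neq h_{0,j})}

m_{j,k}(l) = \sum_{i:\, c_i = k} w_i\, \mathbb{I}(r^*_{i,j} = l), \qquad m'_{j,k}(l) = \sum_{i:\, c_i = k} w_i\, \mathbb{I}(r^*_{i,j} \neq l,\; r^*_{i,j} \in B)

with Z the sum of the numerator over all l ∈ B, as required by Equation (12).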

Reads obtained from deep sequencing experiments may contain missing data at some sequence positions. When sampling a haplotype k at a sequence position j given all the read objects assigned to it, there is typically at least one read object with observed data at that position, such that mj,k and m′j,k can be computed and the haplotype base is sampled according to Equation (12). However, a situation may arise wherein all read objects in a cluster k have data missing at one or more positions. In such a scenario, the haplotype must be sampled from a different source for these positions. This is facilitated by generating a frequency table at each position in the window, recording the frequency of occurrence of each nucleotide at that position. The table is built before the sampling process using data from all reads and remains fixed throughout the sampling process. The frequency table at position j can be represented as a vector of counts fj(l) over the alphabet B, where fj(l) is the number of reads carrying base l at position j.

Thus, in the event that a haplotype base hk,j cannot be sampled according to Equation (12) due to missing data in all read objects assigned to the cluster (i.e. r*i,j is missing for every i with ci = k), the base is sampled with a single draw from the multinomial distribution given by the frequency table at position j.
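A minimal sketch of this fallback (a hypothetical helper, not the ShoRAH routine), assuming the per-position counts have been tabulated from all reads beforehand:

#include <random>
#include <vector>

// freq[j][l] = number of reads carrying base l (l in 0..4 for A, C, G, T, '-')
// at window position j, computed once from all reads before sampling starts.
int sample_base_from_freq(const std::vector<std::vector<int>>& freq,
                          int j, std::mt19937& rng) {
    const std::vector<int>& f = freq[j];
    std::discrete_distribution<int> draw(f.begin(), f.end()); // single multinomial draw
    return draw(rng);
}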

The modified model proposes a prior on the error parameter θ such that

θ ∼ Beta(α, β)


We define the total number of matches µ and mismatches µ′ between read objects and the haplotypes they are assigned to as follows

(13a)

(13b)

Under this model, the probability of θ given µ and µ′ can be written as

(14)

The α and β parameters of the beta distribution can be controlled using two hyperparameters as follows

(15)

where ε1 and ε2 are prior estimates of the mean and variance of the error parameter θ respectively. These Equations can be rewritten as

(16)

where N is the total number of bases with no missing data; in the ideal case with no missing data in any of the read objects, N is given by the product of the total number of reads n and the window length J:

N = nJ (17)
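Equations (15) and (16) are not reproduced above, but a standard moment-matching parameterization consistent with the surrounding description (a sketch, not necessarily the exact form used) sets the mean and variance of the Beta(α, β) prior equal to ε1 and ε2:

\varepsilon_1 = \frac{\alpha}{\alpha+\beta}, \qquad \varepsilon_2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}

which can be inverted as

\alpha + \beta = \frac{\varepsilon_1(1-\varepsilon_1)}{\varepsilon_2} - 1, \qquad \alpha = \varepsilon_1(\alpha+\beta), \qquad \beta = (1-\varepsilon_1)(\alpha+\beta)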

Since p(µ | θ) follows a binomial distribution, the beta prior is conjugate, and the conditional distribution of θ is again a beta distribution with parameters α + µ and β + µ′.

Finally γ is estimated as

(18)

To summarize: read objects are assigned to existing haplotypes, or instantiate a new class, according to Equation (6). Haplotypes are then sampled according to the read objects associated with them following Equation (10), or from the frequency table as the case may be. θ is then sampled from the beta distribution given above, and finally γ is estimated according to Equation (18). The process is iterated. The starting point for the algorithm is a random assignment of the read objects to a number of haplotypes (the default number is n/10). The assignment of the read objects to haplotypes is recorded after a burn-in phase. The reads contained in a read object are considered to originate from a single haplotype if the object is assigned to that haplotype in more than 90% of the iterations. The estimated frequency of each inferred haplotype is given by the sum of the weights of all read objects assigned to it.
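A sketch of this post-processing step (illustrative names, not ShoRAH code): after the burn-in phase, count per (read object, haplotype) pair how often the assignment occurred, then apply the 90% rule.

#include <map>
#include <utility>
#include <vector>

struct AssignmentStats {
    std::map<std::pair<int, int>, int> counts; // (object i, class k) -> #iterations
    int iterations = 0;                        // recorded (post-burn-in) iterations
};

// Call once per recorded iteration with the current assignment c[i] of each object.
void record(AssignmentStats& s, const std::vector<int>& c) {
    for (int i = 0; i < (int)c.size(); ++i)
        ++s.counts[{i, c[i]}];
    ++s.iterations;
}

// True if object i was assigned to class k in more than 90% of the iterations.
bool supported(const AssignmentStats& s, int i, int k) {
    auto it = s.counts.find({i, k});
    int n = (it == s.counts.end()) ? 0 : it->second;
    return n > 0.9 * s.iterations;
}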

3. Results

The performance of the modified algorithm is compared with that of the original on simulated and experimental data for local haplotype reconstruction and frequency estimation in single windows. For the simulations, datasets with Illumina/Solexa-like reads (36 bp) were generated from eight known haplotypes at pairwise distances between 6 and 15%. Datasets were generated with varying numbers of reads, and the algorithms were run for varying numbers of iterations.

The first batch of datasets considered a mixture of the eight haplotypes with frequencies 0.5, 0.2, 0.1, 0.1, 0.05, 0.02, 0.01 and 0.01 and a fixed error rate of 0.5%. The haplotypes inferred by both algorithms were compared to the known haplotypes to validate correct reconstruction (Figure 1). The figure indicates that both versions reliably detect haplotypes with frequencies down to 2%. The reliability of reconstructing haplotypes at 1% frequency is lower, and the two haplotypes at 1% frequency each tend to be clubbed together. In all instances the modified algorithm detects every haplotype detected by the current algorithm, and for one dataset it even detects a low frequency haplotype that goes undetected by the latter. The frequencies of the inferred haplotypes estimated by each algorithm were compared to assess the accuracy of the modified version (Figure 2a); the estimates of the two algorithms are in agreement.

The time taken by each algorithm on each dataset is reported in Figure 3 to gain insight into the amount of time saved by the new implementation. The plots indicate that for Illumina/Solexa-like reads a significant amount of time is saved. This can be attributed primarily to the pre-clustering of identical reads, as illustrated in Figure 4, which shows the variation of the time taken by the modified algorithm with an upper limit on the number of identical reads that may be pre-clustered into a single read object, for a dataset with 100,000 reads (n = 100,000). As this upper limit is increased, the number of read objects formed decreases and the time taken steadily decreases, leveling out as the limit approaches n. The plot also indicates that even when this limit is set to 1, thus disabling pre-clustering, the modified algorithm is faster than the original.

The second batch of datasets considered a mixture of the eight haplotypes with frequencies 0.5, 0.2, 0.1, 0.1, 0.05, 0.04, 0.009 and 0.001 and a fixed error rate of 1.5%. Figure 2b illustrates that even at the higher error rate the haplotype frequencies estimated by the two algorithms are in agreement, including for haplotype frequencies as low as 0.1%.

Finally the modified algorithm was tested on deep sequencing data obtained from Illumina/Solexa technology in a single window. The window contained 7201 reads, only 598 of which were unique. The result was validated by comparing the inferred haplotypes and their frequencies to those detected by the original algorithm. The time taken by the modified algorithm was less than half of that taken by the original.


4. Discussion

A modified model has been proposed for the analysis of deep sequencing data from a genetically heterogeneous sample. The model aims at improving the performance of the existing algorithm for local haplotype reconstruction and read error correction. Datasets obtained from different next generation sequencing technologies vary considerably in terms of average read length, error rate, etc. The current algorithm implemented in the software ShoRAH is focused on the analysis of reads generated by Roche/454 technology (length ~250-400 bp) and is not suited for reads obtained from Illumina/Solexa technology (length ~36-100 bp). The major barriers to applying the current algorithm to datasets from Illumina/Solexa technology are computational, as these have a much larger number of reads. The proposed modified algorithm is a step towards making it feasible to analyze data of this magnitude in reasonable time frames.

The modified algorithm proposes several changes, the most notable of which is the pre-clustering of identical reads before Gibbs sampling. Datasets from Illumina/Solexa technology typically contain reads that are identical in local windows, and by pre-clustering these together the number of observations to be clustered in the sampling process is reduced, saving a considerable amount of time. This procedure also causes the Markov chain to converge faster to the optimal assignment; thus the sampling process can be run for fewer iterations without adversely affecting the result. In principle an upper bound can be set on the number of identical reads that can be pre-clustered into a read object, but during the simulations it was observed that the results remained accurate even when this limit was set equal to the total number of reads n. The value of α in the DPM must be chosen such that during the sampling process the probability of proposing new classes is not too small. The α value used for the modified algorithm is typically chosen to be larger than for the current algorithm, to ensure that read objects with small weights have a reasonable probability of forming new classes.

While the pre-clustering of identical reads shows a significant performance improvement for shorter read lengths, little or no improvement can be expected for the longer reads obtained from Roche/454 technology, as these datasets typically contain few, if any, identical reads. The new method for counting mismatches between reads and haplotypes (see Supplementary Methods), however, saves a considerable amount of time for longer read lengths.

The model still has room for improvement, for instance by incorporating more information such as different types of sequencing errors and paired-end data, and by improved sampling techniques. A possible improvement to the existing sampling technique is a permutation-augmented sampler for the DPM, focusing on more global moves (Liang et al., 2007).


5. Supplementary Methods

Counting mismatches between reads and haplotypes

In the ShoRAH program the basic comparison between a read and a haplotype involves counting the number of mismatches between the two. This was identified as the major computational bottleneck in the current version of the software. The time complexity of this calculation in the current implementation is proportional to the window length J, as the reads and the haplotypes are stored as arrays of integers from 0 to 5 representing the nucleotides A, C, G, T, an indel ('-') and missing data ('N'), respectively.

The computational complexity can be reduced drastically by using an implementation of Knuth's bit-counting algorithm to count the number of mismatches. This is done by first compressing the read and haplotype sequences into arrays of integers of length J/10 + 1, to enable XOR operations for determining mismatches. The C++ code for this function is given below:

#include <cmath> // for pow

void conversion(int* converted_seq, unsigned short int* sequence, int J) {
    // sequence is the original read/haplotype sequence (integers 0 to 5)
    // converted_seq will store the packed sequence in an array of size J/10 + 1
    int i, j, temp;

    for (i = 0; i < J/10 + 1; i++) converted_seq[i] = 0;

    // pack 10 bases (3 bits, i.e. one octal digit, each) into every int
    for (i = 0; i < J/10; i++) {
        for (j = 10; j > 0; j--) {
            temp = (int) pow(8.0, (double)(j - 1));
            converted_seq[i] += temp * sequence[10 * i + 10 - j];
        }
    }

    // pack the remaining J % 10 bases into the last int
    for (j = 10; j > (10 - (J % 10)); j--) {
        temp = (int) pow(8.0, (double)(j - 1));
        converted_seq[i] += temp * sequence[10 * i + 10 - j];
    }

    return;
}

Since the read sequences remain unchanged during the sampling process, they need only be converted once at the start of the process, whereas the haplotype sequences must be converted before each mismatch count. The C++ code to count the number of mismatches is given below:

const int one_int  = 153391689; // = 01111111111 (octal) = 001 001 001 ... 001 (binary)
const int two_int  = 306783378; // = 02222222222 (octal) = 010 010 010 ... 010 (binary)
const int four_int = 613566756; // = 04444444444 (octal) = 100 100 100 ... 100 (binary)

int* seq_distance_new(int* read, int* haplotype, int missing, int J) {
    // read and haplotype are the converted read and haplotype sequences
    // missing is the number of positions with missing data in the read
    static int dist[2]; // dist[0]: number of mismatches, dist[1]: number of matches
    int x, i;

    dist[0] = 0;
    dist[1] = 0;

    for (i = 0; i < J/10 + 1; i++, read++, haplotype++) {
        x = *read ^ *haplotype; // XOR: a 000 group marks a match at that position

        // collapse every non-zero 3-bit group to 001
        x = (x & one_int) | ((x & two_int) >> 1) | ((x & four_int) >> 2);

        // count the number of set bits (Knuth's algorithm)
        while (x) {
            dist[0]++;
            x &= x - 1;
        }
    }

    dist[1] = J - dist[0];       // matches
    dist[0] = dist[0] - missing; // missing positions are not counted as mismatches

    return dist;
}

The read and haplotype sequences are essentially stored as a sequence of octal numbers with each octal number (or sequence of 3 binary digits) representing a sequence position. Thus a simple XOR of the two sequences would generate a 000 binary sequence for a match at a position and any other sequence of 3 binary digits would indicate a mismatch. The statement

X[i] = (X[i] & one_int) | ((X[i] & two_int) >> 1) | ((X[i] & four_int) >> 2);

essentially converts any sequence of 3 binary digits other than 000 to 001, to indicate a mismatch at that position. The number of 1's in this sequence is then counted using Knuth's algorithm, the time complexity of which is proportional to the number of 1's, i.e. the number of mismatches between the read and the haplotype. To illustrate, consider a window of length J = 12 and the following read and haplotype sequences:


read = A G T T - A A G N T N C

haplotype = A G C T - A A G A T G C

read sequence = 0 2 3 3 4 0 0 2 5 3 5 1

haplotype sequence = 0 2 1 3 4 0 0 2 0 3 2 1

converted read sequence = 000 010 011 011 100 000 000 010 101 011 101 001

converted hap. sequence = 000 010 001 011 100 000 000 010 000 011 010 001

X (after XOR) = 000 000 010 000 000 000 000 000 101 000 111 000

one_int = 001 001 001 001 001 001 001 001 001 001 001 001

two_int = 010 010 010 010 010 010 010 010 010 010 010 010

four_int = 100 100 100 100 100 100 100 100 100 100 100 100

X & one_int (X1) = 000 000 000 000 000 000 000 000 001 000 001 000

(X & two_int) >> 1 (X2) = 000 000 001 000 000 000 000 000 000 000 001 000

(X & four_int) >> 2 (X3) = 000 000 000 000 000 000 000 000 001 000 001 000

X1 | X2 | X3 = 000 000 001 000 000 000 000 000 001 000 001 000

number of 1's = 3

dist[1] = J - number of 1's = 12 - 3 = 9

dist[0] = number of 1's - missing = 3 - 2 = 1

Thus we obtain the desired result of 1 mismatch and 9 matches between the read and the haplotype sequence.

This technique is considerably faster than scanning the original read and haplotype sequences position by position to detect mismatches.
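As a check, the worked example above can be reproduced with a small driver (hypothetical, assuming the conversion and seq_distance_new functions given earlier are in scope):

#include <cstdio>

int main() {
    // the read and haplotype sequences of the example, encoded as integers 0..5
    unsigned short int read_seq[12] = {0, 2, 3, 3, 4, 0, 0, 2, 5, 3, 5, 1};
    unsigned short int hap_seq[12]  = {0, 2, 1, 3, 4, 0, 0, 2, 0, 3, 2, 1};
    int J = 12, missing = 2;       // two 'N' positions in the read
    int read_conv[2], hap_conv[2]; // J/10 + 1 = 2 packed integers each

    conversion(read_conv, read_seq, J);
    conversion(hap_conv, hap_seq, J);
    int* dist = seq_distance_new(read_conv, hap_conv, missing, J);
    printf("mismatches = %d, matches = %d\n", dist[0], dist[1]); // expect 1 and 9
    return 0;
}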

References

Eriksson, N., Pachter, L., Mitsuya, Y., Rhee, S.-Y., Wang, C., Gharizadeh, B., Ronaghi, M., Shafer, R. W., and Beerenwinkel, N. (2008). Viral population estimation using pyrosequencing. PLoS Computational Biology, 4(4), e1000074.

Liang, P., Jordan, M. I., and Taskar, B. (2007). A permutation-augmented sampler for DP mixture models. In Proceedings of the 24th International Conference on Machine Learning (ICML).

Zagordi, O., Geyrhofer, L., Roth, V., and Beerenwinkel, N. (2009). Deep sequencing of a genetically heterogeneous sample: local haplotype reconstruction and read error correction. In Research in Computational Molecular Biology (RECOMB 2009), Lecture Notes in Computer Science, Springer.


[Figure 1 legend: detected by both; not detected by either; detected only by modified algorithm]

Figure 1: Comparison of haplotypes successfully reconstructed by both algorithms for varying numbers of reads and varying numbers of iterations of the Gibbs sampler. 7 out of the 8 known haplotypes are successfully reconstructed for all datasets. Haplotype 8, at 1% frequency, goes undetected in a few datasets and is detected only by the modified algorithm in one instance.


Figure 2a: Comparison of the frequencies of inferred haplotypes estimated by the two algorithms for the first batch of datasets, in which the lowest-frequency haplotypes are at 1%. The estimated frequencies are in agreement.

Figure 2b: Comparison of the frequencies of inferred haplotypes estimated by the two algorithms for the second batch of datasets, in which the lowest-frequency haplotype is at 0.1%. The estimated frequencies are in reasonable agreement even for very low frequency haplotypes.


Figure 3: Comparison of the time taken by the two algorithms for varying numbers of reads and varying numbers of iterations of the Gibbs sampler. The plots indicate that for these datasets a significant amount of time is saved when using the modified model.


[Figure 4 legend: New (modified algorithm); Old (original algorithm)]

Figure 4: Variation of the time taken by the modified algorithm with an upper limit on the number of identical reads that may be pre-clustered into a single read object, for a dataset with 100,000 reads (n = 100,000). As this upper limit is increased, the number of read objects formed decreases and the time taken steadily decreases, leveling out as the limit approaches n. The plot also indicates that even when this limit is set to 1, thus disabling pre-clustering, the modified algorithm is faster than the original.