Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data
Zrinka Puljiz and Haris Vikalo
Electrical and Computer Engineering Department
The University of Texas at Austin
8th International Symposium on Turbo Codes & Iterative Information Processing
Bremen, Germany, August 18-22, 2014
Iterative Learning of Single Individual Haplotypes 1 / 22
Overview of the Talk
Motivation and background
DNA sequencing and studies of genetic variations
Haplotype assembly
data structure and problem formulation
graphical representation of the problem
existing methods
Communication systems analogy and belief propagation
haplotype assembly as a decoding problem
belief propagation algorithm
performance analysis, comparison with existing methods
Conclusions and future work
DNA Sequencing: Discovering Genetic Blueprint
Determine the order of nucleotides in a DNA sequence
Human Genome Project: mapping the genetic blueprint
followed by sequencing more individuals, studies of genetic variations
Study of Genetic Variations in Humans
Humans are diploid organisms with 23 pairs of chromosomes
chromosomes in a pair of autosomes are homologous
the most common type of variation is the single nucleotide polymorphism (SNP)
Study of Genetic Variations in Humans Cont’d
Describing variations
SNP calling determines locations and type of polymorphisms
based on the detected SNPs, perform genotype calling
example: A/T, A/C, G/T
Genotypes provide only the list of unordered pairs of alleles
no association of alleles with one of the chromosomes in a pair
The complete information is provided by haplotypes
the list of alleles at contiguous sites in a region of a chromosome
example: (A,C,G) and (T,A,T)
fundamental for many applications (personalized medicine!)
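To make the genotype/haplotype distinction concrete, here is a minimal Python sketch. The function names are illustrative and not from the talk. It derives the genotypes from the example haplotype pair and then enumerates every haplotype pair that would produce the same genotypes.

```python
from itertools import product

def genotypes(hap_a, hap_b):
    """Unordered allele pair at each site; phase information is discarded."""
    return [frozenset(pair) for pair in zip(hap_a, hap_b)]

def consistent_haplotype_pairs(geno):
    """Enumerate all unordered haplotype pairs that produce the given genotypes."""
    pairs = set()
    site_options = [sorted(g) for g in geno]
    for choice in product(*site_options):
        hap_a = tuple(choice)
        # the second haplotype takes the other allele at each heterozygous site
        hap_b = tuple(
            next((a for a in opts if a != c), c)
            for c, opts in zip(choice, site_options)
        )
        pairs.add(frozenset([hap_a, hap_b]))
    return pairs

# Example from the slide: haplotypes (A,C,G) and (T,A,T)
geno = genotypes(("A", "C", "G"), ("T", "A", "T"))
candidates = consistent_haplotype_pairs(geno)
```

With three heterozygous sites, `candidates` holds four distinct haplotype pairs, so the genotypes A/T, A/C, G/T alone cannot identify which pair is the true one.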
Single Individual Haplotyping
Determine a haplotype of an individual using DNA sequencing
The SNP rate is low, typically estimated to be $10^{-3}$
high-throughput DNA sequencing provides reads that are too short to cover multiple SNP sites
paired-end sequencing: get pairs of fragments at opposite ends of a strand of known length
A Fragment Conflict Graph Interpretation
Represent reads by nodes, conflicts by edges
fragments are in conflict if they cover a common SNP location but
have different nucleotides there (so, different chromosomes)
If the data is error-free, the conflict graph is bipartite
otherwise, the graph may contain odd cycles
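The conflict-graph view can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: reads are assumed to be lists with `None` at uncovered sites, and the function names and the three-read example are hypothetical.

```python
from collections import deque

def in_conflict(read_a, read_b):
    """Two reads conflict if they disagree at some SNP both of them cover."""
    return any(
        a is not None and b is not None and a != b
        for a, b in zip(read_a, read_b)
    )

def conflict_graph(reads):
    """Adjacency lists: one node per read, one edge per conflicting pair."""
    adj = {i: [] for i in range(len(reads))}
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if in_conflict(reads[i], reads[j]):
                adj[i].append(j)
                adj[j].append(i)
    return adj

def two_color(adj):
    """BFS 2-coloring; returns a side (0/1) per node, or None if not bipartite."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle found
    return color

clean = [[0, 1, None], [None, 1, 1], [1, None, 0]]
noisy = [[0, 0, None], [None, 1, 1], [1, None, 0]]  # one entry of read 0 flipped
clean_sides = two_color(conflict_graph(clean))
noisy_sides = two_color(conflict_graph(noisy))
```

For the clean reads the coloring splits the fragments into two groups, one per chromosome. The single flipped entry in `noisy` creates a triangle, so no valid 2-coloring exists and `two_color` returns `None`.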
Various Formulations of the Haplotype Assembly Problem
If the conflict graph is not bipartite, assembly is non-trivial
Approach: minimize the number of transformation steps needed to alter the graph so that it becomes bipartite
minimum edge removal (MER), minimum fragment removal (MFR),
minimum SNP removal
Minimum error correction (MEC): find the smallest number of nucleotides in reads whose flipping to a different value resolves conflicts among the fragments from the same chromosome
essentially, remove odd cycles in the conflict graph by assuming the
fewest possible sequencing errors
NP-hard, various methods: HapCUT [Bansal & Bafna, 2008],
HapCompass [Aguiar & Istrail, 2013], HapTree [Berger et al., 2014]
Minimum Error Correction Formulation
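Label bases in heterozygous sites as $h^1_i, h^2_i \in \{1, 0\}$
define $h = h^1 = \bar{h}^2 = [h^1_1 \; h^1_2 \; \dots \; h^1_n]$
Each read is a ternary string with entries 0, 1 and $\times$
organize reads into a matrix $R$, row $r_i$ is the $i$-th read
$$
R = \begin{bmatrix}
\times & \times & 0 & \times & \times & 1 \\
\times & 1 & \times & \times & 0 & \times \\
\times & \times & 0 & \times & 0 & \times \\
0 & \times & \times & 1 & \times & \times \\
1 & \times & 1 & \times & \times & \times \\
\times & \times & 1 & \times & 0 & \times \\
\times & 0 & \times & 0 & \times & \times \\
\times & \times & \times & 0 & \times & 0
\end{bmatrix}
$$
The MEC formulation is concerned with minimizing $Z$ over $h$,
$$
Z = \sum_{i=1}^{m} \min\big(\mathrm{hd}(r_i, h),\ \mathrm{hd}(r_i, \bar{h})\big), \qquad \mathrm{hd}(r_i, h) = \sum_{j=1}^{n} d(r_{i,j}, h_j)
$$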
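The MEC objective can be evaluated directly for a candidate haplotype. The sketch below assumes the usual convention that $d(r_{i,j}, h_j)$ is 1 when the read entry is binary and differs from $h_j$, and 0 otherwise (including uncovered sites). Uncovered sites are encoded as `None`, and the function names and the small three-read matrix are illustrative.

```python
def hd(read, hap):
    """Generalized Hamming distance: count mismatches over covered sites only."""
    return sum(1 for r, h in zip(read, hap) if r is not None and r != h)

def mec_score(reads, hap):
    """Z = sum_i min(hd(r_i, h), hd(r_i, complement of h))."""
    comp = [1 - b for b in hap]
    return sum(min(hd(r, hap), hd(r, comp)) for r in reads)

reads = [[0, 1, None], [None, 1, 1], [1, None, 0]]
z_good = mec_score(reads, [0, 1, 1])  # explains every read without flips
z_bad = mec_score(reads, [0, 0, 1])   # needs flips in two reads
```

Because each read is matched against both $h$ and its complement, a haplotype and its complement always receive the same score.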
Structure of the Data Matrix
Consider the error-free SNP fragment matrix
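$$
R = \begin{bmatrix}
\times & \times & 0 & \times & \times & 1 \\
\times & 1 & \times & \times & 0 & \times \\
\times & \times & 0 & \times & 0 & \times \\
0 & \times & \times & 1 & \times & \times \\
1 & \times & 1 & \times & \times & \times \\
\times & \times & 1 & \times & 0 & \times \\
\times & 0 & \times & 0 & \times & \times \\
\times & \times & \times & 0 & \times & 0
\end{bmatrix}
$$
Let $h = [0\ 1\ 0\ 1\ 0\ 1]$, and the “origin” of the reads in $R$ be $s = [0\ 0\ 0\ 0\ 1\ 1\ 1\ 1]$. Then for a binary $R_{i,j}$ it holds

$s_i$   $h_j$   $R_{i,j}$
0       0       0
0       1       1
1       0       1
1       1       0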
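The truth table is the XOR of the read origin and the haplotype bit, $R_{i,j} = s_i \oplus h_j$. A minimal sketch of that generative model follows. The `covered` argument (which sites each read spans) and the function name are assumptions made for illustration.

```python
def fragment_matrix(hap, origin, covered):
    """Error-free reads: R[i][j] = origin[i] XOR hap[j] where covered, else None."""
    return [
        [(s ^ hap[j]) if j in cols else None for j in range(len(hap))]
        for s, cols in zip(origin, covered)
    ]

h = [0, 1, 0, 1, 0, 1]
# two reads: one from chromosome 0 covering sites 3 and 6,
# one from chromosome 1 covering sites 1 and 3 (0-based column sets below)
rows = fragment_matrix(h, origin=[0, 1], covered=[{2, 5}, {0, 2}])
```

With these inputs the two generated rows have the same pattern as the first and fifth rows of $R$.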
Haplotype Assembly as a Decoding Problem
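Collect indices $\{(i_k, j_k)\}$ identifying positions where the $m \times n$ matrix $R$ has binary entries ($1 \le k \le M$)
Define the “code generating” matrix $G$,
$$
G(l, k) = \begin{cases} 1, & \text{if } l = j_k \text{ or } l = i_k + n, \quad 1 \le k \le M, \\ 0, & \text{otherwise.} \end{cases}
$$
Example: for $R = \begin{bmatrix} 0 & 1 & \times \\ \times & 1 & 1 \\ 1 & \times & 0 \end{bmatrix}$, we construct
$$
G = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}.
$$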
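The construction of $G$ can be sketched as follows. The code assumes the binary entries are indexed in row-major order, which is the ordering that matches the example, and it uses 0-based indices and `None` for uncovered sites. The function names are illustrative.

```python
def binary_positions(R):
    """Row-major list of (i, j) with a binary entry; None marks an uncovered site."""
    return [(i, j) for i, row in enumerate(R) for j, v in enumerate(row) if v is not None]

def generator_matrix(R):
    """(n + m) x M matrix with G[j_k][k] = G[n + i_k][k] = 1 (0-based indices)."""
    m, n = len(R), len(R[0])
    pos = binary_positions(R)
    G = [[0] * len(pos) for _ in range(n + m)]
    for k, (i, j) in enumerate(pos):
        G[j][k] = 1      # row tied to SNP j_k (a haplotype bit)
        G[n + i][k] = 1  # row tied to read i_k (an origin bit)
    return G

R = [[0, 1, None], [None, 1, 1], [1, None, 0]]
G = generator_matrix(R)
```

Every column of `G` has exactly two ones, one selecting a haplotype bit and one selecting a read-origin bit, so each codeword symbol is the XOR of those two bits.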
Haplotype Assembly as a Decoding Problem Cont’d
Define a “message” m = [h s] and a “codeword” c = mG
c collects binary entries from an error-free data matrix R
Due to sequencing errors, some entries in $R$ are erroneously flipped
this can be interpreted as the effect of a binary symmetric channel on $c = mG$
formally, $y = c + e = [h \; s]G + e$, where $y_k = R(i_k, j_k)$
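A short sketch of the channel model: encode the message $[h \; s]$ with $G$ over GF(2), then pass the codeword through a binary symmetric channel. The matrix is the $G$ from the $3 \times 3$ example. The values $h = [0\ 1\ 1]$, $s = [0\ 0\ 1]$ are an assumption (one choice consistent with that example $R$), and the function names are illustrative.

```python
import random

# G from the 3 x 3 example: rows 0-2 correspond to h, rows 3-5 to s
G = [
    [1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]

def encode(h, s, G):
    """Codeword c = [h s] G over GF(2)."""
    msg = list(h) + list(s)
    return [sum(msg[l] * G[l][k] for l in range(len(msg))) % 2 for k in range(len(G[0]))]

def bsc(c, p, rng):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [bit ^ (rng.random() < p) for bit in c]

c = encode([0, 1, 1], [0, 0, 1], G)        # binary entries of the example R, row by row
y = bsc(c, p=0.05, rng=random.Random(0))   # noisy observations
```

Complementing both $h$ and $s$ yields the same codeword, which reflects the fact that the labeling of the two chromosomes is arbitrary.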
Graphical Model
Graphical representation of the problem
Graphical Model Cont’d
Haplotyping with MEC criterion $\equiv$ minimum-distance decoding
using the parity check matrix $H$: $\mathrm{MEC} = \min_{e:\, H(y+e)=0} \|e\|_0$
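The equivalence can be checked by brute force on a tiny instance. The sketch below does not use a parity check matrix. It enumerates all messages $[h \; s]$ directly, which is equivalent because $H(y+e)=0$ simply says that $y+e$ is a codeword. The names and the three-read examples are illustrative, and exhaustive search is only feasible for very small $n + m$.

```python
from itertools import product

def observations(R):
    """Triples (i, j, y) for every binary entry; None marks an uncovered site."""
    return [(i, j, v) for i, row in enumerate(R) for j, v in enumerate(row) if v is not None]

def min_distance(R):
    """min over messages [h s] of the Hamming distance between y and [h s]G."""
    m, n = len(R), len(R[0])
    obs = observations(R)
    best = None
    for msg in product((0, 1), repeat=n + m):
        h, s = msg[:n], msg[n:]
        dist = sum((h[j] ^ s[i]) != y for i, j, y in obs)
        best = dist if best is None else min(best, dist)
    return best

def min_mec(R):
    """min over h of Z = sum_i min(hd(r_i, h), hd(r_i, complement of h))."""
    n = len(R[0])
    best = None
    for h in product((0, 1), repeat=n):
        z = 0
        for row in R:
            d0 = sum(v is not None and v != h[j] for j, v in enumerate(row))
            d1 = sum(v is not None and v == h[j] for j, v in enumerate(row))
            z += min(d0, d1)
        best = z if best is None else min(best, z)
    return best

clean = [[0, 1, None], [None, 1, 1], [1, None, 0]]
noisy = [[0, 0, None], [None, 1, 1], [1, None, 0]]  # one entry flipped
```

The two quantities agree because, for a fixed $h$, the best origin bit $s_i$ can be chosen independently for each read, which is exactly the inner minimum in $Z$.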
Belief Propagation for Haplotype Assembly
Graphical model for the belief propagation algorithm
Belief Propagation for Haplotype Assembly Cont’d
Stopping criterion: threshold, max # of iterations reached
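The message update rules are not spelled out in the text here, so the following is a generic textbook sum-product sketch in log-likelihood-ratio (LLR) form, not necessarily the updates used in the talk. It assumes a binary symmetric channel with known error rate `p` in (0, 0.5), one check per binary entry tying $h_{j_k}$ and $s_{i_k}$ to $y_k$, and it pins $s_1 = 0$ with a strong prior (`pin`) to break the symmetry between $(h, s)$ and its complement. All names and default values are hypothetical.

```python
import math

def boxplus(a, b):
    """LLR of the XOR of two independent bits with LLRs a and b."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))

def bp_haplotype(R, p=0.05, max_iter=50, tol=1e-6, pin=20.0):
    """Loopy sum-product over variables h_j, s_i; positive LLR means bit 0."""
    m, n = len(R), len(R[0])
    obs = [(i, j, v) for i, row in enumerate(R) for j, v in enumerate(row) if v is not None]
    # channel LLR of z_k = h_j XOR s_i given the observed entry y_k
    chan = [(1 - 2 * y) * math.log((1 - p) / p) for _, _, y in obs]
    to_h = [0.0] * len(obs)  # check k -> h_{j_k}
    to_s = [0.0] * len(obs)  # check k -> s_{i_k}
    prior_s = [0.0] * m
    prior_s[0] = pin  # assumption: fix the first read's origin to 0

    def beliefs():
        bel_h, bel_s = [0.0] * n, list(prior_s)
        for k, (i, j, _) in enumerate(obs):
            bel_h[j] += to_h[k]
            bel_s[i] += to_s[k]
        return bel_h, bel_s

    for _ in range(max_iter):  # stopping criterion 1: max # of iterations
        bel_h, bel_s = beliefs()
        # extrinsic variable-to-check messages feed the check-to-variable updates
        new_h = [boxplus(bel_s[i] - to_s[k], chan[k]) for k, (i, j, _) in enumerate(obs)]
        new_s = [boxplus(bel_h[j] - to_h[k], chan[k]) for k, (i, j, _) in enumerate(obs)]
        delta = max((abs(a - b) for a, b in zip(new_h + new_s, to_h + to_s)), default=0.0)
        to_h, to_s = new_h, new_s
        if delta < tol:  # stopping criterion 2: messages changed less than a threshold
            break

    bel_h, bel_s = beliefs()
    return [0 if b >= 0 else 1 for b in bel_h], [0 if b >= 0 else 1 for b in bel_s]

R = [[0, 1, None], [None, 1, 1], [1, None, 0]]
h_est, s_est = bp_haplotype(R)
```

On this error-free example every message carries a sign consistent with the data, so the hard decisions settle on a haplotype and read origins that explain all entries. On noisy data with cycles, loopy belief propagation is a heuristic and may need restarts.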
Computational Complexity
Belief propagation algorithm:
Allow random restarts, MAXITER iterations
Schemes relying on parity-check need preprocessing:
Parity check matrix transformation
Complexity for each haplotype block:
O((#SNP + #Reads) × (#entries in R))
This step depends on the locations of the binary entries in the
matrix R
Results on 1000 Genomes Project Data
Performance Guarantees
Found lower bounds on $\Pr\{\hat{h} \neq h\}$, $E[\|\hat{h} - h\|_0]$, $E[\#\text{switch errors}]$
[SVV, ITW 2014] Consider the haplotype of length $n$, error rate $p$, and probability of assembly error $P_e = \Pr\{\hat{h} \neq h \mid R\}$.
The number of reads $m$ necessary for the assembly satisfies
$$
m \ge \frac{(1 - P_e)\, n}{2\,[1 - H(p)]}.
$$
If $m = \Theta(n \ln n)$, one can determine $h$ accurately with high probability.
Specifically, given a target small constant $\epsilon > 0$, there exists $n$ large enough such that by choosing $m = \Theta(n \ln n)$ the probability of error $P_e \le \epsilon$.
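The lower bound is easy to evaluate numerically. The sketch below assumes $H(p)$ is the binary entropy function in bits and reads the bound as $(1 - P_e)\,n$ divided by $2[1 - H(p)]$. The function names are illustrative, and `p` must be strictly below 0.5 for the bound to be finite.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_reads_lower_bound(n, p, pe):
    """Right-hand side of m >= (1 - Pe) n / (2 [1 - H(p)])."""
    return (1 - pe) * n / (2 * (1 - binary_entropy(p)))

bound_clean = min_reads_lower_bound(n=1000, p=0.0, pe=0.0)   # noiseless reads
bound_noisy = min_reads_lower_bound(n=1000, p=0.05, pe=0.0)  # 5% sequencing errors
```

As the error rate grows toward 0.5, $1 - H(p)$ shrinks toward zero and the required number of reads grows without bound.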
Summary and Future Work
Developed a novel framework for haplotype assembly
rephrased assembly as a decoding problem
belief propagation algorithm as a solution
outperforms existing methods on 1000 Genomes Project data
Several possible extensions
exploit possible prior SNP/genotype information
develop joint base/SNP/genotype calling and haplotype assembly
schemes
Explore other suitable methods and techniques
sparse low-rank matrix completion, spectral partitioning, correlation
clustering
Analyze limits of performance, experimental conditions needed to achieve desired accuracy