Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data

Zrinka Puljiz and Haris Vikalo

Electrical and Computer Engineering Department, The University of Texas at Austin

8th International Symposium on Turbo Codes & Iterative Information Processing, Bremen, Germany, August 18-22, 2014



Overview of the Talk

Motivation and background

DNA sequencing and studies of genetic variations

Haplotype assembly

data structure and problem formulation

graphical representation of the problem

existing methods

Communication systems analogy and belief propagation

haplotype assembly as a decoding problem

belief propagation algorithm

performance analysis, comparison with existing methods

Conclusions and future work


DNA Sequencing: Discovering Genetic Blueprint

Determine the order of nucleotides in a DNA sequence

Human Genome Project: mapping the genetic blueprint

followed by sequencing more individuals, studies of genetic variations


Study of Genetic Variations in Humans

Humans are diploid organisms with 23 pairs of chromosomes

chromosomes in a pair of autosomes are homologous

the most common type of variation is the single nucleotide polymorphism (SNP)


Study of Genetic Variations in Humans Cont’d

Describing variations

SNP calling determines locations and type of polymorphisms

based on the detected SNPs, perform genotype calling

example: A/T, A/C, G/T

Genotypes provide only the list of unordered pairs of alleles

no association of alleles with one of the chromosomes in a pair

The complete information is provided by haplotypes

the list of alleles at contiguous sites in a region of a chromosome

example: (A,C,G) and (T,A,T)

fundamental for many applications (personalized medicine!)


Single Individual Haplotyping

Determine a haplotype of an individual using DNA sequencing

The SNP rate is low, typically estimated to be $10^{-3}$

high-throughput DNA sequencing provides reads that are too short

get pairs of fragments at opposite ends of a strand of known length


A Fragment Conflict Graph Interpretation

Represent reads by nodes, conflicts by edges

fragments are in conflict if they cover a common SNP location but have different nucleotides there (so, different chromosomes)

If data is error-free, conflict graph is bipartite

otherwise, the graph may contain odd cycles and is then no longer bipartite
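The conflict-graph test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reads are lists over {0, 1, None}, where `None` is a hypothetical marker for an uncovered SNP site, and the helper names `conflict_graph` and `two_color` are made up for this example. A breadth-first 2-coloring succeeds exactly when the graph is bipartite, and in that case the two color classes give a candidate split of the reads between the two chromosomes.

```python
from collections import deque

MISSING = None  # hypothetical marker for an SNP site not covered by a read


def conflict_graph(reads):
    """Adjacency sets: reads a and b are joined if they disagree at a shared SNP."""
    adj = {a: set() for a in range(len(reads))}
    for a in range(len(reads)):
        for b in range(a + 1, len(reads)):
            if any(x is not MISSING and y is not MISSING and x != y
                   for x, y in zip(reads[a], reads[b])):
                adj[a].add(b)
                adj[b].add(a)
    return adj


def two_color(adj):
    """Return a 0/1 label per read if the graph is bipartite, else None."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle: no consistent two-way split exists
    return color


_ = MISSING
clean = [[0, 1, _], [_, 1, 1], [1, 0, _], [_, 0, 0]]   # toy error-free reads
noisy = clean + [[0, 0, _]]   # extra read conflicting with reads 0 and 2

print(two_color(conflict_graph(clean)))
print(two_color(conflict_graph(noisy)))
```

In the toy data the extra read closes a triangle with reads 0 and 2, which already conflict with each other, so the second call finds no valid coloring.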


Various Formulations of the Haplotype Assembly Problem

If the conflict graph is not bipartite, assembly is non-trivial

Approach: minimize the number of transformation steps needed to alter the graph so that it becomes bipartite

minimum edge removal (MER), minimum fragment removal (MFR), minimum SNP removal

Minimum error correction (MEC): find the smallest number of nucleotides in reads whose flipping to a different value resolves conflicts among the fragments from the same chromosome

essentially, remove cycles in the conflict graph by assuming the fewest possible sequencing errors

NP-hard, various methods: HapCUT [Bansal & Bafna, 2008], HapCompass [Aguiar & Istrail, 2013], HapTree [Berger et al., 2014]


Minimum Error Correction Formulation

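Label bases in heterozygous sites as $h^1_i, h^2_i \in \{1, 0\}$

define $h = h^1 = \bar{h}^2 = [h^1_1 \; h^1_2 \; \dots \; h^1_n]$

Each read is a ternary string with entries 0, 1 and ×

organize reads into a matrix R, row $r_i$ is the i-th read

$$
R = \begin{bmatrix}
\times & \times & 0 & \times & \times & 1 \\
\times & 1 & \times & \times & 0 & \times \\
\times & \times & 0 & \times & 0 & \times \\
0 & \times & \times & 1 & \times & \times \\
1 & \times & 1 & \times & \times & \times \\
\times & \times & 1 & \times & 0 & \times \\
\times & 0 & \times & 0 & \times & \times \\
\times & \times & \times & 0 & \times & 0
\end{bmatrix}
$$

The MEC formulation is concerned with minimizing Z over h,

$$
Z = \sum_{i=1}^{m} \min\bigl(hd(r_i, h),\, hd(r_i, \bar{h})\bigr), \qquad hd(r_i, h) = \sum_{j=1}^{n} d(r_{i,j}, h_j)
$$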
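A minimal Python sketch of the MEC objective follows. It assumes that $d(r_{i,j}, h_j)$ is 1 when the read covers site j and disagrees with $h_j$, and 0 otherwise (including at uncovered × entries); the slide does not spell this out, so treat it as an assumption. The names `hd` and `mec_score` and the use of `None` for × are choices made for this example, and the haplotype passed in at the end is only a sample input.

```python
X = None  # stands for the "×" (uncovered) entries of R

# The 8 x 6 read matrix R from the slide.
R = [
    [X, X, 0, X, X, 1],
    [X, 1, X, X, 0, X],
    [X, X, 0, X, 0, X],
    [0, X, X, 1, X, X],
    [1, X, 1, X, X, X],
    [X, X, 1, X, 0, X],
    [X, 0, X, 0, X, X],
    [X, X, X, 0, X, 0],
]


def hd(read, h):
    """Generalized Hamming distance: mismatches counted over covered sites only."""
    return sum(1 for r, hj in zip(read, h) if r is not X and r != hj)


def mec_score(R, h):
    """Z = sum over reads of min(hd(r_i, h), hd(r_i, complement of h))."""
    h_bar = [1 - hj for hj in h]
    return sum(min(hd(read, h), hd(read, h_bar)) for read in R)


print(mec_score(R, [0, 1, 0, 1, 0, 1]))
```

Taking the minimum of the two distances is what assigns each read to whichever of the two complementary haplotypes explains it better.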


Structure of the Data Matrix

Consider the error-free SNP fragment matrix

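$$
R = \begin{bmatrix}
\times & \times & 0 & \times & \times & 1 \\
\times & 1 & \times & \times & 0 & \times \\
\times & \times & 0 & \times & 0 & \times \\
0 & \times & \times & 1 & \times & \times \\
1 & \times & 1 & \times & \times & \times \\
\times & \times & 1 & \times & 0 & \times \\
\times & 0 & \times & 0 & \times & \times \\
\times & \times & \times & 0 & \times & 0
\end{bmatrix}
$$

Let h = [0 1 0 1 0 1], and the “origin” of the reads in R be s = [0 0 0 0 1 1 1 1]. Then for a binary $R_{i,j}$ it holds

$$
\begin{array}{cc|c}
s_i & h_j & R_{i,j} \\ \hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{array}
$$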


Haplotype Assembly as a Decoding Problem

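Collect indices $\{(i_k, j_k)\}$ identifying positions where the $m \times n$ matrix R has binary entries ($1 \le k \le M$)

Define the “code generating” matrix G,

$$
G(l, k) = \begin{cases} 1, & \text{if } l = j_k \text{ or } l = i_k + n, \quad 1 \le k \le M, \\ 0, & \text{otherwise.} \end{cases}
$$

Example: for

$$
R = \begin{bmatrix} 0 & 1 & \times \\ \times & 1 & 1 \\ 1 & \times & 0 \end{bmatrix},
$$

we construct

$$
G = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}.
$$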
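The construction of G can be sketched in Python as follows. The slide does not state the order in which the binary entries are indexed by k; the sketch assumes a row-by-row scan of R, which does reproduce the 6 × 6 example. The function names and the `None` encoding of × are choices made for this illustration.

```python
X = None  # stands for "×" (no coverage)


def binary_positions(R):
    """1-based (i_k, j_k) pairs of the binary entries of R, scanned row by row."""
    return [(i + 1, j + 1)
            for i, row in enumerate(R)
            for j, entry in enumerate(row)
            if entry is not X]


def build_G(R):
    """(n + m) x M matrix with G[l][k] = 1 iff l = j_k or l = i_k + n (1-based)."""
    m, n = len(R), len(R[0])
    positions = binary_positions(R)
    G = [[0] * len(positions) for _ in range(n + m)]
    for k, (i, j) in enumerate(positions):
        G[j - 1][k] = 1        # row belonging to the SNP variable h_j
        G[n + i - 1][k] = 1    # row belonging to the read-origin variable s_i
    return G


R = [[0, 1, X],
     [X, 1, 1],
     [1, X, 0]]
for row in build_G(R):
    print(*row)
```

Every column of G has exactly two ones, one in the SNP block and one in the read block, so each observed entry ties together one $h_j$ and one $s_i$.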


Haplotype Assembly as a Decoding Problem Cont’d

Define a “message” m = [h s] and a “codeword” c = mG

c collects binary entries from an error-free data matrix R

Due to sequencing errors, some entries in R are erroneously flipped

this can be interpreted as the effect of a binary symmetric channel on c = mG

formally, $y = c + e = [h \; s]G + e$, where $y_k = R(i_k, j_k)$
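The channel view can be illustrated with a short Python sketch using the 3 × 3 example and its matrix G from the previous slide. The values of h and s below are not given in the talk; they are assumed here because they are consistent with that example R. The flip probability and the helper names `encode` and `bsc` are likewise illustrative. All arithmetic is modulo 2.

```python
import random

# G for the 3 x 3 example (rows: h_1..h_3, then s_1..s_3).
G = [
    [1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]


def encode(h, s, G):
    """Codeword c = [h s] G over GF(2); c_k equals h_{j_k} XOR s_{i_k}."""
    msg = h + s
    return [sum(msg[l] * G[l][k] for l in range(len(msg))) % 2
            for k in range(len(G[0]))]


def bsc(c, p, rng):
    """Binary symmetric channel: flip each bit independently with probability p."""
    e = [1 if rng.random() < p else 0 for _ in c]
    y = [(ck + ek) % 2 for ck, ek in zip(c, e)]
    return y, e


h, s = [0, 1, 1], [0, 0, 1]   # assumed values, consistent with the example R
c = encode(h, s, G)           # the binary entries of R, read row by row
y, e = bsc(c, p=0.1, rng=random.Random(0))
print(c, y, e)
```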


Graphical Model

Graphical representation of the problem


Graphical Model Cont’d

Haplotyping with MEC criterion ≡ minimum-distance decoding

using the parity check matrix H: $\mathrm{MEC} = \min_{H(y+e)=0} \|e\|_0$
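One way to see the equivalence is sketched below. It assumes that $d(\cdot,\cdot)$ counts a mismatch only at covered positions, that all arithmetic is modulo 2, and that $\mathcal{C} = \{[h \; s]G\}$ denotes the code generated by G with parity check matrix H (so $Hc = 0$ exactly when $c \in \mathcal{C}$). The symbol $\mathcal{C}$ is introduced here for illustration only. Choosing $s_i = 0$ compares read i with $h$, and $s_i = 1$ compares it with $\bar{h}$.

```latex
\begin{align*}
Z(h) &= \sum_{i=1}^{m} \min\bigl(hd(r_i, h),\, hd(r_i, \bar{h})\bigr)
      = \sum_{i=1}^{m} \min_{s_i \in \{0,1\}} \sum_{k:\, i_k = i}
        \mathbb{1}\{\, y_k \neq h_{j_k} \oplus s_i \,\} \\
     &= \min_{s \in \{0,1\}^m} \sum_{k=1}^{M}
        \mathbb{1}\{\, y_k \neq ([h \; s]G)_k \,\}
      = \min_{s \in \{0,1\}^m} \bigl\| y + [h \; s]G \bigr\|_0 , \\
\min_{h} Z(h) &= \min_{c \in \mathcal{C}} \| y + c \|_0
      = \min_{e:\, H(y+e) = 0} \| e \|_0 .
\end{align*}
```

The last step substitutes $e = y + c$, so that $y + e = c$ is a codeword exactly when $H(y + e) = 0$.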


Belief Propagation for Haplotype Assembly

Graphical model for the belief propagation algorithm


Belief Propagation for Haplotype Assembly Cont’d

Stopping criterion: threshold, max # of iterations reached
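The message update rules of the talk were shown on slides whose content did not survive in this transcript, so the following is not the authors' algorithm. It is a generic loopy sum-product sketch on the same model (each binary entry of R is $h_j \oplus s_i$ observed through a binary symmetric channel), written under several assumptions: the flip probability `p` is known and lies strictly between 0 and 0.5, the first read is pinned to $s_1 = 0$ to break the symmetry between $h$ and $\bar{h}$, the "threshold" stopping rule is read as a bound on the change in messages, and random restarts are omitted. All names are illustrative.

```python
import math

X = None  # "×": SNP site not covered by the read


def bp_haplotype(R, p=0.1, max_iter=50, tol=1e-6):
    """Loopy sum-product over SNP nodes h_j and read-origin nodes s_i.

    Messages are log-likelihood ratios log(P(bit = 0) / P(bit = 1)).
    Requires 0 < p < 0.5 and at least one binary entry in R.
    """
    m, n = len(R), len(R[0])
    edges = [(i, j) for i in range(m) for j in range(n) if R[i][j] is not X]
    # tanh(L_ch / 2) = 1 - 2p for L_ch = log((1 - p) / p); negated when the entry is 1.
    t = {(i, j): (1 - 2 * p) * (1 if R[i][j] == 0 else -1) for (i, j) in edges}
    prior_s = [0.0] * m
    prior_s[0] = 20.0  # assumption: pin s_1 = 0 to break the h / complement symmetry
    to_h = {e: 0.0 for e in edges}  # messages arriving at h_j along edge (i, j)
    to_s = {e: 0.0 for e in edges}  # messages arriving at s_i along edge (i, j)

    def check_update(edge, extrinsic):
        # XOR-factor (tanh rule) update for one noisy observation of h_j XOR s_i.
        return 2 * math.atanh(t[edge] * math.tanh(extrinsic / 2))

    def totals(prior, msgs, axis):
        # Sum the incoming messages per node on top of the node's prior LLR.
        acc = list(prior)
        for e, value in msgs.items():
            acc[e[axis]] += value
        return acc

    for _ in range(max_iter):  # stopping criterion: iteration cap
        sum_s = totals(prior_s, to_s, 0)
        new_to_h = {e: check_update(e, sum_s[e[0]] - to_s[e]) for e in edges}
        sum_h = totals([0.0] * n, new_to_h, 1)
        new_to_s = {e: check_update(e, sum_h[e[1]] - new_to_h[e]) for e in edges}
        change = max(abs(new_to_h[e] - to_h[e]) + abs(new_to_s[e] - to_s[e])
                     for e in edges)
        to_h, to_s = new_to_h, new_to_s
        if change < tol:  # stopping criterion: threshold on message change
            break

    h_llr = totals([0.0] * n, to_h, 1)
    s_llr = totals(prior_s, to_s, 0)
    h_hat = [0 if llr >= 0 else 1 for llr in h_llr]
    s_hat = [0 if llr >= 0 else 1 for llr in s_llr]
    return h_hat, s_hat


R = [[0, 1, X],
     [X, 1, 1],
     [1, X, 0]]
print(bp_haplotype(R))
```

Because the model is invariant under flipping all of h and s together, some form of symmetry breaking is needed for the messages to move away from zero; pinning one read is only one simple option.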


Computational Complexity

Belief propagation algorithm:

Allow random restarts, MAXITER iterations

Schemes relying on parity-check need preprocessing:

Parity check matrix transformation

Complexity for each haplotype block:

O((#SNPs + #reads) × (#entries in R))

This step depends on the locations of the binary entries in the matrix R


Results on 1000 Genomes Project Data


Performance Guarantees

Found lower bounds on $\Pr\{\hat{h} \neq h\}$, $E[\|\hat{h} - h\|_0]$, $E[\#\text{switch errors}]$

[SVV, ITW 2014] Consider the haplotype of length n, error rate p, and probability of assembly error $P_e = \Pr\{\hat{h} \neq h \mid R\}$.

The number of reads m necessary for the assembly satisfies

$$
m \ge \frac{(1 - P_e)\, n}{2\,[1 - H(p)]}.
$$

If $m = \Theta(n \ln n)$, one can determine h accurately with high probability.

Specifically, given a target small constant $\epsilon > 0$, there exists n large enough such that by choosing $m = \Theta(n \ln n)$ the probability of error $P_e \le \epsilon$.
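To get a feel for the lower bound, the right-hand side can be evaluated numerically. The sketch below assumes that $H(p)$ is the binary entropy function measured in bits, which the slide does not state explicitly; the function names and the sample values of n, p and $P_e$ are illustrative.

```python
import math


def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def min_reads(n, p, pe):
    """Right-hand side of m >= (1 - Pe) * n / (2 * [1 - H(p)]); needs p != 0.5."""
    return (1 - pe) * n / (2 * (1 - binary_entropy(p)))


print(min_reads(n=1000, p=0.0, pe=0.0))    # noiseless case reduces to n / 2
print(min_reads(n=1000, p=0.05, pe=0.01))
```

As p approaches 0.5 the denominator vanishes and the bound grows without limit, matching the intuition that fully random reads carry no information about h.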


Summary and Future Work

Developed a novel framework for haplotype assembly

rephrased assembly as a decoding problem

belief propagation algorithm as a solution

outperforms existing methods on 1000 Genomes Project data

Several possible extensions

exploit possible prior SNP/genotype information

develop joint base/SNP/genotype calling and haplotype assembly schemes

Explore other suitable methods and techniques

sparse low-rank matrix completion, spectral partitioning, correlation clustering

Analyze limits of performance, experimental conditions needed to achieve desired accuracy
