
Upload: marvin-glenn

Post on 29-Jan-2016

Page 1: Identification of SNP Alleles in DNA Sequences Giuseppe Lancia Università di Padova e Celera Genomics

Identification of SNP Alleles in DNA Sequences

Giuseppe Lancia
Università di Padova and Celera Genomics

Polymorphisms

A polymorphism is a feature
- common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few

E.g. think of eye-color.

Or blood-type, for a feature not visible from outside.

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
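As a rough illustration of what "varying in a population" means on aligned sequences, the sketch below scans columns and reports those where more than one nucleotide occurs. The function names `snp_sites` and `alleles_at` are hypothetical, and the input is assumed to be already aligned and error-free, which real data is not.

```python
def snp_sites(sequences):
    """Return the 0-based positions where the aligned sequences do not all agree."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


def alleles_at(sequences, site):
    """Return the sorted list of nucleotides observed at one position."""
    return sorted({s[site] for s in sequences})


# The four distinct chromosome sequences that occur in the slide's population.
chromosomes = [
    "atcggattagttagggcacaggacggac",
    "atcggattagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacggac",
]

for site in snp_sites(chromosomes):
    print(site, alleles_at(chromosomes, site))
```

On the slide's sequences this finds two SNP sites, one with alleles a/c and one with alleles g/t.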

- SNPs are the predominant form of human variation
- Used for drug design, disease studies, forensics, evolutionary studies...
- On average one every 1,000 bases

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites

E.g. the two chromosomes of the first individual,

atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac

have haplotypes ag and at.

GENOTYPE: “union” of 2 haplotypes

Haplotype pairs of the 7 individuals and their genotypes (reading Oa as "homozygous for a" and E as "heterozygous"):

ag at    OaE
ct ag    EE
ct cg    OcE
at at    OaOt
ag cg    EOg
ag cg    EOg
ag ag    OaOg

CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact). Call them X and O. Also, write ? for a site that is heterozygous.

HAPLOTYPE: string over X, O
GENOTYPE: string over X, O, ?

The 7 example individuals (haplotype pair, genotype) in the new symbols:

xo xx    x?
ox xo    ??
ox oo    o?
xx xx    xx
xo oo    ?o
xo oo    ?o
xo xo    xo
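The "union" of two haplotypes can be sketched in a few lines: a site keeps its symbol when both haplotypes agree and becomes ? otherwise. The function name `genotype` is an assumption made for this illustration, and lowercase x/o are used as in the slide's example.

```python
def genotype(h1, h2):
    """Combine two haplotypes over {x, o} into a genotype over {x, o, ?}."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    return "".join(a if a == b else "?" for a, b in zip(h1, h2))


# Haplotype pairs of the 7 example individuals, in the x/o symbols.
pairs = [("xo", "xx"), ("ox", "xo"), ("ox", "oo"), ("xx", "xx"),
         ("xo", "oo"), ("xo", "oo"), ("xo", "xo")]
genotypes = [genotype(a, b) for a, b in pairs]
print(genotypes)
```

Note that the map is not invertible: the genotype ?? is produced both by (ox, xo) and by (xx, oo), which is exactly the ambiguity the population problem has to resolve.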

THE HAPLOTYPING PROBLEM

Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome).

Population: Given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome per individual).

For the individual problem, the input is erroneous haplotype data, from sequencing.

For the population problem, the input is ambiguous genotype data, from screening.

The objective is led by Occam’s razor: find the minimum explanation of the observed data under the given hypotheses (a.k.a. the parsimony principle).

Theory and Results

Single individual

- Polynomial algorithms for gapless haplotyping (Lancia, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, Lancia, Istrail, Rizzi 02)
- Polynomial algorithms for bounded-length gapped haplotyping (BLIR 02)
- NP-hardness for general gapped haplotyping (LBILS 01)

Population

- APX-hardness (Gusfield 00)
- Reduction to graph-theoretic model and I.P. approach (Gusfield 01)
- New formulations and disease detection (Lancia, Pesole 02)

The Single-Individual Haplotyping Problem

Shotgun Assembly of a Chromosome [Weber and Myers, 1997]

[Figure: copies of the chromosome ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT go through three steps: fragmentation, sequencing of the fragments, and assembly back into ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT.]

ERROR SOURCES

Sequencing errors:
  ACTGCCTGGCCAATGGAACGGACAAG
  fragments: CTGGCCAAT, CATTGGAAC, AATGGAACGGA

Paralogous regions:
  ACAAACCCTTTGGGACT … CTAGTAAACCCTATGGGGA
  fragments: AAACCCTT, TAAACCCT, CTATGGGA, CCTATGG, CTTTGGGACT, ACCCTATGGG

Given errors (sequencing errors and/or paralogous regions), the data may be inconsistent with exactly 2 haplotypes.

Hence, the assembler is unable to build 2 chromosomes.

PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes.

The data: a SNP matrix

[Figure: the sequenced fragments laid out against the assembled sequence; the allele each fragment shows at a SNP site is recorded as X or O.]

SNPs 1,..,n (columns), Fragments 1,..,m (rows)

     1 2 3 4 5 6 7 8 9
1    - - - O X X O O -
2    - O - O X - - - X
3    X X O X X - - - -
4    O O X - - - - O -
5    - - - - - - - X O
6    - - - - O O O X -

Fragment conflict: two fragments that disagree at a SNP covered by both can’t be on the same haplotype.

Fragment Conflict Graph GF(M): 1 node for each fragment (row), edge between conflicting fragments.

[Figure: GF(M) on nodes 1..6]

We have 2 haplotypes iff GF is BIPARTITE.
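The bipartiteness test can be sketched as follows, under the assumption that two fragments conflict exactly when they show different symbols at a SNP both cover. The names `conflict`, `fragment_conflict_graph` and `two_coloring` are hypothetical; the 2-coloring is a plain breadth-first search, not necessarily the method used by the authors.

```python
from collections import deque


def conflict(f, g):
    """True if fragments f and g disagree at some SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def fragment_conflict_graph(rows):
    """Adjacency sets of GF(M); nodes are 0-based row indices."""
    adj = {i: set() for i in range(len(rows))}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if conflict(rows[i], rows[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def two_coloring(adj):
    """Return {node: 0 or 1} if the graph is bipartite, otherwise None."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle found
    return color


# The example SNP matrix, one string per fragment.
M = ["---OXXOO-", "-O-OX---X", "XXOXX----",
     "OOX----O-", "-------XO", "----OOOX-"]

print(two_coloring(fragment_conflict_graph(M)))      # whole matrix
print(two_coloring(fragment_conflict_graph(M[:5])))  # without fragment 6
```

On the example matrix the full graph contains a triangle on fragments 1, 3 and 6, so no 2-coloring exists; dropping fragment 6 yields a coloring whose two classes are the two haplotypes.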

PROBLEM (Fragment Removal): make GF bipartite

On the example matrix, removing fragment 6 makes GF bipartite. The remaining fragments split into two groups, and each group merges into one haplotype:

     1 2 3 4 5 6 7 8 9
1    - - - O X X O O -
2    - O - O X - - - X
4    O O X - - - - O -
     O O X O X X O O X

3    X X O X X - - - -
5    - - - - - - - X O
     X X O X X - - X O
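For small matrices the fragment removal problem can be explored by exhaustive search. The sketch below is only an illustration (exponential time, hypothetical function names) and is not the polynomial algorithm discussed later; it tries removal sets of increasing size and returns the first one that leaves a bipartite conflict graph.

```python
from itertools import combinations


def conflict(f, g):
    """True if fragments f and g disagree at some SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def is_bipartite(rows):
    """2-color the conflict graph of the given rows by depth-first search."""
    color = {}
    for start in range(len(rows)):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(rows)):
                if v != u and conflict(rows[u], rows[v]):
                    if v not in color:
                        color[v] = 1 - color[u]
                        stack.append(v)
                    elif color[v] == color[u]:
                        return False
    return True


def min_fragment_removal(rows):
    """Smallest tuple of 0-based row indices whose removal makes GF bipartite."""
    n = len(rows)
    for k in range(n + 1):
        for removed in combinations(range(n), k):
            kept = [rows[i] for i in range(n) if i not in removed]
            if is_bipartite(kept):
                return removed


def merge(rows):
    """Merge mutually compatible fragments into one haplotype string."""
    return "".join(next((c for c in col if c != "-"), "-") for col in zip(*rows))


M = ["---OXXOO-", "-O-OX---X", "XXOXX----",
     "OOX----O-", "-------XO", "----OOOX-"]

print(min_fragment_removal(M))
print(merge([M[0], M[1], M[3]]), merge([M[2], M[4]]))
```

The optimum is not unique here: the search returns a single fragment different from the slide's choice of fragment 6, and both removals leave a bipartite graph. The `merge` call reproduces the two haplotypes shown for the slide's solution.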

Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph:

- NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]
- O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]
- not O(|V|^ε)-approximable for some ε > 0 [Lund and Yannakakis, 1993]

Are there cases of M for which GF(M) is easier?

YES: the gapless M

---OXXOO---OXOOX---   1 gap
---OXXOOXOXOXOOX---   gapless
---OXX--XO----OX---   2 gaps
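A gap is a maximal run of holes strictly between the first and last SNP a fragment covers; leading and trailing holes do not count. A minimal sketch of that definition, with the hypothetical helpers `count_gaps` and `total_gap_length`:

```python
import re


def count_gaps(fragment):
    """Number of maximal runs of '-' between the first and last covered SNP."""
    core = fragment.strip("-")
    return len(re.findall(r"-+", core))


def total_gap_length(fragment):
    """Number of holes inside the covered span of the fragment."""
    return fragment.strip("-").count("-")


for f in ["---OXXOO---OXOOX---", "---OXXOOXOXOXOOX---", "---OXX--XO----OX---"]:
    print(f, count_gaps(f), total_gap_length(f))
```

A matrix is gapless when `count_gaps` is 0 for every row.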

Why gaps?

- Sequencing errors (bases with low confidence are not called):
  ---OOXX?XX--- ===> ---OOXX-XX---
- Celera’s mate pairs:
  [Figure: a mate pair on the sequence attcgttgtagtggtagcctaaatgtcggtagaccttga]

THEOREM

For a gapless M, the Min Fragment Removal Problem is polynomial.

NOTE: M does not need to be gapless. It is enough that its columns can be sorted so that it becomes gapless (Consecutive Ones Property, Booth and Lueker, 1976).

An O(nm + n^3) D.P. algorithm

1    - O O X X O O - -
2    - - X O X X O - -
3    - - - X X O - - -
4    - - - - O O X O -
5    - - - - - X O X O

LFT(i), RGT(i): leftmost and rightmost SNP covered by fragment i. Sort the fragments according to LFT.

D(i; h, k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)

D(i; h, k) =
    D(i-1; h, k)        if i, k compatible and RGT(i) <= RGT(k), or i, h compatible and RGT(i) <= RGT(h)
    1 + D(i-1; h, k)    otherwise

OPT is min over h, k of D(n; h, k) and can be found in time O(nm + n^3).
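The quantities LFT and RGT that drive the dynamic program are easy to compute. The sketch below covers only this preprocessing (1-based SNP indices, hypothetical function names `lft`, `rgt`, `sort_by_lft`) and deliberately does not attempt the full recurrence, which the slide states only in abbreviated form.

```python
def lft(fragment):
    """1-based index of the leftmost SNP covered by the fragment."""
    return min(i for i, c in enumerate(fragment, start=1) if c != "-")


def rgt(fragment):
    """1-based index of the rightmost SNP covered by the fragment."""
    return max(i for i, c in enumerate(fragment, start=1) if c != "-")


def sort_by_lft(rows):
    """Order the fragments by their leftmost covered SNP."""
    return sorted(rows, key=lft)


# The gapless example matrix of the slide, one string per fragment.
rows = ["-OOXXOO--", "--XOXXO--", "---XXO---", "----OOXO-", "-----XOXO"]

for f in sort_by_lft(rows):
    print(f, lft(f), rgt(f))
```

In a gapless matrix each fragment is fully described by the interval [LFT, RGT], which is what lets the recurrence compare a new row only against the two retained rows reaching furthest to the right.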

WITH GAPS…

Th: NP-hard if 2 gaps per fragment.
Proof (simple): use the fact that for every G there is M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs.

Th: NP-hard even if 1 gap per fragment.
Proof: technical. Reduction from MAX2SAT.

But gaps must be long for the problem to be difficult.

We have an O(2^(2L) mn + 2^(3L) n^3) D.P. for MFR on a matrix with total gaps length L.

Fragment removal is good for getting rid of contaminants.

However, we may want to keep all fragments and correct errors otherwise.

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes. All fragments get assigned to one of the two haplotypes.

We describe the min SNP removal problem: remove the fewest columns from M so that the fragment graph becomes bipartite.

SNP conflicts

1 2 3 4 5 6 7 8 9
- - - O X X O O -
- O X O X - - - X
X X O X X - - - -
O O X - - - O O -
- - - - - - X X O
- - - - O O O X -

[Figure: pairs of columns of M are examined one at a time and marked OK or CONFLICT!]

Two SNPs conflict if two fragments covering both agree on one of them and disagree on the other.

SNP conflict graph GS(M): 1 node for each SNP (column), edge between conflicting SNPs.

[Figure: GS(M) on nodes 1..9]
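The SNP conflict graph can be sketched under the assumption that two columns conflict when some pair of rows covering both is equal on one column and different on the other. The names `snp_conflict` and `snp_conflict_graph` are hypothetical, and the pairwise scan is quadratic in the number of rows, chosen for clarity rather than speed.

```python
from itertools import combinations


def snp_conflict(col_a, col_b):
    """True if two rows covering both columns agree on one and disagree on the other."""
    rows = [(a, b) for a, b in zip(col_a, col_b) if a != "-" and b != "-"]
    return any((a1 == a2) != (b1 == b2)
               for (a1, b1), (a2, b2) in combinations(rows, 2))


def snp_conflict_graph(matrix):
    """Edge set of GS(M), with 1-based column numbers."""
    cols = list(zip(*matrix))
    return {(i + 1, j + 1)
            for i, j in combinations(range(len(cols)), 2)
            if snp_conflict(cols[i], cols[j])}


# The example matrix of the SNP conflict slides, one string per fragment.
M = ["---OXXOO-", "-OXOX---X", "XXOXX----",
     "OOX---OO-", "------XXO", "----OOOX-"]

print(sorted(snp_conflict_graph(M)))
```

On this matrix the sketch reports the conflicts (2,5), (3,5), (4,5), (5,7), (6,7) and (7,8), so removing columns 5 and 7 would leave GS(M) without edges.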

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

THEOREM 2

For a gapless M, GS(M) is a perfect graph

COROLLARY

For a gapless M, the min SNP removal problem is polynomial

THEOREM 1: For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.

Take an odd cycle in GF.

[Figure: the fragments of M that lie on the odd cycle]

There is a generic structure of hor-vert (horizontal-vertical) cycle, made of fragments and “vertical lines”.

There cannot be only one vertical line in an odd cycle.

We merge the rightmost vertical line and the next one, reducing the number of vertical lines by 1.

[Figure: one entry on the rightmost lines is forced ("Must be X"); after merging the rightmost lines we still have a counterexample!]

Hence, there cannot be a minimal (in number of vertical lines) counterexample.

Note: Theorem not true if there are gaps

M:

     1 2 3
1    O - O
2    - O X
3    X X -

[Figure: GF(M) is a triangle on fragments 1, 2, 3, hence not bipartite; GS(M) has nodes 1, 2, 3 and no edges.]

THEOREM 2: For a gapless M, GS(M) is a perfect graph

PROOF: GS(M) is the complement of a comparability graph A. Comparability graphs are perfect, and the complement of a perfect graph is perfect.

Comparability graphs: undirected graphs that can be oriented to become a partial order.

LEMMA: If i < j < k and (i,k) is a SNP conflict, then either (i,j) or (j,k) is also a SNP conflict.

[Figure: two fragments covering columns i, j, k that witness the conflict (i,k). If their entries in column j are equal (e.g. O O), j conflicts with i; if they are different (e.g. O X), j conflicts with k.]

I.e. if (i,j) is not a conflict and (j,k) is not a conflict, then (i,k) is not a conflict either.

So the pairs (u,v) with u < v and u not in conflict with v form a comparability graph A, and GS is the complement of A.

NOTE: independent set on perfect graphs is in P (Lovász, Schrijver, Grötschel, 84)

Hence gapless MSR is polynomial (max stable set on perfect graph).

We have better D.P. algorithms, O(mn + m^2).

What if gaps?

THEOREM: The min SNP removal is NP-hard if there can be gaps (reduction from MAXCUT).

Again, gaps must be long for the problem to be difficult.

We have an O(mn^(2L+1) + n^(2L+2)) D.P. for MSR on a matrix with total gaps length L.

The Population Haplotyping Problem

The input is GENOTYPE data

INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }

Each genotype is explained by two haplotypes:

xx??x = xxoxx + xxxox
????x = oooxx + xxxox
xxoxx = xxoxx + xxoxx
?x??x = xxoxx + oxxox
oooxx = oooxx + oooxx

We will define some objectives for H.
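Whether a candidate set H really explains every genotype in G can be checked mechanically. The sketch below assumes a genotype is explained when some pair of haplotypes from H (possibly the same one twice) has it as union; `genotype` and `explanations` are hypothetical names.

```python
from itertools import combinations_with_replacement


def genotype(h1, h2):
    """Union of two haplotypes: agreeing sites keep their symbol, others become '?'."""
    return "".join(a if a == b else "?" for a, b in zip(h1, h2))


def explanations(H, g):
    """All unordered pairs of haplotypes in H whose union is the genotype g."""
    return [(a, b)
            for a, b in combinations_with_replacement(sorted(H), 2)
            if genotype(a, b) == g]


G = ["xx??x", "????x", "xxoxx", "?x??x", "oooxx"]
H = {"xxoxx", "xxxox", "oooxx", "oxxox"}

for g in G:
    print(g, explanations(H, g))
```

Every genotype of the example has at least one explaining pair in H, so this H is a feasible output.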

1st Objective (open research problem): minimize |H|

2nd Objective: based on an inference rule

Inference Rule

  xoxxooxoxx    known haplotype h
+ xxoxooxxxo    new (derived) haplotype h’
= x??xoox?x?    known (ambiguous) genotype g

We write h + h’ = g

g and h must be compatible to derive h’
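The inference rule can be sketched directly: h is compatible with g when it matches g on every unambiguous site, and h’ copies h there and takes the opposite allele on every ? site. The names `compatible` and `derive` are assumptions made for this illustration.

```python
def compatible(h, g):
    """True if haplotype h agrees with genotype g on every non-'?' site."""
    return len(h) == len(g) and all(c == "?" or c == a for a, c in zip(h, g))


def derive(h, g):
    """Return h' such that h + h' = g, or None if h is not compatible with g."""
    if not compatible(h, g):
        return None
    flip = {"x": "o", "o": "x"}
    return "".join(flip[a] if c == "?" else a for a, c in zip(h, g))


print(derive("xoxxooxoxx", "x??xoox?x?"))
```

Applied to the slide's h and g, this reproduces the derived haplotype shown there.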

2nd Objective (Clark, 1990)

1. Start with H = nonambiguous genotypes
2. while there exists an ambiguous genotype g in G
3. take h in H compatible with g and let h + h’ = g
4. set H = H + {h’} and G = G - {g}
5. end while

If, at the end, G is empty, SUCCESS, otherwise FAILURE.

Step 3 is non-deterministic.

Example: G = { oooo, xooo, ??oo, xx?? }

- Resolving ??oo with oooo gives xxoo, and then xx?? with xxoo gives xxxx: SUCCESS
- Resolving ??oo with xooo gives oxoo: FAILURE (can’t resolve xx??)

OBJ: find the order of application of the rule that leaves the fewest elements in G
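The effect of the non-deterministic choice in step 3 can be reproduced with a small sketch. The tie-breaking function `choose` is an assumption introduced here to make the choice explicit, and the loop stops when no remaining genotype has a compatible haplotype; this is an illustration of the rule, not Clark's original implementation.

```python
def compatible(h, g):
    """True if haplotype h agrees with genotype g on every non-'?' site."""
    return len(h) == len(g) and all(c == "?" or c == a for a, c in zip(h, g))


def derive(h, g):
    """Return h' such that h + h' = g (h is assumed compatible with g)."""
    flip = {"x": "o", "o": "x"}
    return "".join(flip[a] if c == "?" else a for a, c in zip(h, g))


def clark(genotypes, choose=min):
    """Apply the inference rule until stuck; return (H, unresolved genotypes)."""
    H = {g for g in genotypes if "?" not in g}
    G = [g for g in genotypes if "?" in g]
    progress = True
    while G and progress:
        progress = False
        for g in G:
            candidates = [h for h in H if compatible(h, g)]
            if candidates:
                H.add(derive(choose(candidates), g))
                G.remove(g)
                progress = True
                break  # restart the scan, since H has changed
    return H, G


data = ["oooo", "xooo", "??oo", "xx??"]
print(clark(data, choose=min))  # picks oooo for ??oo
print(clark(data, choose=max))  # picks xooo for ??oo
```

With `choose=min` the run ends with an empty G (SUCCESS); with `choose=max` the genotype xx?? is left unresolved (FAILURE), matching the two outcomes of the slide's example.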

- Problem is APX-hard (Gusfield, 00)

- Graph model + Integer Programming for practical solution (Gusfield, 01)

1. Expand genotypes, e.g. x??o? expands to
   xxxox, xxxoo, xxoox, xxooo, xoxox, xooox, xoxoo, xoooo

2. Create arc (h, h’) if there exists g s.t. h’ can be derived from g and h

3. Largest number of nodes in a forest rooted at unambiguous genotypes = largest number of ambiguous genotypes resolved

Hence, find the largest number of nodes in a forest rooted at unambiguous genotypes. Use an I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?
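Step 1 of the graph model, and the reason the reduction is exponential, can be seen from a short sketch: a genotype with k ambiguous sites expands into 2^k haplotypes. The function name `expand` is hypothetical.

```python
from itertools import product


def expand(g):
    """All haplotypes compatible with genotype g (each '?' becomes x or o)."""
    options = [("x", "o") if c == "?" else (c,) for c in g]
    return ["".join(p) for p in product(*options)]


haplotypes = expand("x??o?")
print(len(haplotypes), haplotypes)
```

For x??o? this gives the same 8 haplotypes as the slide; a genotype with 20 ambiguous sites would already produce over a million nodes.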

3rd Objective (open research problem): Disease Detection

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }

xx??x = xxoxx + xxxox
????x = oooxx + xxxox
??oxx = xxoxx + oooxx
?x??x = xxoxx + oxxox
oooxx = oooxx + oooxx

H contains H’, s.t. each diseased individual has one haplotype in H’ and each healthy individual has none.

minimize |H’|

THE END © MMII G.L.