Solving the Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation
Alessandra Godi, IASI (CNR), Roma
Martine Labbé, Université Libre de Bruxelles
Airo Winter 2007 - Cortina d’Ampezzo, February 5th-9th, 2007
The alphabet of life…
Base pairs (A-T, G-C) are complementary
DNA structure=Double Helix (Watson-Crick)
Basic unit = nucleotide: sugar, phosphate, base (A, G, T, C)
Humans have 23 pairs of chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes.
Each chromosome includes hundreds of different genes.
In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes.
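Human Chromosomes
[Figure: the father (chromosome copies CP1, CP2) and the mother (CM1, CM2); each child inherits one paternal copy CP and one maternal copy CM.]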
Human Chromosomes
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGAT
AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT
A single ‘copy’ of a chromosome is called a haplotype, while a description of the mixed data on the two ‘copies’ is called a genotype.
For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect.
Genotype data is easy to collect.
All humans are about 99.9% identical.
Diversity? Polymorphism.
A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
SNP (Single Nucleotide Polymorphism)
[Figure: the six chromosome copies aligned; they are identical except at the highlighted SNP sites.]
SNP (Single Nucleotide Polymorphism)
              SNP 1          SNP 2          SNP 3        SNP 4
Haplotype 1:  A              G              A            C
Haplotype 2:  T              T              A            C
Genotype:     A/T            T/G            A            C
              Heterozygous   Heterozygous   Homozygous   Homozygous
SNP: encoding
The four SNP columns of the six chromosome copies, encoded as binary strings (one bit per copy):
SNP 1: 011000    SNP 2: 100011    SNP 3: 110000    SNP 4: 000111
              SNP 1   SNP 2   SNP 3   SNP 4
Haplotype 1:  0       1       1       0
Haplotype 2:  1       0       1       0
Genotype:     0/1     1/0     1       0
Encoded:      2       2       1       0
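A minimal sketch of this encoding, applied to the slide's example (haplotypes A G A C and T T A C). The function name `encode` and the `zero_alleles` argument are hypothetical; which allele is called '0' at each SNP is an arbitrary labelling, chosen here to reproduce the slide's bits.

```python
def encode(hap1, hap2, zero_alleles):
    """Encode two allele strings as 0/1 haplotypes plus a 0/1/2 genotype.

    zero_alleles[p] is the allele arbitrarily labelled '0' at SNP p
    (an assumption of this sketch); the other allele is labelled '1'.
    """
    b1 = [0 if a == z else 1 for a, z in zip(hap1, zero_alleles)]
    b2 = [0 if a == z else 1 for a, z in zip(hap2, zero_alleles)]
    # A site where the two copies agree keeps its bit; a site where they
    # differ (heterozygous) is coded as 2.
    genotype = [x if x == y else 2 for x, y in zip(b1, b2)]
    return b1, b2, genotype


# Labelling used on the slide: A=0 at SNP 1, T=0 at SNP 2, C=0 at SNPs 3 and 4.
h1, h2, g = encode("AGAC", "TTAC", zero_alleles="ATCC")
print(h1, h2, g)
```

With this labelling the genotype comes out as 2 2 1 0, matching the encoded row of the slide.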
Haplotyping of a population
Given a set G of genotypes (strings in {0,1,2}^n), find a set H of “generating” haplotypes (strings in {0,1}^n).
The GENOME is the set of genetic information which lies in the DNA sequence of each living organism.
The DNA sequence is a linear arrangement of 4 different molecules, called nucleotides or bases: A, T, C, G.
The bases are paired with each other by hydrogen bonds.
Differences in the DNA account for the differences between individuals of the same species.
What makes us different from each other is called polymorphism.
At the DNA level, a polymorphism is a nucleotide sequence which varies within a chromosome population:
atcagattagttagggcacaggacggac
atcagattagttagggcacaggacgtac

atccgattagttagggcacaggacgtac
atccgattagttagggcacaggacggac

atccgattagttagggcacaggacgtac
atcagattagttagggcacaggacggac

atcagattagttagggcacaggacgtac
atcagattagttagggcacaggacgtac

atcagattagttagggcacaggacggac
atccgattagttagggcacaggacggac
Single Nucleotide Polymorphism (SNP)
[Figure: the same ten chromosome sequences, with the two SNP sites highlighted.]
HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles
HAPLOTYPES: chromosome at SNP level
GENOTYPES: “union” of two haplotypes
HOMOZYGOUS (O): same allele on both chromosomes
HETEROZYGOUS (E): different alleles

Individual   Haplotypes   Genotype
1            ag  at       Oa E
2            ct  cg       Oc E
3            ct  ag       E  E
4            at  at       Oa Ot
5            ag  cg       E  Og
CODING: each SNP has only 2 possible values in a biological population. Let us call them ‘0’ and ‘1’. Moreover, let ‘2’ denote a heterozygous site.

Individual   Haplotypes   Genotype
1            01  00       02
2            10  11       12
3            10  01       22
4            00  00       00
5            01  11       21

$\oplus : \{0,1\} \times \{0,1\} \to \{0,1,2\}$
$0 \oplus 0 = 0, \quad 1 \oplus 1 = 1, \quad 0 \oplus 1 = 1 \oplus 0 = 2$
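The coding rule above can be sketched in a few lines. The function name `conflate` is an assumption of this sketch (the slides only give the operator table); it is applied to the five coded haplotype pairs of the coding slide.

```python
def conflate(h1, h2):
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Equal bits are kept; different bits give '2' (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The five individuals of the coding slide, as pairs of coded haplotypes.
pairs = [("01", "00"), ("10", "11"), ("10", "01"), ("00", "00"), ("01", "11")]
genotypes = [conflate(a, b) for a, b in pairs]
print(genotypes)
```

The operator is symmetric, so the order of the two haplotypes in a pair does not matter.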
HAPLOTYPING of a population
Given a set G of genotypes (strings in {0,1,2}^n), find a set H of generator haplotypes (strings in {0,1}^n).
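To make the problem statement concrete, here is a brute-force sketch that finds a smallest generating set for a tiny instance. It is only an illustration under stated assumptions: it enumerates all 2^n candidate haplotypes and all subsets by increasing size, so it is usable for very small n only, and it is not one of the formulations presented in this talk. The names `conflate` and `min_haplotypes` are hypothetical.

```python
from itertools import combinations, product


def conflate(h1, h2):
    """Genotype generated by two haplotypes: equal bits kept, different bits -> '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def min_haplotypes(genotypes):
    """Return a smallest set H such that every genotype is generated by two
    (possibly equal) members of H. Exhaustive search, exponential time."""
    n = len(genotypes[0])
    candidates = ["".join(bits) for bits in product("01", repeat=n)]
    for size in range(1, len(candidates) + 1):
        for H in combinations(candidates, size):
            if all(any(conflate(a, b) == g for a in H for b in H)
                   for g in genotypes):
                return list(H)
    return None


solution = min_haplotypes(["021", "002", "012"])
print(solution)
```

For these three genotypes each one has a single heterozygous site, so its generating pair is forced and four haplotypes are needed.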
HAPLOTYPING of a population: State of the Art
Perfect Phylogeny (Bafna, Gusfield, Yooseph 02)
Estimation of haplotype frequencies (probabilistic studies: Fallin - Schork, 00)
Parsimony Objective (Gusfield 02, Brown 05)
HAPLOTYPING of a population: Parsimony Objective (NP-hard)
Combinatorial methods (Gusfield 2002, Brown 2004, Lancia - Rizzi 2002):
Exponential and polynomial ILP formulations
Rule-based methods (HAPINFER - Clark 1990):
Starting from genotypes, haplotypes are inferred
Statistical methods (PHASE - Stephens 2004, HAPLOTYPER - Niu 2001, GERBIL - Shamir 2005)
HAPLOTYPING of a population: our approach to the problem by using ILP
A new exponential formulation
1. A pure set covering model obtained from Gusfield's (2002) model by the Fourier-Motzkin procedure
2. A branch and cut procedure to decrease the number of constraints
A new polynomial formulation
A formulation using class representatives
A new polynomial formulation
G = {g1, g2, …, gm}: genotypes of length n
I = {h1, …, hq}: a solution of the problem
Main idea: class representatives
Each haplotype induces a subset of genotypes, ordered by index, and each genotype belongs to exactly two of these subsets:
h1 → {gi, gj, gk, …} = Si
h2 → {gi, gl, gr, gs, …} = Si’
h3 → {gk, gl, gs, gt, …} = Sk
….
The smallest-index genotype identifies the subset; the prime is added if the corresponding index has already been used.
K = {1, 2, …, m}, K’ = {1’, 2’, …, m’}
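The labelling rule can be sketched as follows. This is a hypothetical helper, not code from the talk: the input format (a dictionary from genotype index to its two explaining haplotypes) and the name `label_classes` are assumptions. The data is the small instance g1 = 021, g2 = 002, g3 = 012 with h1 = 001 and h2 = 011; the two remaining haplotypes 000 and 010 are not listed on the slides and are derived here from g2 and g3.

```python
def label_classes(explained_by):
    """Map each haplotype of a solution to its class label.

    explained_by: {genotype index k: (h_a, h_b)}, the two haplotypes
    explaining genotype k (hypothetical input format).
    Returns {haplotype: label}, where label is "i" or "i'" and i is the
    smallest genotype index in the subset induced by the haplotype.
    """
    classes = {}
    for k, haps in explained_by.items():
        for h in set(haps):
            classes.setdefault(h, set()).add(k)
    labels, used = {}, set()
    # Visit subsets by smallest index; ties are broken by haplotype string.
    for h in sorted(classes, key=lambda h: (min(classes[h]), h)):
        i = min(classes[h])
        # The prime is used when index i already identifies another subset.
        labels[h] = f"{i}'" if i in used else str(i)
        used.add(i)
    return labels


solution = {1: ("001", "011"), 2: ("001", "000"), 3: ("011", "010")}
labels = label_classes(solution)
print(labels)
```

Since genotype i belongs to at most two subsets, at most two subsets can share the smallest index i, which is why one primed copy of each index is enough.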
A new polynomial formulation
VARIABLES
$y^k_{\{i,j\}} = 1$ if genotype $g_k$ belongs to two subsets of genotypes, one whose smallest-index genotype is $i$ and the other whose smallest-index genotype is $j$; $0$ otherwise.
$k \in K, \quad i, j \in K \cup K'$
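A new polynomial formulation
Example: g1 = 021, g2 = 002, g3 = 012
h1 = 001 → {g1, g2} = S1
h2 = 011 → {g1, g3} = S1’
$y^1_{\{1,1'\}} = 1$
Let us note that some y variables do not exist, for instance $y^2_{\{1',2'\}} = 0$. If $y^2_{\{1',2'\}} = 1$, then g2 would belong to both S1’ and S2’; but S2’ is used only when S2 already exists, and S2 contains g2, so g2 would belong to three subsets:
S1 = {g1, …}, S1’ = {g1, g2, …}
S2 = {g2, …}, S2’ = {g2, …}
Absurd!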
A new polynomial formulation
VARIABLES
$x_i = 1$ if the solution contains a subset of genotypes whose smallest-index genotype is $i$; $0$ otherwise. $i \in K \cup K'$
$z_{i,p} \in \{0,1\}$: the value of the p-th coordinate of the haplotype explaining the subset of genotypes used in the solution whose smallest-index genotype is $i$. $i \in K \cup K', \; p \in SNP$
OBJECTIVE FUNCTION:
$\min \sum_{i \in K \cup K'} x_i$
A new polynomial formulation
CONSTRAINTS:
1. $x_i \ge x_{i'} \quad \forall i \in K, \; i' \in K'$
2. $\sum_{i,j \in K \cup K',\; i \le k,\; j \le k} y^k_{\{i,j\}} = 1 \quad \forall k \in K$
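A new polynomial formulation
CONSTRAINTS:
3. $\sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\; j < i} y^k_{\{j,i\}} \le x_i \quad \forall k \in K, \; i \in K \cup K'$
3a. $y^k_{\{k,k'\}} \le x_{k'} \quad \forall k \in K \quad (i = k')$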
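A new polynomial formulation
CONSTRAINTS:
4a. $z_{i,p} = 0 \quad \forall i \in K \cup K', \; \forall p \in SNP \text{ s.t. } g_i(p) = 0$
4b. $z_{i,p} = 1 \quad \forall i \in K \cup K', \; \forall p \in SNP \text{ s.t. } g_i(p) = 1$
4c. $z_{i,p} + z_{j,p} = 1 \quad \forall \{i,j\} \subseteq K \cup K', \; \forall p \in SNP \text{ s.t. } g_i(p) = 2$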
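A new polynomial formulation
CONSTRAINTS:
5. $z_{i,p} \le 1 - \sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} - \sum_{j \in K \cup K',\; j < i} y^k_{\{j,i\}} \quad \forall k \in K, \; i \in K \cup K', \; \forall p \in SNP : g_k(p) = 0$
5a. $y^k_{\{k,k'\}} + z_{k',p} \le 1 \quad \forall k \in K \; (i = k'), \; \forall p \in SNP : g_k(p) = 0$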
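A new polynomial formulation
CONSTRAINTS:
6. $z_{i,p} \ge \sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\; j < i} y^k_{\{j,i\}} \quad \forall k \in K, \; i \in K \cup K', \; \forall p \in SNP : g_k(p) = 1$
6a. $z_{k',p} \ge y^k_{\{k,k'\}} \quad \forall k \in K \; (i = k'), \; \forall p \in SNP : g_k(p) = 1$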
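A new polynomial formulation
CONSTRAINTS:
7. $z_{i,p} + z_{j,p} \ge y^k_{\{i,j\}} \quad \forall k \in K, \; i,j \in K \cup K', \; \forall p \in SNP : g_k(p) = 2$
7a. $z_{i,p} + z_{j,p} \le 2 - y^k_{\{i,j\}} \quad \forall k \in K, \; i,j \in K \cup K', \; \forall p \in SNP : g_k(p) = 2$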
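A small sanity check of what constraints 5 to 7a are meant to enforce. The sketch below uses a simplified reading that is an assumption, not the full model: genotype k is assigned to exactly the two classes i and j, so every sum of y variables touching i or j equals y. Under that reading, for y = 1 the constraints hold at a SNP exactly when the two haplotype bits generate the genotype value, and for y = 0 they are never binding. The function names are hypothetical.

```python
from itertools import product


def linking_ok(g, zi, zj, y=1):
    """Check the per-SNP linking constraints for one genotype value g in {0,1,2}."""
    if g == 0:
        return zi <= 1 - y and zj <= 1 - y   # constraint 5, for class i and class j
    if g == 1:
        return zi >= y and zj >= y           # constraint 6, for class i and class j
    return y <= zi + zj <= 2 - y             # constraints 7 and 7a


def conflate_bit(a, b):
    """Genotype value generated by two haplotype bits."""
    return a if a == b else 2


# With y = 1 the constraints are satisfied exactly when zi and zj generate g.
for g, zi, zj in product((0, 1, 2), (0, 1), (0, 1)):
    assert linking_ok(g, zi, zj, y=1) == (conflate_bit(zi, zj) == g)

# With y = 0 every combination is allowed.
relaxed = all(linking_ok(g, zi, zj, y=0)
              for g, zi, zj in product((0, 1, 2), (0, 1), (0, 1)))
print(relaxed)
```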
Preliminary results

10x10               Opt   zLP     sec zLP   LP iter   sec zILP   MIP iter   B&B nodes
Poly                15    12      0,01      54        0,12       263        14
Brown model ['05]   15    2       0,05      140       4,85       16,646     1360

15x15               Opt   zLP     sec zLP   LP iter   sec zILP   MIP iter   B&B nodes
Poly                27    22,83   0,01      173       0,08       173        11
Brown model ['05]   27    8       0,02      129       4.25       19.301     2.213

20x20               Opt   zLP     sec zLP   LP iter   sec zILP   MIP iter   B&B nodes
Poly                16    15      0,2       268       16         573        9
Brown model ['05]   16    3       0,07      598       27.604     16*10^6    540.623
Let G be the genotype set and $\hat{H}$ the set of haplotypes which are compatible with some genotype in G.
INTEGER VARIABLES
$x_h = 1$ if h is chosen, $0$ otherwise
$y_{h_1,h_2} = 1$ if $(h_1,h_2)$ is selected, $0$ otherwise
For each $g \in G$:
$P_g = \{(h_1,h_2) \text{ with } h_1,h_2 \in \hat{H} \mid h_1 \oplus h_2 = g\}$
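The set of pairs explaining a genotype can be enumerated directly from its heterozygous sites. A minimal sketch follows; the function name `explaining_pairs` is hypothetical, and pairs are listed as unordered (each pair appears once).

```python
from itertools import product


def explaining_pairs(g):
    """List all unordered pairs (h1, h2) of 0/1 strings that generate genotype g."""
    het = [p for p, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]  # no heterozygous site: the haplotype is g itself, twice
    pairs = []
    # h1 is fixed to '0' at the first heterozygous site so that each
    # unordered pair is produced exactly once.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for p, b in zip(het, ("0",) + bits):
            h1[p] = b
            h2[p] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(explaining_pairs("021"))
print(explaining_pairs("22"))
```

A genotype with k ≥ 1 heterozygous sites has 2^(k-1) explaining pairs, which is the source of the exponential size of this family of formulations.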
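From Gusfield’s formulation (2002)…
OBJECTIVE FUNCTION
$\min \sum_{h \in \hat{H}} x_h$
CONSTRAINTS
1. $\sum_{(h_1,h_2) \in P_g} y_{h_1,h_2} \ge 1 \quad \forall g \in G$
2. $y_{h_1,h_2} \le x_{h_1} \quad \forall (h_1,h_2) \in P_g, \; g \in G$
3. $y_{h_1,h_2} \le x_{h_2} \quad \forall (h_1,h_2) \in P_g, \; g \in G$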
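From Gusfield’s formulation (2002)…
…to a new set covering formulation by using the Fourier-Motzkin procedure
$\min \sum_{h \in \hat{H}} x_h$
s.t. $\sum_{(h_1,h_2) \in P_g,\; h = h_1 \vee h = h_2} x_h \ge 1 \quad \forall g \in G$
(one inequality for every way of choosing, in each pair $(h_1,h_2) \in P_g$, either $h = h_1$ or $h = h_2$)
$x \in \{0,1\}^{|\hat{H}|}$
Set covering: genotype structure + basic set covering (SC) theory → facets and valid inequalities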
N is the set of SNPs
$F = \{p \in N : g(p) \in \{0,1\}\}$
[Figure: the sites of a genotype g split into the fixed sites F and the free sites N\F.]
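Set-covering for HIP
Proposition
1. The polytope HSC is full-dimensional iff $\forall g \in G$, $|N \setminus F| = 2$.
2. $x_j \ge 0$ is a facet of HSC iff, for every $g \in G$ for which there exists $h_i$ s.t. $h_j \oplus h_i = g$, we have $|N \setminus F| = 3$.
3. $x_j \le 1$ is a facet $\forall j$.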
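Set-covering for HIP
Theorem
Let us consider a genotype g and a subset S of haplotypes which are associated to a minimal set covering inequality:
$\sum_{i \in S} x_i \ge 1$
This inequality is facet-defining iff for each genotype $g' \ne g$ one of the following conditions holds:
1. $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') \ne \emptyset$
2. $|C| = |(N \setminus F') \cap F| \ge 3$
where N is the set of SNPs and
$F = \{p \in N : g(p) \in \{0,1\}\}$
$F' = \{p \in N : g'(p) \in \{0,1\}\}$
$C = (N \setminus F') \cap F$
[Figure: the sites of g and g' split into fixed sites (F, F') and free sites (N\F, N\F').]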
Set-covering for HIP
NOTE: For the following cases:
1st case: $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') = \emptyset$
2nd case: $|C| = |\{p\}| = 1$
3rd case: $C = \emptyset$
the set covering inequality $\sum_{h \in S} x_h \ge 1$ is dominated by another one that can be defined by using a SEQUENTIAL LIFTING procedure.
Set-covering for HIP: main idea
To overcome the exponential size of the formulation:
1. Add only set-covering inequalities which are facet-defining
2. Add them within a branch and cut procedure
Set-covering for HIP: a branch and cut procedure
x*: a fractional solution of a subproblem of the original one
g: explained by the pairs (h1, h2), (h3, h4), (h5, h6), (h7, h8)
All set covering inequalities associated with g have the following structure:
x{1 or 2} + x{3 or 4} + x{5 or 6} + x{7 or 8} ≥ 1
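The structure above means one inequality per way of picking a haplotype from each pair. A minimal sketch that enumerates them for the genotype g with four pairs follows; the function name `covering_inequalities` is hypothetical and haplotypes are represented by their indices 1 to 8.

```python
from itertools import product


def covering_inequalities(pairs):
    """Yield the haplotype sets S of all inequalities sum_{h in S} x_h >= 1,
    obtained by picking one haplotype from every explaining pair."""
    for choice in product(*pairs):
        yield frozenset(choice)


pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
ineqs = list(covering_inequalities(pairs))
print(len(ineqs))
```

The count is 2 raised to the number of pairs, so a genotype with 16 explaining pairs already yields 2**16 = 65536 inequalities; this is why the inequalities are generated on demand rather than listed up front.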
Set-covering for HIP: a branch and cut procedure
We want to find a set covering inequality of g that is violated by x*, i.e. such that
min {x*1, x*2} + min {x*3, x*4} + min {x*5, x*6} + min {x*7, x*8} < 1
If it exists, we have found a set covering inequality which cuts off x*!
We choose to add it to the system only if it is facet-defining.
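The separation test can be sketched as below: taking the smaller x* value in every pair gives the covering inequality of g with the smallest left-hand side, so it is violated if and only if some inequality of g is. The function name `separate`, the dictionary representation of x* and the tolerance `eps` are assumptions of this sketch; the facet-defining check mentioned on the slide is not included.

```python
def separate(pairs, x_star, eps=1e-9):
    """Return the most violated covering inequality of one genotype, or None.

    pairs:  explaining pairs (h1, h2) of the genotype
    x_star: dict haplotype -> fractional value (missing keys count as 0)
    """
    # In each pair keep the haplotype with the smaller fractional value.
    S = [min(pair, key=lambda h: x_star.get(h, 0.0)) for pair in pairs]
    lhs = sum(x_star.get(h, 0.0) for h in S)
    return S if lhs < 1 - eps else None


pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
x_frac = {1: 0.5, 2: 0.1, 3: 0.2, 4: 0.9, 5: 0.3, 6: 0.3, 7: 0.0, 8: 1.0}
cut = separate(pairs, x_frac)
print(cut)
```

Here the left-hand side is 0.1 + 0.2 + 0.3 + 0.0 = 0.6 < 1, so the inequality x2 + x3 + x5 + x7 ≥ 1 cuts off x*. The test costs one pass over the pairs of g, even though g has exponentially many inequalities.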
Branch and Cut preliminary results

Instance             Av. on max # of 2s   # constr. master problem   # constr. reduced problem   # added cuts   Solving time
50 genos, 10 SNPs    5                    >60.000                    7                           30             0.00 sec
50 genos, 30 SNPs    8                    >2^512                     7                           200            0.05 sec

Average on 10 samples for each kind of instance, generated by MS (Hudson, 2002) with recombination level r = 0
Future Works
On Polynomial formulation:
1. Strengthening of the model by clique inequalities on the genotype conflict graph
2. Cplex Concert Technology
3. More tests vs other polynomial formulations
On Exponential formulation:
1. Implementation of the lifting procedure
2. More tests in comparison with Gusfield's formulation