Selection of Multiple SNPs in Case-Control Association Study Using a
Discretized Network Flow Approach
Shantanu Dutt, Yang Dai
Huan Ren, Joel Fontanarosa
University of Illinois at Chicago
Outline
- Background: Genome Wide Association Study
- Problem Definition
- Previous Work
- Our Work:
  - MIP Formulations
  - Discretized Network Flow (DNF) Opt. Method
  - DNF Solutions for k-SNP Selection w/ Clustering/Classification
- Experimental Results
- Conclusions
Genetic Association Studies
- Goal: Find markers of variation that reliably distinguish individuals with a disease from a healthy population
- Single Nucleotide Polymorphisms (SNPs) are the simplest and most common form of variation in the human genome.
- Each chromosome has one of two alleles for each SNP
  - Possible Genotypes = {0/0, 0/1, 1/1}
- Variations measured at specific SNP loci have been shown to be associated with numerous traits and diseases.
[Figure: the SNP alleles on chrom 1 and chrom 2 of Person 1, Person 2, and Person 3]
Genetic Association Studies (contd)
[Figure: Genomic Variation -> Gene, Protein, or Cellular Alteration/Regulation -> Altered Phenotype]
Altered Phenotype:
- Individual traits (e.g., height, hair color)
- Causal factors for disease
- Increased risk factors for complex disease
Images: PDB (www.rcsb.org); Robbins and Cotran, 7th Ed., 2005
Genetic Association Studies (contd)
- Complex traits cannot be mapped to a single genetic locus
- Multiple interacting genetic influences combine with environmental factors to produce an outcome
[Figure: Gene Networks (A, B, ..., X) and Environment jointly lead to Disease]
Genetic Association Studies (contd)
- Genome Wide Association Study (GWAS):
  - Measure a large number of SNPs (typically 500K-1M) across the genome in a large case-control study (often >1000 patients)
  - Results are commonly reported based on individual χ² values, ignoring potentially powerful interaction effects
  - It remains an open computational and statistical challenge to reliably analyze epistasis, or gene-gene interactions, in large-scale GWAS.
- Different genetic variations -> common complex disease
- Problem Definition: For a given set P of cases and a set Q of controls, classify the cases into different clusters and simultaneously select k significant marker SNPs for each cluster (those that strongly distinguish its cases from the set Q)
- In this paper, we present a new optimization technique called discretized network flow (DNF) for the above problem
Examples of Epistasis Methods
- Combinatorial
  - MDR = multifactor dimensionality reduction
  - CSP = combinatorial search based prediction
  - CPM = combinatorial partitioning method
- Probabilistic
  - BEAM = Bayesian Epistasis Association Mapping
    - Bayesian partitioning model resolved by Markov Chain Monte Carlo (MCMC) methods
- megaSNPhunter
  - Hierarchical learning algorithm (regression trees)
  - Primarily considers local interaction effects
References:
- MDR: Ritchie et al., Gen Epid, 2003
- CSP: Brinza et al., WABI'06
- CPM: Nelson et al., Genome Research, 2001
- BEAM: Zhang and Liu, Nature Genetics, 2007
- megaSNPhunter: Wan et al., BMC Bioinformatics, 2009
MDR
1. Divide data into training and testing sets
2. Select a set of N factors
3. If (affected/unaffected) > T (e.g. T = 1.0), high risk; o/w low risk
4. Select model with best misclassification error
5-6. Estimate the model prediction error using the testing data set.
Repeat these steps for each cross validation iteration, and for each possible combination of factors.
Adapted from Ritchie et al., Gen Epid, 2003
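Steps 3 and 4 can be illustrated with a minimal Python sketch. It assumes each individual is represented by a tuple of genotype values for the N selected factors and that the affected/unaffected ratio is computed per genotype combination; the function names and the toy data are hypothetical, not taken from the MDR software.

```python
from collections import Counter

def mdr_risk_labels(case_genotypes, control_genotypes, threshold=1.0):
    """Label each genotype combination high- or low-risk (step 3)."""
    cases = Counter(case_genotypes)
    controls = Counter(control_genotypes)
    labels = {}
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        # Assumption: a combination seen only in cases counts as high risk.
        ratio = n_case / n_ctrl if n_ctrl else float("inf")
        labels[cell] = "high" if ratio > threshold else "low"
    return labels

def misclassification_error(labels, case_genotypes, control_genotypes):
    """Fraction of individuals whose risk label disagrees with their status (step 4)."""
    errors = sum(labels.get(g, "low") != "high" for g in case_genotypes)
    errors += sum(labels.get(g, "low") == "high" for g in control_genotypes)
    return errors / (len(case_genotypes) + len(control_genotypes))

# Toy data: genotype tuples for N = 2 factors (hypothetical values).
cases = [(0, 1), (0, 1), (0, 1), (2, 2), (1, 1)]
controls = [(0, 1), (2, 2), (2, 2), (1, 1), (1, 1)]
labels = mdr_risk_labels(cases, controls)
error = misclassification_error(labels, cases, controls)
```

In a full MDR run this labeling would be repeated for every factor combination and every cross validation split; the sketch covers a single combination only.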
CSP: Combinatorial Methods for Disease Association Search and Susceptibility Prediction
- Risk/resistance factor: multi-SNP combination (MSC)
- Problem: Find all MSCs significantly associated with the disease
- Cluster C: subset of S with an MSC; S: the original SNP set
  - d(C): # of diseased, h(C): # of non-diseased
- Combinatorial search
  - Definition: Disease-closure of a multi-SNP combination C is a multi-SNP combination C', with maximum number of SNPs, which consists of the same set of disease individuals and minimum number of non-disease individuals.
  - Searches only closed clusters
    - Closure of cluster C = C': d(C') = d(C) and h(C') is minimized
    - Avoids checking of trivial MSCs
    - Small d(C) implies not looking in subclusters
  - Finds associated MSCs faster, but is still too slow
- Tagging:
  - compress the SNP set by extracting most informative SNPs
  - restore other SNPs from tag SNPs
  - multiple regression method for tagging
Brinza, D., Zelikovsky, WABI'06
Our Work: MIP Formulation
Notations:
- p_{i,j}(x) (0 ≤ j ≤ 2): = 1 if allele j is present on SNP i for individual x; = 0 otherwise.
- Marker m_{i,j}^{val} (val = 0, 1):
  - m_{i,j}^{1} means presence of allele j in SNP i
  - m_{i,j}^{0} means absence of allele j in SNP i
- Per-case benefit function of SNP i and allele j (nc is the # of controls):

$$ b_{i,j}(x) = \left| \, p_{i,j}(x) - \frac{\sum_{z=1}^{nc} p_{i,j}(z)}{nc} \, \right| $$

Claim: b_{i,j}(x) is consistent with the specificity provided by selecting marker m_{i,j}^{p_{i,j}(x)}
- When p_{i,j}(x) = 1: a higher b_{i,j}(x) means a lower fraction of non-patients have p_{i,j} = 1 = p_{i,j}(x)
- When p_{i,j}(x) = 0: a higher b_{i,j}(x) means a higher fraction of non-patients have p_{i,j} = 1, i.e., a lower fraction have p_{i,j} = 0 = p_{i,j}(x)
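A small numeric illustration of the claim, as a Python sketch. It assumes the per-case benefit is the absolute difference between the case's 0/1 indicator and the mean indicator over the controls; the function name and the control values are hypothetical.

```python
def benefit(p_x, control_values):
    """b_{i,j}(x) = |p_{i,j}(x) - mean of p_{i,j}(z) over the nc controls z|."""
    return abs(p_x - sum(control_values) / len(control_values))

# p_{i,j}(z) for nc = 4 controls: only one control carries allele j on SNP i.
controls = [1, 0, 0, 0]

b_present = benefit(1, controls)  # case x carries the allele
b_absent = benefit(0, controls)   # case x does not carry the allele
```

Here a case that carries the allele gets the larger benefit, because only a quarter of the controls share its value, so a marker requiring presence of the allele mismatches most controls.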
MIP Formulation
Benefit-based case-pair similarity metric (with exponent α):

$$ s(x, y, m_{i,j}^{val}) = \begin{cases} (b_{i,j}(x))^{\alpha} + (b_{i,j}(y))^{\alpha} & \text{if } p_{i,j}(x) = p_{i,j}(y) = val \\ 0 & \text{otherwise (indicating } m_{i,j}^{val} \text{ is not a common marker for patients } x \text{ and } y) \end{cases} $$

MIP formulation for selecting one marker set for all patients:

$$ \text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, d(m_{i,j}^{val}) $$

$$ \text{S.T. } \sum_{j=1}^{3} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le 1 \quad \forall \text{ SNPs } i $$

$$ \sum_{i,j} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le k $$

• d(m_{i,j}^{val}) = 1 if marker m_{i,j}^{val} is selected; np is the # of patients/cases
• At most k markers will be selected
• Linear MIP; MIP can be solved with commercial tools such as CPLEX/LINGO. However, this is very time consuming.
• The similarity definition ensures that only common markers among patients will be selected.
MIP Formulation (contd)
Issue 1:
- Genetic reasons for a disease can differ between patient sets (e.g., w/ different ethnicity).
- Hence, selecting only one marker set is not appropriate (it artificially forces one marker set on the entire patient pop).
- Solution: Simultaneously cluster patients and select different markers for different clusters

$$ \text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le g \le G} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, b_x^g \, b_y^g \, d^g(m_{i,j}^{val}) $$

$$ \text{S.T. } \sum_{j=1}^{3} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le 1 \quad \forall \text{ SNPs } i \text{ and clusters } g $$

$$ \sum_{1 \le g \le G} b_x^g = 1 \quad \forall x, \qquad \sum_{i,j} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le k $$

• b_x^g: = 1 if x is in cluster g
• d^g(m_{i,j}^{val}): = 1 if marker m_{i,j}^{val} is selected for cluster g
• At most G clusters will be generated.
• Cubic MIP!
MIP Formulation (contd)
Issue 2: the sum of benefits is not consistent with the specificity of a set of markers
- Essentially, the previous formulation will select the five common markers with the highest benefit. However, this is not optimal.
[Figure: the control set with the subsets of controls mismatched by marker 1, marker 2, marker 3, and marker 4]
- Individually, markers 1 and 2 provide higher specificity than markers 3 and 4 (they mismatch more controls).
- However, the mismatch sets of markers 1 and 2 have a larger overlap.
- Selecting markers 3 and 4 as the marker set gives higher overall specificity.
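The overlap effect can be checked with a toy example. The sketch below uses Python sets of hypothetical control IDs; it assumes a control is rejected by a marker set as soon as it mismatches at least one marker in the set, so the specificity of a set depends on the union of the individual mismatch sets.

```python
# Hypothetical control IDs 1..10 and, per marker, the controls it mismatches.
controls = set(range(1, 11))
mismatch = {
    "marker 1": {1, 2, 3, 4, 5, 6},
    "marker 2": {1, 2, 3, 4, 5, 7},
    "marker 3": {1, 2, 3, 4, 8},
    "marker 4": {5, 6, 7, 9, 10},
}

def specificity(marker_names):
    """Fraction of controls that mismatch at least one selected marker."""
    rejected = set().union(*(mismatch[name] for name in marker_names))
    return len(rejected) / len(controls)

spec_12 = specificity(["marker 1", "marker 2"])  # large overlap
spec_34 = specificity(["marker 3", "marker 4"])  # disjoint mismatch sets
```

Markers 1 and 2 each mismatch six controls but jointly reject only seven, while markers 3 and 4 each mismatch five controls and jointly reject all ten.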
MIP Formulation (contd)
Adding accurate specificity terms to the obj. func. for each control z:

$$ \prod_{1 \le i \le G} (1 - M_i(z)) \; g_{mis} $$

- M_i(z): whether control z matches the marker set selected for cluster i
- M_i(z) is the mod 2 addition (Boolean OR) of various 0/1 vars
- g_mis: objective function gain for mismatching a control.

Final objective function:

$$ \text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le g \le G} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, b_x^g \, b_y^g \, d^g(m_{i,j}^{val}) \; + \sum_{1 \le z \le nc} \; \prod_{1 \le i \le G} (1 - M_i(z)) \; g_{mis} $$

- At least cubic MIP (if G <= 3)
- g_mis is determined so that specificity and sensitivity are given the same weight.
  - Average gain for a patient matching a marker set: 2 k b_avg^α (np/G), where np is the number of patients, and G is the number of groups.
  - g_mis = 2 k b_avg^α (np/G) * np/nc
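As a worked instance of the weighting, the Python sketch below evaluates g_mis for made-up parameter values. It assumes the expression reads 2·k·b_avg^α·(np/G)·(np/nc), with α an exponent applied to the average benefit b_avg; the function name and all numbers are hypothetical.

```python
def mismatch_gain(k, b_avg, alpha, n_patients, n_controls, n_groups):
    """g_mis = 2*k*b_avg**alpha*(np/G) * np/nc."""
    # Average gain for one patient matching the marker set of its group.
    per_patient_gain = 2 * k * (b_avg ** alpha) * (n_patients / n_groups)
    # Rescale by np/nc so that all patients and all controls carry equal total weight.
    return per_patient_gain * n_patients / n_controls

# Hypothetical setting: k = 5 markers, 100 patients, 200 controls, 4 groups.
g_mis = mismatch_gain(k=5, b_avg=0.5, alpha=2, n_patients=100,
                      n_controls=200, n_groups=4)
```

With twice as many controls as patients, each mismatched control earns half the average per-patient gain (31.25 versus 62.5), which is the sense in which sensitivity and specificity receive equal weight.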
Discretized Network Flow (DNF)
- Standard min-cost network flow
  - Find a min cost way to send a certain amount of flow from the source node (S) to the sink node (T).
  - Solves certain LP problems (continuous solns)
[Figure: example flow network from S to T with each arc labeled (capacity, cost), e.g. (2,0), (1,2), (1,1), (1,4)]
- Some discrete constraints have to be satisfied in order to solve discrete opt. problems like MIP
- One such constraint: Mutually exclusive arc set (MEA):
  - At most one arc of a subset of arcs in this set can have flow on it.
[Figure: an MEA set with f=1, showing an invalid flow and a valid flow]
- Satisfying MEA requirements
  - Add a flow-amount-independent cost C' to each arc in the set: a constant cost C' is incurred whenever there is flow on the arc
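Discretized Network Flow (contd)
[Figure: arc cost as a function of flow f, up to Cap(e). Left: standard linear flow cost with slope c. Right: the same cost with the fixed cost C' added whenever the arc carries flow.]
[Figure: MEA sets with a C' cost on each arc]
- C'inv ≥ C'val + C'
  - C'inv: total C'-related cost for invalid flow
  - C'val: total C'-related cost for valid flow
[Figure: cost axis without C', marking Cinv, Cval, and Cval^min]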
Discretized Network Flow (contd)
Determining C':
1. In the standard network flow graph, heuristically select a valid flow & determine its cost Cval
2. Obtain the min-cost flow, of cost Cinv^min, w/o discretization constraints
3. Set C' = Cval - Cinv^min + 1
- Since C'inv ≥ C'val + C', a valid flow is guaranteed to have a smaller cost than any invalid flow.
- Theorem [Ren et al., ICCAD'08]: A min-cost flow with C'-costs on MEA arcs ensures MEA satisfaction
[Figure: cost axis with C', marking Cval^min + C'val, Cval + C'val, and Cinv + C'inv]
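The choice of C' can be checked on a toy instance. The Python sketch below is an illustration only: it enumerates all integer flows by brute force instead of running a min-cost flow solver, and the three parallel S-to-T arcs, their (capacity, cost) values, and the helper names are hypothetical. The arcs form one MEA set in which at most one arc may carry flow.

```python
from itertools import product

# (capacity, per-unit cost) of three parallel S->T arcs in one MEA set.
ARCS = [(3, 5), (2, 1), (2, 2)]
FLOW = 3  # units to send from S to T

def feasible_flows(arcs, total):
    """All integer flow assignments that respect capacities and send `total` units."""
    ranges = [range(cap + 1) for cap, _ in arcs]
    return [f for f in product(*ranges) if sum(f) == total]

def linear_cost(arcs, flow):
    """Standard flow cost: sum of per-unit cost times flow."""
    return sum(cost * f for (_, cost), f in zip(arcs, flow))

def arcs_used(flow):
    """Number of arcs carrying flow; each one incurs the fixed cost C'."""
    return sum(1 for f in flow if f > 0)

flows = feasible_flows(ARCS, FLOW)
valid = [f for f in flows if arcs_used(f) <= 1]         # MEA-satisfying flows
c_val = linear_cost(ARCS, valid[0])                      # cost of some valid flow
c_inv_min = min(linear_cost(ARCS, f) for f in flows)     # min cost, MEA ignored
c_prime = c_val - c_inv_min + 1

# Min-cost flow once every used MEA arc also pays the fixed cost C'.
best = min(flows, key=lambda f: linear_cost(ARCS, f) + c_prime * arcs_used(f))
```

On this instance the cheapest flow without C' splits over the two cheap arcs and violates the MEA; with C' = 12 added per used arc, the cheapest flow sends all three units over the single expensive arc, which is valid.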
Discretized Network Flow (contd)
- Discrete network flow has been applied to VLSI CAD problems [Ren et al., ICCAD'08], [Ren et al., IWLS'08], [Dutt et al., ICCAD'06]
  - Good run time and scalability.
  - At least 10x to 60x faster than CPLEX with similar quality
- Example: determine optimal cell sizes in a circuit under an area constraint
  - Four sizes available. The number of 0/1 variables is about four times the number of cells considered.
[Figure: Run time (secs) vs. the number of cells considered (0 to 4000), from [Ren et al., IWLS'08]; linear fit y = 0.3823x + 8.5251]
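DNF Model for Single-Cluster Marker Selection
[Figure: flow network from S to T with two columns of patient subgraphs P1, ..., Pm joined by a complete bi-partite graph with meta arcs; flows f = np*k, f = np, and f = 1 are marked]
- Flow through a p_{i,j} node in P_x means d(m_{i,j}^{p_{i,j}(x)}) = 1
- Pairwise connection between p_{i,j} nodes ensures the same marker set is selected for all P_x
- The flow cost incurred for selecting a common marker between two patients is: -s(x, y, m_{i,j}^{p_{i,j}(x)})
[Figure: detail of P_x and P_y, arcs labeled (cap, cost). An arc (np*k, 0) arrives from S; arcs (np, 0) lead to SNP nodes S1, ..., SN under an MEA so that only k arcs can have flow; each SNP node leads to its nodes p_{i,1}, p_{i,2}, p_{i,3} under an MEA; the arc between the p_{i,j} nodes of P_x and P_y is (1, -s(x, y, m_{i,j}^{p_{i,j}(x)})) if p_{i,j}(x) = p_{i,j}(y), with no connection otherwise; P_y leads to T.]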
Marker Selection for Multiple Clusters
- Use multiple copies of the single cluster network model
- For G clusters there will be G copies of the 2-level compl. bipartite graph; not all G clusters may be formed
[Figure: S feeds choice nodes; Cluster 1 and Cluster 2 each contain a complete bipartite graph over P1, P2, P3, P4; both lead to T; MEA sets are marked]
- Example valid flow: Puts patients {1,4} in cluster 1, and {3,2} in cluster 2.
- Invalid flows (Type 1 and Type 2 in the figure):
  - Flow puts P1 in both cluster 1 and 2
  - Flow thru P1 passes thru P2 that is not in the same cluster, incurring false costs.
- MEA prevents invalid flows
Marker Selection for Multiple Clusters
- Issue: When G is large, the network flow graph becomes very complex
- We use iterative bi-partitioning instead
  - Much harder bi-part prob than standard bi-part; the bi-part criterion needs to be selected simultaneously w/ the bi-part!
  - Condition for stopping the bi-partitioning of a cluster: the spec+sens deteriorates
[Figure: tree of successive bi-partitions; clusters that meet the termination condition make up the final solution]
- Another run-time reduction technique: Patient pre-clustering
  - Group patients before using DNF, with a greedy iterative grouping method
    - Initially, each patient is a subgroup
    - Each time, merge the two subgroups with the most common SNP-allele pairs.
    - Termination condition: patients in one group must have at least 70% of SNP-allele pairs in common.
  - Each group is taken as a "meta patient" in DNF
  - Groups are opened up after DNF, and metrics are evaluated at the individual level
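The greedy pre-clustering can be sketched in Python as follows. This is an illustration under stated assumptions: each patient is a set of (SNP, allele) pairs, the pairs common to a group are the intersection over its members, and the 70% threshold is measured against the number of SNPs. The function name and the toy patients are hypothetical.

```python
from itertools import combinations

def precluster(patients, n_snps, min_common=0.7):
    """Greedily merge the two subgroups sharing the most SNP-allele pairs."""
    # Each subgroup: (member names, SNP-allele pairs common to all members).
    groups = [({name}, frozenset(pairs)) for name, pairs in patients.items()]
    while True:
        best = None
        for a, b in combinations(range(len(groups)), 2):
            common = groups[a][1] & groups[b][1]
            if len(common) / n_snps < min_common:
                continue  # merging would break the 70% requirement
            if best is None or len(common) > len(best[2]):
                best = (a, b, common)
        if best is None:  # no admissible merge left: terminate
            return [members for members, _ in groups]
        a, b, common = best
        merged = (groups[a][0] | groups[b][0], common)
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]

# Toy data over 10 SNPs (hypothetical): P1/P2 and P3/P4 differ in one SNP each.
p1 = {(i, 0) for i in range(10)}
p2 = (p1 - {(9, 0)}) | {(9, 1)}
p3 = {(i, 1) for i in range(5)} | {(i, 0) for i in range(5, 10)}
p4 = (p3 - {(5, 0)}) | {(5, 1)}
meta_patients = precluster({"P1": p1, "P2": p2, "P3": p3, "P4": p4}, n_snps=10)
```

Each returned group would then enter DNF as one "meta patient"; the two groups in the toy data share too few pairs to be merged further.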
Chain Structure for Improving Specificity
Specificity term for control z:

$$ \prod_{\text{Cluster } i} (1 - M_i(z)) \; g_{mis} $$

- One chain structure for each control. Two subchains: mismatched (MM) chain and matched (M) chain.
- One injection arc to the M subchain from each cluster: A1, ..., Ag. Injection flow on arc Ai means z matches the selected marker set of cluster i (M_i(z) = 1).
- Any injection flow causes the MEA condition to force the chain flow into the M chain, from which it never switches back. Hence, it incurs 0 cost.
- Chain flow stays on the MM chain if no injection arc has flow, and incurs a cost of -gmis
[Figure: chain structure for control z, arcs labeled (cap, cost). Chain flow runs from S to T along the MM chain (cost = -gmis) or the M chain (cost = 0); injection arcs A1, A2, ..., Ag with (1,0) enter from Cluster 1, Cluster 2, ..., Cluster g; MEA sets are marked; labels Test 1 and Test 2.]
Experimental Results
- Data sets we use
  - Crohn's disease: 144 cases, 243 controls and 103 SNPs
  - Autoimmune disorder: 384 cases, 652 controls and 108 SNPs
  - Tick-borne encephalitis: 21 cases, 54 controls and 41 SNPs
  - Rheumatoid arthritis: 460 cases, 460 controls and 2300 SNPs
  - Lung cancer: 322 cases, 273 controls and 141 SNPs
  - Rheumatoid arthritis (large): 868 cases, 1194 controls and 5000 SNPs
- Machine configuration: 3G cpu, 1G mem, Windows machine.
- Prediction scheme with multiple cluster marker sets
[Figure: an individual is compared with marker set 1 and marker set 2. A match with any marker set: predict as sick. A mismatch with every marker set: predict as healthy.]
- TP: correctly predicted as sick
- FP: falsely predicted as sick
- TN: correctly predicted as healthy
- FN: falsely predicted as healthy
- Sensitivity = TP/(FN+TP)
- Specificity = TN/(FP+TN)
- Accuracy = (TN+TP)/(FP+TN+FN+TP)
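The prediction scheme and the three metrics can be written out as a short Python sketch. It assumes an individual is a set of (SNP, allele) pairs and a marker set maps each selected (SNP, allele) pair to the required value (1 = present, 0 = absent); the function names and the toy data are hypothetical.

```python
def matches(individual, marker_set):
    """True if the individual satisfies every marker in the set."""
    return all(((snp, allele) in individual) == bool(val)
               for (snp, allele), val in marker_set.items())

def predict_sick(individual, marker_sets):
    """Sick if any cluster's marker set matches; healthy if all mismatch."""
    return any(matches(individual, m) for m in marker_sets)

def metrics(cases, controls, marker_sets):
    tp = sum(predict_sick(x, marker_sets) for x in cases)
    fn = len(cases) - tp
    fp = sum(predict_sick(z, marker_sets) for z in controls)
    tn = len(controls) - fp
    return {
        "sensitivity": tp / (fn + tp),
        "specificity": tn / (fp + tn),
        "accuracy": (tn + tp) / (fp + tn + fn + tp),
    }

# Two cluster marker sets and a handful of individuals (all hypothetical).
marker_sets = [{(1, 1): 1, (2, 0): 1}, {(3, 1): 1, (1, 1): 0}]
cases = [{(1, 1), (2, 0)}, {(3, 1), (2, 0)}, {(2, 0)}, {(1, 1), (2, 0), (3, 1)}]
controls = [{(2, 0)}, {(1, 1)}, {(3, 1)}, {(1, 1), (3, 1)}]
result = metrics(cases, controls, marker_sets)
```

In the toy data three of four cases match some marker set and one of four controls does, so all three metrics come out at 0.75.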
Experimental Results
Comparisons to MDR:
[Figure: Sens. and Spec. bar charts (axis 0 to 120) for MDR(k=5), DNF(k=5), and DNF(k=10) on Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., and Avg. Data labels on the charts: 87.6, 56.7, 81.9, 48.8, 88.4, 78.1. Annotated improvements: 38% relatively and 79% relatively.]

# of clusters:
Data set       K=5   K=10
Autoimm.       12    16
Crohn          12    16
Tick-borne     6     6
Lung cancer    14    16
Rheum.         13    14

Five-fold cross validation, K=10 results for Rheum. (large, no comparisons available):
- sens: 85; spec: 80; accuracy: 82; 10 clusters; 21.5 h per training run
Experimental Results
Comparisons to CSP [Brinza commun. 4/09, Brinza et al., WABI'06 ppt: http://www.cs.ucsd.edu/~dbrinza/cv/present/brinza_wabi06.ppt]
- Leave-one-out validation
  - For DNF, 20 runs are performed with randomly chosen left-out individuals
  - CSP performs n runs for n individuals (cases+controls)
[Figure: bar charts of Sens., Spec., geometric mean of sens. and spec., accuracy, and run time (ksec) for DNF(k=10) and CSP on Autoimm., Crohn, Tick-borne, and Avg]

Data labels on the charts:
Metric                                 DNF(k=10)   CSP     DNF vs. CSP
Sensitivity                            85          83.1    2.4% relatively
Specificity                            96.6        71.1    36% relatively
Geometric mean of sens. and spec.      90.6        76.8    18% relatively
Accuracy                               90.8        76.6    19% relatively
Run time (ksecs, per leave-out run)    3k          24k     8 times

Average number of clusters (leave-one-out validation):
Autoimm.       18
Crohn          16
Tick-borne     6
Lung cancer    17
Rheum.         14
Experimental Results
Comparing to LINGO (<= 20% from optimal setting)
- The same MIP formulation is solved by LINGO, and we compare the MIP objective function value and run time with DNF:

$$ \text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le g \le G} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, b_x^g \, b_y^g \, d^g(m_{i,j}^{val}) \; + \sum_{1 \le z \le nc} \; \prod_{1 \le i \le G} (1 - M_i(z)) \; g_{mis} $$

- Comparisons are for 1 iteration of bi-partitioning and quad-partitioning (i.e. G=2,4)
[Figure: four bar charts over Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., and Avg, with the data labels listed below]
- Bi-p normalized quality (DNF is 1, the larger the better): 0.96
- Bi-p normalized run time (DNF is 1, smaller is better): 15
- Quad-p normalized quality (DNF is 1): 0.95
- Quad-p normalized run time (DNF is 1, smaller is better): 23
Experimental Results
- Run time vs. number of SNPs
  - Rheumatoid arthritis data set is used
  - Randomly chosen 100, 200, 400, 800, 1600, 2300 SNPs
[Figure: Run time (ksec) vs. # of SNPs (0 to 2500); fitted line y = 8.8x + 3658]
- Run time vs. number of patients
  - Crohn's disease data set is used
  - No patient pre-clustering.
  - Randomly chosen 30, 60, 90, 120, 144 patients from the data set
[Figure: Run time (ksec) vs. # of patients (0 to 160); fitted curve y = 0.65x^2 - 3x + 135]
Conclusions
- We proposed 0/1 non-linear MIP formulations to identify disease markers.
- We consider patient clustering to identify the most appropriate marker sets
- The discretized network flow (DNF) method is used to efficiently solve the MIP formulations.
- A chain structure is used for improving specificity
- Significant improvements compared to MDR and CSP
- Also much faster run times
- Can apply DNF to other computationally challenging bioinfo problems since:
  - DNF can efficiently & near-optimally solve polynomial and Boolean MIPs
  - DNF can also efficiently & near-optimally solve other discrete optimization problems
Appendix: Generating Injection Flow
- First, a complementary injection flow is generated on a complementary arc Ā_k, which is 1 if any mismatched marker for NPz is selected
- A_k and Ā_k are coupled by a draining arc.
- Flow will be drained from A_k, and cause injection flow to the chain
[Figure: injection structure for Cluster k, arcs labeled (cap, cost). The M_{i,j}^{val} nodes that mismatch NPz feed the complementary arc with (1,0) arcs; a draining arc (1,-inf) couples the two arcs; A_k is (1,0) and injects into the M chain; chain arcs (1,C'), (1,C'), (2,C') run from S along the MM chain to T. Two panels: "If there is no flow on Ā_k" and "If there is flow on Ā_k: flow towards A_k is shunted to sink".]