shantanu dutt, yang dai huan ren, joel fontanarosa university of illinois at chicago

Selection of Multiple SNPs in Case-Control Association Study Using a Discretized Network Flow Approach

Shantanu Dutt, Yang Dai

Huan Ren, Joel Fontanarosa

University of Illinois at Chicago

Outline

- Background: Genome Wide Association Study
- Problem Definition
- Previous Work
- Our Work:
  - MIP Formulations
  - Discretized Network Flow (DNF) Opt. Method
  - DNF Solutions for k-SNP Selection w/ Clustering/Classification
  - Experimental Results
- Conclusions

Genetic Association Studies

- Goal: Find markers of variation that reliably distinguish individuals with a disease from a healthy population.
- Single Nucleotide Polymorphisms (SNPs) are the simplest and most common form of variation in the human genome.
- Each chromosome has one of two alleles for each SNP.
- Possible genotypes = {0/0, 0/1, 1/1}
- Variations measured at specific SNP loci have been shown to be associated with numerous traits and diseases.

[Figure: Persons 1, 2 and 3, each shown with chrom 1 and chrom 2 and the SNP position marked]

Genetic Association Studies (contd)

Genomic Variation

Altered Phenotype

- Individual traits (e.g., height, hair color)
- Causal factors for disease
- Increased risk factors for complex disease

Gene, Protein, or Cellular Alteration/Regulation

Images: pdb (ww.rcsb.org); Robbins and Cotran, 7th Ed 2005

Genetic Association Studies (contd)

- Complex traits cannot be mapped to a single genetic locus.
- Multiple interacting genetic influences combine with environmental factors to produce an outcome.

[Figure: gene networks (A, B, ..., X) and environment, leading to disease]

Genetic Association Studies (contd)

Genome Wide Association Study (GWAS):

- Measure a large number of SNPs (typically 500K-1M) across the genome in a large case-control study (often >1000 patients).
- Results are commonly reported based on individual χ² values, ignoring potentially powerful interaction effects.
- It remains an open computational and statistical challenge to reliably analyze epistasis, or gene-gene interactions, in large-scale GWAS.

- Different genetic variations → common complex disease.
- Problem Definition: For a given set P of cases and a set Q of controls, classify the cases into different clusters and simultaneously select k significant marker SNPs for each cluster (those that strongly distinguish the cases in that cluster from the set Q).
- In this paper, we present a new optimization technique called discretized network flow (DNF) for the above problem.

Examples of Epistasis Methods

Combinatorial

- MDR = multifactor dimensionality reduction
- CSP = combinatorial search based prediction
- CPM = combinatorial partitioning method

Probabilistic

- BEAM = Bayesian Epistasis Association Mapping
  - Bayesian partitioning model resolved by Markov Chain Monte Carlo (MCMC) methods

megaSNPhunter

- Hierarchical learning algorithm (regression trees)
- Primarily considers local interaction effects

MDR: Ritchie et al., Gen Epid, 2003; CSP: Brinza et al., WABI’06; CPM: Nelson et al., Genome Research, 2001; BEAM: Zhang and Liu, Nature Genetics, 2007; megaSNPhunter: Wan et al., BMC Bioinformatics, 2009

MDR

1. Divide data into training and testing sets.
2. Select a set of N factors.
3. If (affected/unaffected) > T (e.g., T = 1.0): high risk; otherwise: low risk.
4. Select the model with the best misclassification error.
5-6. Estimate the model prediction error using the testing data set.

Repeat these steps for each cross-validation iteration, and for each possible combination of factors.

Adapted from Ritchie et al., Gen Epid, 2003
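A minimal sketch of MDR steps 2 to 4, under stated assumptions: genotypes are lists of integer codes, labels are booleans (True = affected), and the model is scored on the same data it was built from (the training/testing split and the cross-validation loop are left out). The function names `mdr_cell_labels`, `misclassification_error` and `best_model` are hypothetical and do not come from the MDR software.

```python
from collections import Counter
from itertools import combinations

def mdr_cell_labels(genotypes, labels, factors, threshold=1.0):
    """Label each multi-factor genotype cell high risk (True) or low risk (False)."""
    affected, unaffected = Counter(), Counter()
    for row, is_case in zip(genotypes, labels):
        cell = tuple(row[f] for f in factors)
        (affected if is_case else unaffected)[cell] += 1
    high_risk = {}
    for cell in set(affected) | set(unaffected):
        # affected/unaffected > T, written without division so that cells
        # with cases but no controls count as high risk.
        high_risk[cell] = affected[cell] > threshold * unaffected[cell]
    return high_risk

def misclassification_error(genotypes, labels, factors, high_risk):
    """Fraction of individuals whose cell label disagrees with their status."""
    errors = 0
    for row, is_case in zip(genotypes, labels):
        cell = tuple(row[f] for f in factors)
        predicted_case = high_risk.get(cell, False)  # unseen cells: low risk
        errors += predicted_case != is_case
    return errors / len(labels)

def best_model(genotypes, labels, n_factors, threshold=1.0):
    """Try every combination of n_factors SNPs and keep the lowest error."""
    best = None
    for factors in combinations(range(len(genotypes[0])), n_factors):
        cells = mdr_cell_labels(genotypes, labels, factors, threshold)
        err = misclassification_error(genotypes, labels, factors, cells)
        if best is None or err < best[0]:
            best = (err, factors, cells)
    return best

# Toy data: status depends on the interaction of SNP 0 and SNP 2 only.
toy_genotypes = [[0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1],
                 [0, 0, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0]]
toy_labels = [row[0] != row[2] for row in toy_genotypes]
toy_error, toy_factors, _ = best_model(toy_genotypes, toy_labels, n_factors=2)
```

The exhaustive loop over factor combinations is what makes this kind of search expensive as the number of SNPs grows.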

CSP: Combinatorial Methods for Disease Association Search and Susceptibility Prediction

- Risk/resistance factor: multi-SNP combination (MSC)
- Problem: Find all MSCs significantly associated with the disease
- Cluster C: subset of S with an MSC; S: the original SNP set
  - d(C): # of diseased, h(C): # of non-diseased
- Combinatorial search
  - Definition: The disease-closure of a multi-SNP combination C is a multi-SNP combination C’ with the maximum number of SNPs that covers the same set of diseased individuals and the minimum number of non-diseased individuals.
  - Searches only closed clusters
    - Closure of cluster C = C’: d(C’) = d(C) and h(C’) is minimized
    - Avoids checking of trivial MSCs
    - Small d(C) implies not looking in subclusters
  - Finds associated MSCs faster, but is still too slow
- Tagging:
  - Compress the SNP set by extracting the most informative SNPs
  - Restore other SNPs from tag SNPs
  - Multiple regression method for tagging

Brinza, D., Zelikovsky, WABI’06

Our Work: MIP Formulation

Notations:

- $p_{i,j}(x)$ ($0 \le j \le 2$): $= 1$ if allele j is present on SNP i for individual x; $= 0$ otherwise.
- Marker $m_{i,j}^{val}$ ($val = 0, 1$): $m_{i,j}^{1}$ means presence of allele j in SNP i; $m_{i,j}^{0}$ means absence of allele j in SNP i.

Per-case benefit function of SNP i and allele j (nc is the # of controls):

$$b_{i,j}(x) = \frac{\sum_{z=1}^{nc} \left| p_{i,j}(z) - p_{i,j}(x) \right|}{nc}$$

Claim: $b_{i,j}(x)$ is consistent with the specificity provided by selecting marker $m_{i,j}^{p_{i,j}(x)}$.

- When $p_{i,j}(x) = 1$: higher $b_{i,j}(x)$ ⇒ a lower fraction of non-patients have $p_{i,j} = 1 = p_{i,j}(x)$, i.e., a higher fraction of non-patients have $p_{i,j} = 0 \ne p_{i,j}(x)$.
- When $p_{i,j}(x) = 0$: higher $b_{i,j}(x)$ ⇒ a higher fraction of non-patients have $p_{i,j} = 1 \ne p_{i,j}(x)$.
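As a small illustration, the per-case benefit is the fraction of controls whose indicator for SNP i and allele j differs from that of patient x. The sketch below assumes 0/1 indicators held in plain Python lists; the function name `benefit` is hypothetical.

```python
def benefit(p_x, control_values):
    """Fraction of controls whose 0/1 indicator differs from the patient's.

    p_x            -- the patient's indicator p_{i,j}(x), 0 or 1
    control_values -- the indicators p_{i,j}(z) for every control z
    """
    nc = len(control_values)
    return sum(abs(p_z - p_x) for p_z in control_values) / nc

# One of four controls carries the allele.
controls = [1, 0, 0, 0]
b_present = benefit(1, controls)  # patient carries the allele
b_absent = benefit(0, controls)   # patient does not carry the allele
```

With these toy values the marker "allele present" separates the patient from three of the four controls, while "allele absent" separates the patient from only one.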

MIP Formulation

Benefit-based case-pair similarity metric:

$$s(x, y, m_{i,j}^{val}) = \begin{cases} (b_{i,j}(x))^{\alpha} + (b_{i,j}(y))^{\alpha} & \text{if } p_{i,j}(x) = p_{i,j}(y) = val \\ 0 & \text{otherwise (indicating } m_{i,j}^{val} \text{ is not a common marker for patients } x \text{ and } y\text{)} \end{cases}$$

MIP formulation for selecting one marker set for all patients:

$$\text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, d(m_{i,j}^{val})$$

$$\text{S.T. } \sum_{j=1}^{3} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le 1 \quad \forall \text{ SNPs } i$$

$$\sum_{i,j} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le k$$

• $d(m_{i,j}^{val}) = 1$ if marker $m_{i,j}^{val}$ is selected; np is the # of patients/cases.
• At most k markers will be selected.
• Linear MIP; it can be solved with commercial tools such as CPLEX/LINGO, but this is very time consuming.
• The similarity definition ensures that only common markers among patients will be selected.
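To make the single-marker-set formulation concrete, here is a brute-force sketch for tiny instances. It assumes the similarity of a patient pair is the sum of the two patients' benefits raised to a power alpha when both carry the marker's value, and 0 otherwise. It enumerates marker subsets rather than calling an MIP solver, and the data layout (dicts keyed by `(i, j)`) and all function names are hypothetical.

```python
from itertools import combinations

def benefit(p_x, control_values):
    """Fraction of controls whose indicator differs from the patient's."""
    return sum(abs(p_z - p_x) for p_z in control_values) / len(control_values)

def similarity(b_x, b_y, p_x, p_y, val, alpha=1.0):
    """Pair similarity for marker value val; 0 if the marker is not common."""
    if p_x == p_y == val:
        return b_x ** alpha + b_y ** alpha
    return 0.0

def objective(markers, cases, controls, alpha=1.0):
    """Sum of s(x, y, m) over all ordered patient pairs and selected markers."""
    total = 0.0
    for (i, j, val) in markers:
        ctrl = [z[(i, j)] for z in controls]
        for x in cases:
            for y in cases:
                total += similarity(benefit(x[(i, j)], ctrl),
                                    benefit(y[(i, j)], ctrl),
                                    x[(i, j)], y[(i, j)], val, alpha)
    return total

def best_marker_set(cases, controls, k, alpha=1.0):
    """Exhaustive search: at most k markers, at most one marker per SNP."""
    candidates = [(i, j, val) for (i, j) in sorted(cases[0]) for val in (0, 1)]
    best = (0.0, ())
    for size in range(1, k + 1):
        for subset in combinations(candidates, size):
            snps = [i for (i, _, _) in subset]
            if len(set(snps)) < len(snps):
                continue  # violates the one-marker-per-SNP constraint
            value = objective(subset, cases, controls, alpha)
            if value > best[0]:
                best = (value, subset)
    return best

# Two SNPs, one allele indicator each; both cases share the allele on SNP 0.
toy_cases = [{(0, 0): 1, (1, 0): 0}, {(0, 0): 1, (1, 0): 1}]
toy_controls = [{(0, 0): 0, (1, 0): 0}, {(0, 0): 0, (1, 0): 1}]
value_k1, markers_k1 = best_marker_set(toy_cases, toy_controls, k=1)
```

Enumeration grows exponentially with k and the number of SNPs, which is why a solver or a specialized method is needed for realistic data.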

MIP Formulation (contd)

Issue 1:

- The genetic causes of a disease can differ between patient sets (e.g., patients of different ethnicity).
- Hence, selecting only one marker set is not appropriate (it artificially forces one marker set on the entire patient population).

Solution: Simultaneously cluster patients and select different markers for different clusters.

$$\text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le g \le G} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, b_x^g \, b_y^g \, d^g(m_{i,j}^{val})$$

$$\text{S.T. } \sum_{j=1}^{3} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le 1 \quad \forall \text{ SNPs } i \text{ and clusters } g$$

$$\sum_{1 \le g \le G} b_x^g \le 1 \;\; \forall x, \qquad \sum_{i,j} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le k \;\; \forall g$$

• $b_x^g = 1$ if x is in cluster g; $d^g(m_{i,j}^{val}) = 1$ if marker $m_{i,j}^{val}$ is selected for cluster g.
• At most G clusters will be generated.
• Cubic MIP!

MIP Formulation (contd)

Issue 2: The sum of benefits is not consistent with the specificity of a set of markers.

- Essentially, the previous formulation will select the five common markers with the highest benefit. However, this is not optimal.

[Figure: the control set, with the subsets of controls mismatched by markers 1, 2, 3 and 4]

- Individually, markers 1 and 2 provide higher specificity than markers 3 and 4 (each mismatches more controls).
- However, the mismatch sets of markers 1 and 2 have a larger overlap.
- Selecting markers 3 and 4 as the marker set gives higher overall specificity.
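The overlap effect can be checked with a few lines of set arithmetic. The sketch below uses made-up mismatch sets over 10 controls (illustrative numbers, not data from the study) and assumes a control is excluded as soon as it mismatches at least one selected marker, so the specificity of a marker set is the size of the union of its mismatch sets divided by the number of controls.

```python
def specificity(mismatch_sets, n_controls):
    """Fraction of controls that mismatch at least one selected marker."""
    mismatched = set().union(*mismatch_sets)
    return len(mismatched) / n_controls

n_controls = 10
# Hypothetical mismatch sets: controls that each marker rules out.
marker1 = set(range(0, 6))    # 6 controls, overlaps heavily with marker 2
marker2 = set(range(1, 7))    # 6 controls
marker3 = set(range(0, 5))    # 5 controls, disjoint from marker 4
marker4 = set(range(5, 10))   # 5 controls

sum_12 = len(marker1) + len(marker2)   # what a sum-of-benefits view rewards
sum_34 = len(marker3) + len(marker4)
spec_12 = specificity([marker1, marker2], n_controls)
spec_34 = specificity([marker3, marker4], n_controls)
```

Markers 1 and 2 win on the summed counts, yet markers 3 and 4 together rule out every control, which is the mismatch the objective has to capture.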

MIP Formulation (contd)

Adding accurate specificity terms to the objective function for each control z:

$$\prod_{1 \le i \le G} \left( 1 - M_i(z) \right) g_{mis}$$

- $M_i(z)$: whether control z matches the marker set selected for cluster i; $M_i(z)$ is a Boolean function (OR) of various 0/1 variables.
- $g_{mis}$: objective function gain for mismatching a control.

Final objective function:

$$\text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le g \le G} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, b_x^g \, b_y^g \, d^g(m_{i,j}^{val}) \; + \sum_{1 \le z \le nc} \; \prod_{1 \le i \le G} \left( 1 - M_i(z) \right) g_{mis}$$

- At least cubic MIP (if G <= 3).
- $g_{mis}$ is determined so that specificity and sensitivity are given the same weight.
- Average gain for a patient matching a marker set: $2 k \, b_{avg}^{\alpha} \, (np/G)$, where np is the number of patients and G is the number of groups.
- $g_{mis} = 2 k \, b_{avg}^{\alpha} \, (np/G) \cdot np/nc$
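A short numeric sketch of the gain formulas: the per-patient average gain is scaled by np/nc, so the total gain available from all nc controls equals the total average gain over all np patients. The input values are illustrative only and the function names are hypothetical.

```python
def average_match_gain(k, b_avg, alpha, n_patients, n_groups):
    """Average objective gain for one patient matching its cluster's marker set."""
    return 2 * k * (b_avg ** alpha) * (n_patients / n_groups)

def mismatch_gain(k, b_avg, alpha, n_patients, n_controls, n_groups):
    """g_mis: gain for one mismatched control, rescaled by np / nc."""
    gain = average_match_gain(k, b_avg, alpha, n_patients, n_groups)
    return gain * n_patients / n_controls

# Illustrative values: k = 5 markers, average benefit 0.5, alpha = 1,
# 100 patients, 200 controls, 4 groups.
gain = average_match_gain(5, 0.5, 1.0, 100, 4)
g_mis = mismatch_gain(5, 0.5, 1.0, 100, 200, 4)
total_patient_gain = 100 * gain
total_control_gain = 200 * g_mis
```

With twice as many controls as patients, each mismatched control is worth half of the average per-patient gain.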

Discretized Network Flow (DNF)

Standard min-cost network flow:

- Find a min-cost way to send a certain amount of flow from the source node (S) to the sink node (T).
- Solves certain LP problems (continuous solutions).

[Figure: example flow network from S to T; each arc is labeled (capacity, cost), e.g., (2,0), (1,2), (1,1), (1,4)]

Some discrete constraints have to be satisfied in order to solve discrete optimization problems like MIP.

- One such constraint: mutually exclusive arc set (MEA): at most one arc in this set of arcs can have flow on it.

[Figure: an MEA with incoming flow f=1, showing a valid flow and an invalid flow]

Satisfying MEA requirements:

- Add a flow-amount-independent cost C’ to each arc in the set.
- A constant C’ cost is incurred whenever there is flow on the arc.

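Discretized Network Flow (contd)

[Figure: arc cost vs. flow f up to Cap(e), for the standard linear flow cost (slope c) and for the cost with an added fixed C’]

[Figure: MEA sets with a C’ cost on each arc]

- $C'_{inv}$: total C’-related cost for an invalid flow
- $C'_{val}$: total C’-related cost for a valid flow
- $C'_{inv} \ge C'_{val} + C'$

Determining C’:

1. In the standard network flow graph (w/o discretization constraints), obtain a min-cost flow of cost $C_{inv}^{min}$.
2. Heuristically select a valid flow and determine its cost $C_{val}$.
3. Set $C' = C_{val} - C_{inv}^{min} + 1$.

Since $C'_{inv} \ge C'_{val} + C'$, a valid flow is guaranteed to have a smaller cost than any invalid flow: every invalid flow costs at least $C_{inv}^{min} + C'_{val} + C' = C_{val} + C'_{val} + 1$, which exceeds the cost $C_{val} + C'_{val}$ of the selected valid flow.

[Figure: cost axis without C’ (marking $C_{inv}$, $C_{val}^{min}$, $C_{val}$) and with C’ (marking $C_{val}^{min} + C'_{val}$, $C_{val} + C'_{val}$, $C_{inv} + C'_{inv}$)]

Theorem [Ren et al., ICCAD’08]: A min-cost flow with C’-costs on MEA arcs ensures MEA satisfaction.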
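The rule C’ = Cval − Cinv^min + 1 can be checked on a toy instance. The sketch below assumes three parallel arcs between the same two nodes that form one MEA set, and it enumerates integer flows instead of running a min-cost-flow solver, so it only illustrates the cost argument and is not the DNF algorithm itself. Arc data and function names are hypothetical.

```python
from itertools import product

# (capacity, unit cost) for three parallel arcs forming one MEA set.
ARCS = [(2, 3), (1, 1), (1, 1)]
DEMAND = 2  # units of flow to send across the arc set

def feasible_flows(arcs, demand):
    """All integer flow assignments that respect capacities and meet demand."""
    ranges = [range(cap + 1) for cap, _ in arcs]
    return [f for f in product(*ranges) if sum(f) == demand]

def linear_cost(flow):
    """Standard flow cost: flow amount times unit cost on each arc."""
    return sum(x * cost for x, (_, cost) in zip(flow, ARCS))

def used_arcs(flow):
    return sum(1 for x in flow if x > 0)

def is_valid(flow):
    """MEA condition: at most one arc of the set carries flow."""
    return used_arcs(flow) <= 1

def cost_with_c_prime(flow, c_prime):
    """Linear cost plus a fixed C' for every arc that carries any flow."""
    return linear_cost(flow) + c_prime * used_arcs(flow)

flows = feasible_flows(ARCS, DEMAND)
c_inv_min = min(linear_cost(f) for f in flows)          # ignores the MEA condition
c_val = min(linear_cost(f) for f in flows if is_valid(f))  # cost of a valid flow
c_prime = c_val - c_inv_min + 1
best_flow = min(flows, key=lambda f: cost_with_c_prime(f, c_prime))
```

Without C’ the cheapest flow splits across the two unit-cost arcs and violates the MEA condition; with C’ added, the cheapest flow uses a single arc.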


- Discretized network flow has been applied to VLSI CAD problems [Ren et al., ICCAD’08], [Ren et al., IWLS’08], [Dutt et al., ICCAD’06].
- Good run time and scalability: 10x to 60x faster than CPLEX with similar quality.
- Example: determine optimal cell sizes in a circuit under an area constraint.
  - Four sizes are available. The number of 0/1 variables is about four times the number of cells considered.

[Figure: run time (secs) vs. # of cells considered (0 to 4000), from [Ren et al., IWLS’08]; linear fit y = 0.3823x + 8.5251]

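DNF Model for Single-Cluster Marker Selection

[Figure: flow network from S to T with two levels of patient subgraphs P1 ... Pm, joined by a complete bipartite graph with meta arcs; flow labels f = np*k, f = 1 and f = np]

- Flow through a $p_{i,j}$ node in $P_x$ means $d(m_{i,j}^{p_{i,j}(x)}) = 1$.
- Pairwise connection between $p_{i,j}$ nodes ensures that the same marker set is selected for all $P_x$.
- The flow cost incurred for selecting a common marker between two patients is $-s(x, y, m_{i,j}^{p_{i,j}(x)})$.

[Figure: detail of $P_x$ and $P_y$. An arc (np*k, 0) from S enters $P_x$; SNP nodes S1 ... SN connect to nodes $p_{1,1}$, $p_{1,2}$, $p_{1,3}$, ..., $p_{N,1}$, ..., $p_{N,3}$ over (np, 0) arcs grouped in MEA sets, one of them labeled "MEA: only k arcs can have flow". An arc from a $p_{i,j}$ node in $P_x$ to the corresponding node in $P_y$ has (cap, cost) = $(1, -s(x, y, m_{i,j}^{p_{i,j}(x)}))$ if $p_{i,j}(x) = p_{i,j}(y)$; there is no connection otherwise. The $P_y$ nodes lead to T.]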

Marker Selection for Multiple Clusters

- Use multiple copies of the single-cluster network model.
- For G clusters, the network has G copies of the 2-level complete bipartite graph; not all G clusters need be formed.
- MEA prevents invalid flows.

[Figure: S, choice nodes, two copies (Cluster 1 and Cluster 2) of the 2-level complete bipartite graph over P1 to P4, and T, with MEA sets marked]

- Example valid flow: puts patients {1,4} in cluster 1 and {3,2} in cluster 2.
- Invalid flows (types 1 and 2):
  - Flow puts P1 in both cluster 1 and cluster 2.
  - Flow through P1 passes through a P2 that is not in the same cluster, incurring false costs.

Marker Selection for Multiple Clusters

- Issue: When G is large, the network flow graph becomes very complex.
- We use iterative bi-partitioning instead.
  - This is a much harder bi-partitioning problem than standard bi-partitioning: the bi-partitioning criterion needs to be selected simultaneously with the bi-partition!
- Condition for stopping the bi-partitioning of a cluster: the spec+sens deteriorates.

[Figure: bi-partitioning tree; the clusters that meet the termination condition form the final solution]

Another run-time reduction technique: Patient pre-clustering

- Group patients before using DNF.
- Greedy iterative grouping method:
  - Initially, each patient is a subgroup.
  - Each time, merge the two subgroups with the most common SNP-allele pairs.
  - Termination condition: patients in one group must have at least 70% of SNP-allele pairs in common.
- Each group is taken as a “meta patient” in DNF.
- Groups are opened up after DNF, and metrics are evaluated at the individual level.
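The greedy grouping can be sketched as follows. This is one possible reading, under stated assumptions: each patient is a tuple of allele codes, one per SNP; a SNP-allele pair counts as "common" to a group when every member has the same code at that SNP; and merging stops when the best available merge would fall below the 70% threshold. The function names are hypothetical.

```python
def common_pairs(group, patients):
    """Number of SNP positions at which all patients in the group agree."""
    n_snps = len(patients[0])
    return sum(1 for i in range(n_snps)
               if len({patients[x][i] for x in group}) == 1)

def precluster(patients, min_fraction=0.7):
    """Greedily merge the two subgroups sharing the most SNP-allele pairs."""
    n_snps = len(patients[0])
    groups = [[x] for x in range(len(patients))]  # each patient starts alone
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                shared = common_pairs(groups[a] + groups[b], patients)
                if best is None or shared > best[0]:
                    best = (shared, a, b)
        shared, a, b = best
        if shared / n_snps < min_fraction:
            break  # the best merge would violate the 70% condition
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

# Four toy patients over 10 SNPs: 0 and 1 are near-identical, as are 2 and 3.
toy_patients = [
    (0,) * 10,
    (0,) * 9 + (1,),
    (1,) * 10,
    (1,) * 8 + (0, 0),
]
toy_groups = precluster(toy_patients)
```

Each resulting group would then stand in for its members as a single "meta patient", which shrinks the network the flow method has to handle.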

Chain Structure for Improving Specificity

Objective term for control z:

$$\prod_{\text{Cluster } i} \left( 1 - M_i(z) \right) g_{mis}$$

- One chain structure for each control. Two subchains: the mismatched (MM) chain and the matched (M) chain.
- One injection arc to the M subchain from each cluster: $A_1, \ldots, A_g$. Injection flow on arc $A_i$ means z matches the selected marker set of cluster i ($M_i(z) = 1$).
- Any injection flow causes the MEA condition to force the chain flow into the M chain, and it never switches back. Hence, it incurs 0 cost.
- Chain flow stays on the MM chain if no injection arc has flow, and incurs a cost of $-g_{mis}$.

[Figure: chain structure for control z, running from S to T past Cluster 1, Cluster 2, ..., Cluster g; MM chain (cost = −gmis) and M chain (cost = 0) with MEA sets; injection arcs A1, A2, ..., Ag labeled (cap, cost) = (1,0)]

Experimental Results

Data sets used:

- Crohn’s disease: 144 cases, 243 controls and 103 SNPs
- Autoimmune disorder: 384 cases, 652 controls and 108 SNPs
- Tick-borne encephalitis: 21 cases, 54 controls and 41 SNPs
- Rheumatoid arthritis: 460 cases, 460 controls and 2300 SNPs
- Lung cancer: 322 cases, 273 controls and 141 SNPs
- Rheumatoid arthritis (large): 868 cases, 1194 controls and 5000 SNPs

Machine configuration: 3G CPU, 1G memory, Windows machine.

Prediction scheme with multiple cluster marker sets:

[Figure: Test 1 and Test 2 are checked against marker set 1 and marker set 2. A test individual that mismatches one marker set but matches the other is predicted as sick; one that mismatches both is predicted as healthy]

- TP: correctly predicted as sick
- FP: falsely predicted as sick
- TN: correctly predicted as healthy
- FN: falsely predicted as healthy
- Sensitivity = TP/(FN+TP)
- Specificity = TN/(FP+TN)
- Accuracy = (TN+TP)/(FP+TN+FN+TP)
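The prediction scheme and the three metrics fit in a short sketch. It assumes each individual is a dict mapping `(i, j)` to a 0/1 indicator, each marker is a triple `(i, j, val)`, and an individual is predicted as sick when it matches every marker of at least one cluster's marker set. Names and data layout are hypothetical.

```python
def matches(individual, marker_set):
    """True if the individual carries every marker (i, j, val) in the set."""
    return all(individual[(i, j)] == val for (i, j, val) in marker_set)

def predict_sick(individual, marker_sets):
    """Sick if at least one cluster's marker set is matched, healthy otherwise."""
    return any(matches(individual, ms) for ms in marker_sets)

def evaluate(cases, controls, marker_sets):
    """Sensitivity, specificity and accuracy of the prediction scheme."""
    tp = sum(predict_sick(x, marker_sets) for x in cases)
    fn = len(cases) - tp
    fp = sum(predict_sick(z, marker_sets) for z in controls)
    tn = len(controls) - fp
    return {
        "sensitivity": tp / (fn + tp),
        "specificity": tn / (fp + tn),
        "accuracy": (tn + tp) / (fp + tn + fn + tp),
    }

# Two clusters with one marker each (toy data).
toy_marker_sets = [[(0, 0, 1)], [(1, 0, 1)]]
toy_cases = [
    {(0, 0): 1, (1, 0): 0},  # matches marker set 1 -> TP
    {(0, 0): 0, (1, 0): 1},  # matches marker set 2 -> TP
    {(0, 0): 0, (1, 0): 0},  # matches neither      -> FN
]
toy_controls = [
    {(0, 0): 0, (1, 0): 0},  # matches neither      -> TN
    {(0, 0): 1, (1, 0): 0},  # matches marker set 1 -> FP
]
toy_metrics = evaluate(toy_cases, toy_controls, toy_marker_sets)
```

Adding clusters can only raise sensitivity under this scheme, since each extra marker set is one more way to be predicted sick; the cost shows up as lower specificity.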

Experimental Results

Comparisons to MDR (five-fold cross validation):

[Figure: sensitivity and specificity bar charts for MDR(k=5), DNF(k=5) and DNF(k=10) on Autoimm., Crohn, Tick-borne, Lung cancer, Rheum. and Avg (axes 0 to 120). Value labels on the charts: 87.6, 56.7, 81.9, 48.8, 88.4, 78.1. Annotated relative improvements: 38% and 79%]

# of clusters:

    Data set      K=5   K=10
    Autoimm.      12    16
    Crohn.        12    16
    Tick-borne    6     6
    Lung cancer   14    16
    Rheum         13    14

K=10 results for Rheum. (large, no comparisons available): sens: 85; spec: 80; accuracy: 82; 10 clusters; 21.5 h per training run.

Comparisons to CSP [Brinza commun. 4/09; Brinza et al., WABI’06 ppt: http://www.cs.ucsd.edu/~dbrinza/cv/present/brinza_wabi06.ppt]

Leave-one-out validation:

- For DNF, 20 runs are performed with randomly chosen left-out individuals.
- CSP performs n runs for n individuals (cases + controls).

Average results, DNF(k=10) vs. CSP (bar charts cover Autoimm., Crohn, Tick-borne and Avg):

    Metric                                DNF(k=10)   CSP    Annotation
    Sensitivity                           96.6        71.1   36% relatively
    Specificity                           85          83.1   2.4% relatively
    Geometric mean of sens. and spec.     90.6        76.8   18% relatively
    Accuracy                              90.8        76.6   19% relatively
    Run time (ksecs, per leave-out run)   3           24     8 times

Average number of clusters:

    Autoimm.      18
    Crohn.        16
    Tick-borne    6
    Lung cancer   17
    Rheum         14


Comparing to LINGO (<= 20% from optimal setting)

- The same MIP formulation is solved by LINGO, and we compare the MIP objective function value and run time with DNF.
- Comparisons are for 1 iteration of bi-partitioning and quad-partitioning (i.e., G = 2, 4).

MIP objective:

$$\text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le g \le G} \; \sum_{1 \le x \le np} \; \sum_{1 \le y \le np} s(x, y, m_{i,j}^{val}) \, b_x^g \, b_y^g \, d^g(m_{i,j}^{val}) \; + \sum_{1 \le z \le nc} \; \prod_{1 \le i \le G} \left( 1 - M_i(z) \right) g_{mis}$$

[Figure: four bar charts over Lung cancer, Rheum. and Avg. Bi-p normalized quality (DNF is 1, the larger the better), labeled 0.96; bi-p normalized run time (DNF is 1, smaller is better), labeled 15; quad-p normalized quality (DNF is 1), labeled 0.95; quad-p normalized run time (DNF is 1, smaller is better). Further value labels: 23, 11]

Run time vs. number of SNPs

- Rheumatoid arthritis data set is used.
- Randomly chosen 100, 200, 400, 800, 1600, 2300 SNPs.

[Figure: run time (ksec) vs. # of SNPs (0 to 2500); linear fit y = 8.8x + 3658]

Run time vs. number of patients

- Crohn’s disease data set is used.
- No patient pre-clustering.
- Randomly chosen 30, 60, 90, 120, 144 patients from the data set.

[Figure: run time (ksec) vs. # of patients (0 to 160); quadratic fit y = 0.65x² - 3x + 135]

Conclusions

- We proposed 0/1 non-linear MIP formulations to identify disease markers.
- We consider patient clustering to identify the most appropriate marker sets.
- The discretized network flow (DNF) method is used to efficiently solve the MIP formulations.
- A chain structure is used for improving specificity.
- Significant improvements compared to MDR and CSP.
- Also much faster run times.
- DNF can be applied to other computationally challenging bioinformatics problems since:
  - DNF can efficiently and near-optimally solve polynomial and Boolean MIPs.
  - DNF can also efficiently and near-optimally solve other discrete optimization problems.

Appendix: Generating Injection Flow

- $A_k$ and $\bar{A}_k$ are coupled by a draining arc.
- First, a complementary injection flow is generated on a complementary arc $\bar{A}_k$, which is 1 if any mismatched marker for $NP_z$ is selected.
- If there is no flow on $\bar{A}_k$: flow will be drained from $A_k$ and cause injection flow to the chain.
- If there is flow on $\bar{A}_k$: flow towards $A_k$ is shunted to the sink.

[Figure: cluster k. The $M_{i,j}^{val}$ nodes that mismatch $NP_z$ feed the complementary arc over (1,0) arcs; a draining arc (1,-inf) couples it to $A_k$; $A_k$, with (cap, cost) = (1,0), injects into the M chain; MM chain arcs are labeled (1,C’), (1,C’), (2,C’); flow enters from S and leaves to T]
