Constructing Gene Association Networks for Complex Human Disorders

Using the BGTA Algorithm

Tian Zheng

Department of Statistics

Columbia University

June 7th, 2008


Acknowledgements

- http://statgene.stat.columbia.edu

- Collaborators
    - Professors Shaw-Hwa Lo and Herman Chernoff
    - Graduate student: Yuejing Ding
    - Our computer specialist: Lei Cong

- The research presented is, in part, supported by NIH and NSF.


Motivation

- Complex traits: “... are caused by multiple genes interacting with each other and with environmental factors to create a gradient of genetic susceptibility to disease.” (Weeks and Lathrop 1995)

- Gene-gene interactions may play a more important role in common human disorders, which has made the identification of disease-predisposing genes less successful.

- Multi-marker information collected in genetic studies provides a way to examine interaction information about a disease.


Gene association networks

- In association mapping, we can consider the association between a pair of genetic loci and the disease outcome.

- If the joint genotype of the two loci is more informative about the disease risk than the individual-locus information, one can say that the interaction of these two loci is associated with the disease.

- Combining such association information, one can construct a network among the identified genetic loci.

- The relevance of such a network to biological interactions still needs to be studied using biological tools.


New Statistics for studying interactions

Interaction

- “Interaction” is a biological term and a statistical term.

- In biology, interaction means a joint action in a molecular or etiological sense.

- In statistics, interaction is usually based on a specific model.

- In our research, we study the joint association between a pair of loci and the disease trait and compare that to the individual association of these two loci. If the former is greater than the latter, we define this as evidence of some “interaction”.


Partition-based measure of influence

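- Consider a set of k SNPs.

- Each SNP has three possible genotypes (A/A, A/B, B/B).

- This creates a partition Π of 3^k elements.

- We can study the joint “influence” (association) of these SNPs on a trait Y using

\[
I_\Pi = \sum_{j \in \Pi} n_j^2 (\bar{Y}_j - \bar{Y})^2
      = n\sigma^2 \sum_{j \in \Pi} \frac{n_j}{n} \left( \frac{\bar{Y}_j - \bar{Y}}{\sigma / \sqrt{n_j}} \right)^2
\]

\[
\frac{I_\Pi}{n\sigma^2} \sim \sum_{j \in \Pi} \frac{n_j}{n} \, \chi^2_1 \quad \text{under the null hypothesis.}
\]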
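The influence score can be computed directly from its definition. The following is a minimal sketch, assuming each subject's joint genotype is given as a tuple of 0/1/2 codes; the function name `influence_score` and the genotype coding are illustrative assumptions, not part of the original method description.

```python
from collections import defaultdict

def influence_score(genotypes, y):
    """Return I_Pi = sum_j n_j^2 * (mean of y in element j - overall mean)^2.

    genotypes: list of per-subject genotype tuples, e.g. (0, 2) for two SNPs
               coded 0/1/2 (the coding is an assumption for illustration).
    y: list of numeric trait values, one per subject.
    """
    if len(genotypes) != len(y) or not y:
        raise ValueError("genotypes and y must be non-empty and equally long")
    y_bar = sum(y) / len(y)
    # Each distinct joint genotype is one element of the partition Pi.
    count = defaultdict(int)
    total = defaultdict(float)
    for g, value in zip(genotypes, y):
        key = tuple(g)
        count[key] += 1
        total[key] += value
    # n_j^2 * (ybar_j - ybar)^2 equals (sum_j - n_j * ybar)^2.
    # Unobserved partition elements have n_j = 0 and contribute nothing.
    return sum((total[k] - count[k] * y_bar) ** 2 for k in count)
```

For example, with one SNP, two subjects of genotype 0 with y = 1 and two of genotype 1 with y = 0, each element contributes 2^2 * 0.5^2 = 1.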


[Figure: simulation results for a partition of 20 elements and 100 observations; some rows are labeled δ = 1.5. Panel columns: (a) distribution of y; distribution of I (shaded histogram, black CDF) vs. asymptotic distribution (empty histogram, red CDF); conditional distribution of I (shaded histogram, black CDF) vs. conditional asymptotic distribution (empty histogram, red CDF); bar plot of the partition element sizes n_i used in the conditional distribution.]


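Backward Genotype-Trait Association method

- In Zheng et al. (2006), we proposed the Genotype-Trait Distortion (GTD) score to measure association between a set of SNPs and the disease status.

- GTD uses the sum of squared differences between the genotype distributions among the cases and the controls.

- This is a special case of the partition-based influence score:

\[
I_\Pi = \sum_{j \in \Pi} n_j^2 (\bar{Y}_j - \bar{Y})^2
      = \sum_{i=1}^{3^m} (n_{d,i} + n_{u,i})^2 \left( \frac{n_{d,i}}{n_{d,i} + n_{u,i}} - \frac{n_d}{n_d + n_u} \right)^2
      \propto \sum_{i=1}^{3^m} \left( \frac{n_{d,i}}{n_d} - \frac{n_{u,i}}{n_u} \right)^2
\]

  Here n_{d,i} and n_{u,i} are the numbers of cases and controls with joint genotype i on the m SNPs, and n_d and n_u are the total numbers of cases and controls. The proportionality constant is (n_d n_u / (n_d + n_u))^2, which does not depend on the SNP set.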


Backward Genotype-Trait Association method

- The GTD score changes when a SNP is removed from a set under evaluation.

- If GTD drops, this SNP is important and possibly interacts with the other SNPs in the set.

- If GTD increases, this SNP is not important given the other SNPs in the set.

- In BGTA, we use backward greedy screening on random subsets of SNPs to screen a large set of candidate SNPs.

- The screening results are saved as return frequencies of each SNP and GTD scores of the irreducible SNP clusters from the backward screening.
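The screening idea above can be sketched in a few lines. This is a simplified illustration under stated assumptions: genotypes are tuples of 0/1/2 codes, status is 1 for cases and 0 for controls, and the elimination rule (drop the SNP whose removal raises the raw GTD most, stop when no removal raises it) is one plain reading of the bullets. The published BGTA procedure may differ in its stopping rule and bias corrections, and all function names are hypothetical.

```python
import random
from collections import Counter

def gtd_score(genotypes, status, snps):
    """GTD on the SNP subset `snps`: sum over joint genotypes i of
    (n_{d,i}/n_d - n_{u,i}/n_u)^2, with status 1 = case, 0 = control."""
    cases = Counter(tuple(g[s] for s in snps)
                    for g, d in zip(genotypes, status) if d == 1)
    controls = Counter(tuple(g[s] for s in snps)
                       for g, d in zip(genotypes, status) if d == 0)
    n_d, n_u = sum(cases.values()), sum(controls.values())
    if n_d == 0 or n_u == 0:
        raise ValueError("need at least one case and one control")
    cells = set(cases) | set(controls)
    # Counter returns 0 for joint genotypes missing from one group.
    return sum((cases[c] / n_d - controls[c] / n_u) ** 2 for c in cells)

def backward_screen(genotypes, status, snps):
    """Greedy backward elimination down to an irreducible SNP set."""
    current = list(snps)
    score = gtd_score(genotypes, status, current)
    while len(current) > 1:
        # Score every one-SNP-smaller subset and keep the best one.
        best_score, best_drop = max(
            (gtd_score(genotypes, status, [s for s in current if s != drop]), drop)
            for drop in current
        )
        if best_score <= score:   # no removal raises GTD: irreducible
            break
        current.remove(best_drop)
        score = best_score
    return current, score

def return_frequencies(genotypes, status, n_snps, subset_size, n_rounds, seed=0):
    """Count how often each SNP survives screening of random SNP subsets."""
    rng = random.Random(seed)
    returns = Counter()
    for _ in range(n_rounds):
        subset = rng.sample(range(n_snps), subset_size)
        kept, _ = backward_screen(genotypes, status, subset)
        returns.update(kept)
    return returns
```

In a toy data set where SNP 0 separates cases from controls and SNP 1 is noise, removing SNP 1 raises GTD, so only SNP 0 is returned.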


Definition of interaction

In our research, we have considered two methods to identify “interactions”:

- Screening return frequencies: SNPs that have interactions associated with the disease status are likely to be “returned” together more often than expected by chance.

- GTD scores: irreducible SNP clusters are local maxima from the backward screening based on random subsets of SNPs. When SNPs in an irreducible set have a high joint GTD score, this is regarded as evidence that they are jointly associated with the disease status, and thus as an “interaction”.


Examples

A simulation example: oligogenic trait with a gene network

Disease model: (a) Tri-locus genotypic risk array

Penetrance        Locus A:   AA             Aa             aa
                  Locus B:   BB  Bb  bb     BB  Bb  bb     BB  Bb  bb
Locus E    EE                n   n   D      n   n   n      n   n   n
           Ee                n   n   D      n   D   D      n   n   n
           ee                n   n   D      n   D   D      D   D   D

[Figure (a): Specified disease model, a network over the disease loci A1, B1, E1 and A2, B2, E2.]


Joint return frequencies (screening is done on 30 SNPs, 6 of which are associated with the disease genes). P-values are given in parentheses.

               group 1                                        group 2
Joint returns  M1→A1          M2→B1          M3→E1            M4→A2            M5→B2        M6→E2
M1→A1          993            253 (<10^-15)  341 (<10^-15)    0                0            1
M2→B1          253 (<10^-15)  823            3                6                1            1
M3→E1          341 (<10^-15)  3              755              144 (3.1×10^-7)  10           1
M4→A2          0              6              144 (3.1×10^-7)  656              80 (0.015)   304 (<10^-15)
M5→B2          0              1              10               80 (0.015)       487          1
M6→E2          1              1              1                304 (<10^-15)    1            841
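One way to turn the joint-return table into a network is to draw an edge for every marker pair whose joint-return p-value is below a cutoff. The sketch below uses the p-values reported in the table, with the pairing of p-values to marker pairs read from the table's symmetric layout; the cutoff `alpha = 0.05`, the use of 1e-15 as a stand-in for "< 10^-15", and the function name `build_network` are assumptions made for illustration only.

```python
# Reported joint-return p-values; 1e-15 stands in for "< 10^-15" (assumption).
P_VALUES = {
    ("M1", "M2"): 1e-15,
    ("M1", "M3"): 1e-15,
    ("M3", "M4"): 3.1e-7,
    ("M4", "M5"): 0.015,
    ("M4", "M6"): 1e-15,
}

def build_network(p_values, alpha=0.05):
    """Undirected adjacency sets with an edge for every pair with p < alpha."""
    adjacency = {}
    for (a, b), p in p_values.items():
        if p < alpha:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency

network = build_network(P_VALUES)
```

With this cutoff, M3 and M4 link the two marker groups; a stricter cutoff such as 0.01 would leave M5 unconnected.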

[Figure (b): Network constructed from data, with nodes M1 to M6.]


Application to Rheumatoid Arthritis I

- Rheumatoid Arthritis (RA) is a heterogeneous disease that exhibits a complex genetic component.

- We studied 349 controls and 474 cases with genotypes on 5407 SNPs throughout the genome.

- We used a two-stage screening for this data set.
    - First stage: use standard BGTA screening and select approximately the top 20% of important markers.
    - Second stage: further screening to identify important marker clusters.

- Significant markers were selected based on FDR estimated using permutations.

- For the 39 identified loci that showed strong association with RA, about 2/3 of which were found in the RA literature, we constructed an association network using association scores.


clusters               count   gtd score      gtd variance
(12, 15, 20)               1   0.0452466485   0.0000076622
(40, 47)                  23   0.0447642803   0.0000057909
(12, 15)                  23   0.0439568423   0.0000086410
(11, 21, 26)               1   0.0415141680   0.0000063145
(2, 28, 42)                1   0.0414631006   0.0000055331
(2, 28, 48)                1   0.0411999494   0.0000049326
(2, 28, 39)                2   0.0410552395   0.0000058928
(4, 19, 49)                1   0.0407363209   0.0000052556
(18, 34, 48, 992)          1   0.0405105566   0.0000067487
(2, 28)                   28   0.0403270020   0.0000059208
(11, 21, 42)               1   0.0389051813   0.0000073572
(11, 13, 31)               1   0.0388380522   0.0000050127
(24, 41)                  30   0.0387972020   0.0000038129
(11, 16, 42)               1   0.0386918428   0.0000067935
(9, 41, 43)                1   0.0386662864   0.0000069116
(11, 13)                  23   0.0385128244   0.0000053157
(5, 6, 27)                 1   0.0384394683   0.0000027612
(18, 48)                  23   0.0383532977   0.0000087086
(17, 38)                  26   0.0383203274   0.0000064260
(39, 40, 636)              1   0.0382886778   0.0000046856
(40, 636)                 17   0.0381770392   0.0000047069
(11, 42)                  32   0.0379699185   0.0000089074
(29, 45, 50)               1   0.0375913449   0.0000055796


Application to Rheumatoid Arthritis II

- A candidate gene study on RA.

- 20 SNPs from 14 candidate genes for RA.

- 839 cases and 855 unrelated controls.

- We evaluated all subsets of the 20 SNPs and identified those that are irreducible in BGTA screening.

- We used 100 permutations to control the family-wise error rate.
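The slides state only that 100 permutations were used to control the family-wise error rate. A common way to do this, shown here as an assumption about the details rather than the authors' procedure, is the max-statistic approach: permute the case/control labels, record the largest score over all tested SNP subsets in each permutation, and use an upper quantile of those maxima as the significance threshold. The names `fwer_threshold` and `score_all` are hypothetical.

```python
import math
import random

def fwer_threshold(status, score_all, n_perm=100, alpha=0.05, seed=0):
    """Permutation threshold for the maximum score over all tested subsets.

    status: list of 0/1 case/control labels.
    score_all: function mapping a label list to a list of scores,
               one per SNP subset under evaluation.
    """
    rng = random.Random(seed)
    labels = list(status)          # copy so the caller's labels are untouched
    max_scores = []
    for _ in range(n_perm):
        rng.shuffle(labels)        # breaks any genotype-trait association
        max_scores.append(max(score_all(labels)))
    max_scores.sort()
    # Index of the order statistic with at least (1 - alpha) of the
    # permutation maxima at or below it; 1e-9 guards against float fuzz.
    k = min(n_perm - 1, math.ceil((1 - alpha) * n_perm - 1e-9) - 1)
    return max_scores[k]
```

An observed subset score above the returned threshold would then be declared significant at family-wise level alpha, under this assumed procedure.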


Current and Future effort

- We are analyzing a breast cancer whole-genome scan.

- We are developing gene-based analysis tools.

- For whole-genome association studies, new methods and computational strategies are needed to accommodate the large number of SNPs.


THANK YOU!
