Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans

Qunyuan Zhang

Division of Statistical Genomics, Washington University School of Medicine

Multiple Comparison (strategy 1)

Type I error (false positive)

Type II error (false negative)

High power

Low power

P value adjustment/correction (Bonferroni, FDR)

Empirical p value (permutation, bootstrap)
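
The two p-value adjustments named above can be sketched as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, and the FDR adjustment is assumed to be the Benjamini-Hochberg step-up procedure.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]
    print(bonferroni(p))
    print(benjamini_hochberg(p))
```

Bonferroni controls the genome-wise type I error and becomes very strict as the number of SNPs grows; the FDR adjustment is less strict, which is the power trade-off this slide points at.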

Multiple Comparison (strategy 2)

Type I error (false positive)

Type II error (false negative)

Larger sample size

Meta analysis

Biological info or evidence

……

More powerful statistical approach

SMDP: Sequential Multiple Decision Procedure

What is SMDP?

A generalized framework for ranking and selection, using optimum sample sizes

A combination of sequential analysis and multiple hypothesis test

Feature 1 of SMDP: Sequential Analysis

Start from a small sample size $n_0$

Increase sample size, sequential test at each stage: $n_0+1$, $n_0+2$, …, $n_0+i$, …

Stop when stopping rule is satisfied

Feature 2 of SMDP: Multiple Decision

Binary hypothesis test (independent test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are tested one by one (test 1, test 2, test 3, test 4, test 5, test 6, …, test n)

Multiple hypothesis test (simultaneous test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are divided into a signal group and a noise group

Binary Hypothesis Test used by traditional methods

test 1 (SNP1): H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0

test 2 (SNP2): H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0

test 3 (SNP3): ……

test 4 (SNP4): ……

test 5 (SNP5): ……

test 6 (SNP6): ……

……

test n (SNPn): H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0

test-wise error and genome-wise error

multiple testing issue

Multiple Hypothesis Test used by SMDP

SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn

H1: SNP1,2,3 are truly different from the others

H3 ……

H4 ……

H6 ……

……

Hu: SNPn,n-1,n-2 are truly different from the others

Goal: search for the best hypothesis

H: any t SNPs are truly different from the other n-t SNPs

u = number of all possible combinations of t out of n
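
To give a feel for how fast u grows, here is a small sketch using the standard-library binomial coefficient. The helper name and the example values of n and t are illustrative choices, not figures from the talk.

```python
from math import comb


def n_hypotheses(n_snps, t):
    """Number of candidate hypotheses u: ways to choose t signal SNPs out of n."""
    return comb(n_snps, t)


if __name__ == "__main__":
    for n, t in [(10, 3), (1000, 3), (5841, 2)]:
        print(f"n={n}, t={t}: u={n_hypotheses(n, t)}")
```

Even modest t makes u very large for genome-scale n, which is why the full set of hypotheses cannot simply be enumerated.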

General Rule of SMDP (Bechhofer et al., 1968)

Selecting the t best of M K-D populations

Sequential sampling: stages 1, 2, …, h, h+1, …

Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M

U possible combinations of t out of M

For each combination u, the sequential statistic at stage h is the sum over its t members:

$Y^{(t)}_{u,h} = \sum_{i \in u} Y_{i,h}$

Stopping rule: $W_{[U],h} \ge P^*$

Prob. of correct selection (PCS) > P*, whenever D > D*
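
A minimal sketch of the stopping check at one stage, assuming W is the ratio of exp(D* · Y) for the largest combination sum to the total of exp(D* · Y) over all U combination sums. The function name `bks_stop` and the toy numbers are hypothetical, and the brute-force enumeration is only feasible for small M.

```python
import math
from itertools import combinations


def bks_stop(y, t, d_star, p_star):
    """Check the stopping rule at one stage.

    y      : list of Y_{i,h}, one cumulative statistic per population
    t      : number of populations to select
    returns: (stop?, W, indices of the currently best combination)
    """
    sums = [(sum(y[i] for i in combo), combo)
            for combo in combinations(range(len(y)), t)]
    best_sum, best_combo = max(sums)
    # Subtract best_sum inside exp() to avoid overflow; the ratio is unchanged.
    denom = sum(math.exp(d_star * (s - best_sum)) for s, _ in sums)
    w = 1.0 / denom
    return w >= p_star, w, best_combo


if __name__ == "__main__":
    print(bks_stop([5.0, 4.8, 1.0, 0.9, 0.7], t=2, d_star=10.0, p_star=0.95))
    print(bks_stop([1.0, 0.99, 0.98], t=1, d_star=10.0, p_star=0.95))
```

In the first call two populations stand clearly apart, so the rule is satisfied; in the second the three statistics are nearly tied and sampling would continue.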

Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968)

The freq/density function of a K-D population can be written in the form:

f(x)=exp{P(x)Q(θ)+R(x)+S(θ)}

A. The normal density function with unknown mean and known variance;

B. The normal density function with unknown variance and known mean;

C. The exponential density function with unknown scale parameter and known location parameter;

D. The Poisson distribution with unknown mean;

……

The distance between two K-D populations i and j:

$D_{i,j} = |Q(\theta_i) - Q(\theta_j)|$
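
As a worked check, cases A and B of the list above can be written in the K-D form explicitly. The assignment of terms to P, Q, R and S below is one valid choice, derived here rather than taken from the slides.

```latex
\begin{align*}
\text{Case A } (\theta=\mu,\ \sigma^2 \text{ known}):\quad
f(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}
      = \exp\left\{x\cdot\frac{\mu}{\sigma^2}-\frac{x^2}{2\sigma^2}
        -\frac{\mu^2}{2\sigma^2}-\tfrac{1}{2}\ln(2\pi\sigma^2)\right\},\\
&P(x)=x,\quad Q(\theta)=\frac{\mu}{\sigma^2},\quad
 R(x)=-\frac{x^2}{2\sigma^2},\quad
 S(\theta)=-\frac{\mu^2}{2\sigma^2}-\tfrac{1}{2}\ln(2\pi\sigma^2);\\[6pt]
\text{Case B } (\theta=\sigma^2,\ \mu \text{ known}):\quad
f(x) &= \exp\left\{(x-\mu)^2\cdot\left(-\frac{1}{2\sigma^2}\right)
        -\tfrac{1}{2}\ln(2\pi\sigma^2)\right\},\\
&P(x)=(x-\mu)^2,\quad Q(\theta)=-\frac{1}{2\sigma^2},\quad
 R(x)=0,\quad S(\theta)=-\tfrac{1}{2}\ln(2\pi\sigma^2).
\end{align*}
```

In case B the statistic P(x) is a squared deviation, so sums of P(x) over samples are sums of squares.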

Combine SMDP With Regression Model (M.A. Province, 2000, page 319)

Case B: the normal density function with unknown variance and known mean

$Y_{i,h} = \sum_{j \le h} V_{i,j}^2$

SMDP - Regression (M.A. Province, 2000)

Data pairs for a marker: (Z1, X1), (Z2, X2), (Z3, X3), …, (Zh, Xh), (Zh+1, Xh+1), …, (ZN, XN)

Sequential sum of squares of regression residuals

Yi,h denotes Y for marker i at stage h (see slide 7)

$r_{h+1} = Z_{h+1} - (\hat\alpha^{(h)} + \hat\beta^{(h)} X_{h+1})$

$V_{h+1}$: $r_{h+1}$ rescaled by a factor depending on $h$ and on the deviations $(X - \bar{X})$, so that $V_{h+1} \sim N(0, \sigma^2)$
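
A hedged sketch of sequential regression residuals for one marker. The scaling used here is the textbook recursive-residual formula for simple linear regression, sqrt(1 + 1/h + (X_{h+1} - mean)^2 / Sxx); this is an assumption and has not been checked against Province (2000). Function names are hypothetical.

```python
import math


def ols_fit(xs, zs):
    """Least-squares intercept and slope; also returns mean(x) and Sxx."""
    n = len(xs)
    xbar = sum(xs) / n
    zbar = sum(zs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxz = sum((x - xbar) * (z - zbar) for x, z in zip(xs, zs))
    beta = sxz / sxx
    return zbar - beta * xbar, beta, xbar, sxx


def recursive_residuals(xs, zs, n0=2):
    """Standardized one-step-ahead residuals V_{h+1} for h = n0, ..., N-1."""
    v = []
    for h in range(n0, len(xs)):
        if len(set(xs[:h])) < 2:
            continue  # assumption: skip stages where the slope is not estimable
        alpha, beta, xbar, sxx = ols_fit(xs[:h], zs[:h])
        r = zs[h] - (alpha + beta * xs[h])  # prediction error for sample h+1
        scale = math.sqrt(1 + 1 / h + (xs[h] - xbar) ** 2 / sxx)
        v.append(r / scale)
    return v


if __name__ == "__main__":
    x = [0, 1, 2, 3, 4, 5]
    z = [1.0, 2.9, 5.2, 6.8, 9.1, 10.7]
    v = recursive_residuals(x, z)
    print(v, sum(e * e for e in v))  # running sum of squares, as in Y_{i,h}
```

With genotype codes 0/1/2 the first few samples can all share one genotype, which is one way the early-stage instability mentioned later in the talk can arise; the sketch simply skips such stages.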

A Real Data Example (M.A. Province, 2000, page 308)

Simulation Results (M.A. Province, 2000, page 312)

SMDP: Computational Problem

Sequential stage h: U sums of U possible combinations of t out of M. Each sum contains t members of $Y_{i,h}$

$Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h}$

$W_{[U],h} = \dfrac{\exp(D^* Y^{(t)}_{[U],h})}{\sum_{j=1}^{U} \exp(D^* Y^{(t)}_{[j],h})} \ge P^*$

$U = \dfrac{M!}{t!\,(M-t)!}$

Computer time

Simplified Stopping Rule

$Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[S],h} \le Y^{(t)}_{[S+1],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h}$

$W_{[U],h} = \dfrac{\exp(D^* Y^{(t)}_{[U],h})}{(S-1)\exp(D^* Y^{(t)}_{[S],h}) + \sum_{j=S}^{U} \exp(D^* Y^{(t)}_{[j],h})} \ge P^*$

U-S+1 = Top Combination Number (TCN)

TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule:

$Y_{[M-t+1],h} - Y_{[M-t],h} \ge \dfrac{1}{D^*} \ln\left\{ (U-1) \dfrac{P^*}{1-P^*} \right\}$

When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule

How to choose TCN? Balance between computational accuracy and computational time (Zhang & Province, 2005)
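
A minimal sketch of the TCN idea, assuming the simplified rule keeps the TCN largest combination sums exactly and replaces each of the remaining S-1 sums by the S-th largest one, which can only lower W (a conservative bound). Function names are hypothetical. For readability the sketch still enumerates every sum, so it shows the arithmetic rather than the time saving; only the TCN=2 shortcut avoids enumeration.

```python
import math
from itertools import combinations


def simplified_w(y, t, d_star, tcn):
    """Lower bound on W using only the `tcn` largest combination sums."""
    sums = sorted(sum(c) for c in combinations(y, t))  # Y_[1] <= ... <= Y_[U]
    u = len(sums)
    tcn = max(1, min(tcn, u))
    s = u - tcn + 1                 # 1-based index S
    top = sums[s - 1:]              # Y_[S], ..., Y_[U]
    y_max = top[-1]
    exact_part = sum(math.exp(d_star * (v - y_max)) for v in top)
    bound_part = (s - 1) * math.exp(d_star * (top[0] - y_max))
    return 1.0 / (exact_part + bound_part)


def simplest_rule_stop(y, t, d_star, p_star):
    """TCN=2 rule: compare the t-th and (t+1)-th best populations (needs t < M)."""
    ordered = sorted(y)             # Y_[1] <= ... <= Y_[M]
    m = len(ordered)
    u = math.comb(m, t)
    gap = ordered[m - t] - ordered[m - t - 1]   # Y_[M-t+1] - Y_[M-t]
    threshold = math.log((u - 1) * p_star / (1 - p_star)) / d_star
    return gap >= threshold


if __name__ == "__main__":
    y = [5.0, 4.8, 1.0, 0.9, 0.7]
    print(simplified_w(y, 2, 10.0, tcn=2), simplified_w(y, 2, 10.0, tcn=10))
    print(simplest_rule_stop(y, 2, 10.0, 0.95))
```

With TCN equal to U the bound term vanishes and the value equals the original W; smaller TCN gives a smaller W, so stopping can only be delayed, never triggered early.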

Application to Pharmacal Genetics Data

Sample size: cell lines

Genotype: 5841 SNPs

Phenotype: ViabFu7

P*=0.95, D*=10, TCN=10000

72 SNPs, P<0.01

SMDP for GWAS

Some technical/programming problems

1. Computer time (approximation & parallelization)

2. Missing data

3. Stability at early stage

4. Rare SNPs

SMDP can now be run on GWAS data (500K chip, 1000 subjects) within 10 hours on a cluster

Simulation 1

5000 SNPs, 1 true signal

500 replications

Simulation 2: Multiple signals

Genotype data: GAW16 problem 3, 500K SNP data

Phenotype data: simulated LDL (measured at the first visit), ~6500 subjects, 200 replications

Analyses: For each replication, randomly draw 1000 SNPs without true effects and 10 SNPs with minor polygenic effects, and keep all 6 SNPs with relatively major effects, to create a subset of genotypes. Recode the genotypes to 0, 1 and 2 according to the copy number of minor alleles. Apply SMDP to the selected data and repeat the analysis over 200 replications.
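
The recoding step can be sketched as below. The input format (two-letter genotype strings, `None` for missing) and the function name are assumptions for illustration; the GAW16 files themselves may be laid out differently.

```python
from collections import Counter


def recode_minor_allele_count(genotypes):
    """Recode one biallelic SNP to 0/1/2 copies of the minor allele.

    genotypes: list of two-character strings such as "AG"; None marks missing.
    Missing genotypes stay None. If both alleles are equally frequent,
    the allele seen first is treated as the minor one.
    """
    counts = Counter(a for g in genotypes if g for a in g)
    if len(counts) != 2:
        raise ValueError("expected a biallelic, polymorphic marker")
    minor = min(counts, key=counts.get)
    return [g.count(minor) if g else None for g in genotypes]


if __name__ == "__main__":
    print(recode_minor_allele_count(["AA", "AG", "GG", "AA", "AG", None]))
```

The resulting 0/1/2 code is the X variable of an additive regression model for the marker.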

Modified SMDP (analysis procedure)

(1) Start analysis (or experiment) from a small sample size;

(2) Perform multiple decision analysis to simultaneously test whether a group of markers is significant;

(3) Eliminate significant markers from the list (if identified);

(4) Add one or multiple new samples to the data;

(5) Repeat (2), (3), (4) …

(6) Stop the procedure when all samples have been used and no more markers are identified.
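
The control flow of steps (1) to (6) can be sketched as a loop. Everything named here is hypothetical: `decide` stands in for the multiple decision analysis at sample size h and is supplied by the caller, and the toy rule in the example exists only to exercise the loop.

```python
def modified_smdp(markers, n_total, decide, n0, step=1):
    """Run the elimination loop; returns (identified, remaining).

    decide(remaining, h) -> iterable of markers judged significant
                            using the first h samples (placeholder).
    identified is a list of (marker, sample size at identification).
    """
    remaining = list(markers)
    identified = []
    h = min(n0, n_total)                          # step 1
    while remaining:
        flagged = set(decide(remaining, h))       # step 2
        hits = [m for m in remaining if m in flagged]
        identified.extend((m, h) for m in hits)
        remaining = [m for m in remaining if m not in flagged]  # step 3
        if h >= n_total and not hits:             # step 6
            break
        h = min(h + step, n_total)                # step 4, then repeat (step 5)
    return identified, remaining


if __name__ == "__main__":
    # Toy rule: a marker is "identified" once h reaches its threshold.
    needed = {"snpA": 30, "snpB": 80}
    toy = lambda rem, h: [m for m in rem if needed.get(m, float("inf")) <= h]
    print(modified_smdp(["snpA", "snpB", "snpC"], 100, toy, n0=10, step=10))
```

Recording the sample size at which each marker is identified is what makes an average sample number (ASN) available afterwards.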

ROC Curves of SMDP and Regular Regression Analyses

Ar, Br : Regular regression using all samples

As, Bs: SMDP analyses

Ars, Brs: Regular regression using SMDP’s average sample sizes (ASN)

Ar, As and Ars: Analysis of SNPs with major effects;

Br, Bs and Brs: Analysis of SNPs with minor effects.

ASN: the average sample size used in SMDP, presented as proportion of the entire sample size.

Power comparison of SMDP and regular regression (type I error rate = 0.0025)

SNPs with true effects   Simulated h2   SMDP power   ASN*   Power of regular regression using ASN   Validation*
rs7672287                0.003          0.46         4432   0.26                                    0.40
rs1466535                0.002          0.80         4370   0.50                                    0.74
rs901824                 0.001          0.00         NA     NA                                      0.00
rs10910457               0.005          0.74         4509   0.44                                    0.73
rs4648068                0.007          0.04         5550   0.00                                    0.05
rs2294207                0.010          1.00         2077   1.00                                    0.47

*Validation: proportion of significant tests (P<0.05), based on regression using the rest of the samples after SMDP stops.

*ASN: average sample number used in SMDP

Conclusion: given the same sample size, SMDP-regression is more powerful than regular regression.

Application to Real Data

The NHLBI Family Heart Study

Illumina HumanHap550 array data

983 subjects

Coronary Artery Calcification (CAC)

SMDP identifies 69 SNPs using fewer than 811 samples

Traditional regression analysis of all 983 samples identifies:

46122 SNPs (p<0.05)

15 SNPs (FDR<0.05), 11 of them also identified by SMDP

1 SNP (p<0.05/500K), also identified by SMDP

Summary of SMDP (advantages)

Efficient use of sample size; extra samples left after stopping can be used for validation

Simultaneously tests a group of signals; avoids one-by-one tests and p-value adjustment

Increases power (or decreases false positives) given the same average sample size

Flexible experimental design. Extra N

Summary of SMDP (limitations)

Compute time (needs approximation & parallelization)

Requirement of Koopman-Darmois distribution family

SMDP: P*, t, D*

P*: arbitrary, 0.95

t: fixed or varied

D*: indifference zone

Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M

SMDP stopping rule: $W_{[U],h} = \dfrac{\exp(D^* Y^{(t)}_{[U],h})}{\sum_{j=1}^{U} \exp(D^* Y^{(t)}_{[j],h})} \ge P^*$

Prob. of correct selection (PCS) > P* whenever D > D*

Correct selection: populations with Q(θ) > Q(θt)+D* are selected, i.e. those lying above the indifference zone between Q(θt) and Q(θt)+D*

References

R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential identification and ranking procedures. The University of Chicago Press, Chicago.

M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19:301-332.

Q. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. 2005 Proceedings of American Statistical Association, Biometrics section: 463-468.

Application to GWAS


Zhang & Province, 2005, page 467

P*=0.95, D*=10, TCN=10000

72 SNPs, P<0.01

Simplified Stopping Rule (M.A. Province, 2000, pages 321-322)

A Real Data Example (M.A. Province, 2000, page 310)

Simulation Results (2) (M.A. Province, 2000, page 313)


Simplified SMDP (Bechhofer et al., 1968)

How to choose TCN?

Balance between computational accuracy and computational time

Relation of W and t (h=50, D*=10)

Effective Top Combination Number (ETCN)

ETCN Curve

t = ?

SMDP Summary

Advantages:

Test, identify all signals simultaneously, no multiple comparisons

Use “Minimal” N to find significant signals, efficient

Tight control of statistical errors (Type I, II), powerful

Save rest of N for validation, reliable

Further studies:

Computer time

Extension to more methods/models

Extension to non-K-D distributions
