Post on 13-Dec-2015
Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans
Qunyuan Zhang
Division of Statistical Genomics
Washington University School of Medicine
2
Multiple Comparison (strategy 1)
Type I error: false positive
Type II error: false negative
High power
Low power
P-value adjustment/correction (Bonferroni, FDR)
Empirical p-value (permutation, bootstrap)
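The two adjustments named on this slide can be sketched in a few lines. This is a minimal illustration, not part of the original talk: the function names and the five p-values are made up, and the FDR method shown is the Benjamini-Hochberg step-up procedure, which is assumed here to be the FDR correction meant.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.60]  # made-up p-values for five SNPs
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

Bonferroni controls the genome-wise type I error at the cost of power, while the FDR adjustment is less strict; both still test each SNP separately, which is the situation SMDP tries to avoid.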
3
Multiple Comparison (strategy 2)
Type I error: false positive
Type II error: false negative
Larger sample size
Meta-analysis
Biological information or evidence
……
More powerful statistical approach
SMDP: Sequential Multiple Decision Procedure
4
What is SMDP?
A generalized framework for ranking and selection, using optimum sample sizes
A combination of sequential analysis and multiple hypothesis test
5
Feature 1 of SMDP
Sequential Analysis
Start from a small sample size (n0)
Increase the sample size, with a sequential test at each stage (n0+1, n0+2, …, n0+i, …)
Stop when the stopping rule is satisfied
6
Feature 2 of SMDP
Multiple Decision
Independent test (binary hypothesis test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are tested one at a time (test 1, test 2, test 3, test 4, test 5, test 6, …, test n)
Simultaneous test (multiple hypothesis test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are split into a signal group and a noise group
7
Binary Hypothesis Test used by traditional methods
SNP1: test 1, H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0
SNP2: test 2, H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0
SNP3: test 3 ……
SNP4: test 4 ……
SNP5: test 5 ……
SNP6: test 6 ……
……
SNPn: test n, H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0
Test-wise error and genome-wise error
Multiple testing issue
8
Multiple Hypothesis Test used by SMDP
SNPs: SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn
H1: SNP1, 2, 3 are truly different from the others
H2: SNP1, 2, 4 are truly different from the others
H3 ……
H4 ……
H5: SNP4, 5, 6 are truly different from the others
H6 ……
……
Hu: SNPn, n-1, n-2 are truly different from the others
Goal: find the best hypothesis
H: any t SNPs are truly different from the other (n-t) SNPs
u = number of all possible combinations of t out of n
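To get a feel for how fast the number of candidate hypotheses u grows, the binomial coefficient can be evaluated directly. This is only an illustrative sketch: the helper name `n_hypotheses` and the (n, t) pairs are made up, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def n_hypotheses(n_snps, t):
    """Number of ways to choose the t 'signal' SNPs out of n_snps."""
    return comb(n_snps, t)


# Made-up problem sizes, from a toy example up to a 500K-SNP scan.
for n_snps, t in [(6, 3), (100, 3), (5000, 3), (500_000, 10)]:
    print(f"n={n_snps}, t={t}: u has {len(str(n_hypotheses(n_snps, t)))} digits")
```

Even for a few thousand SNPs, u is far too large to enumerate, which is why the computational shortcuts discussed later in the deck matter.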
9
General Rule of SMDP (Bechhofer et al., 1968)
Selecting the t best of M K-D populations
Sequential sampling: stages 1, 2, …, h, h+1, …
Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M (D is the distance between Pop. t and Pop. t+1)
Observations at stage h: Y1,h, Y2,h, …, Yt,h, …, YM,h
U possible combinations of t out of M:
$$U = \frac{M!}{t!\,(M-t)!}$$
For each combination u:
$$Y^{(t)}_{u,h} = \sum_{k=1}^{t} Y_{i_k,h}$$
Ordered combination sums:
$$Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h}$$
Sequential statistic at stage h:
$$W_{[U],h} = \frac{\exp\left(D^{*} Y^{(t)}_{[U],h}\right)}{\sum_{j=1}^{U} \exp\left(D^{*} Y^{(t)}_{[j],h}\right)}$$
Stopping rule:
$$W_{[U],h} \ge P^{*}$$
Prob. of correct selection (PCS) > P*, whenever D > D*
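For a small number of populations the sequential statistic can be computed by brute force. The sketch below assumes the statistic is read as W = exp(D* × largest combination sum) divided by the sum of exp(D* × combination sum) over all U combinations, with stopping when W ≥ P*. The function name `smdp_statistic` and the Y values are made up for illustration.

```python
from itertools import combinations
from math import exp


def smdp_statistic(y, t, d_star):
    """Return (W, best_subset) for stage-h values y[i] = Y_{i,h}.

    Enumerates all U = C(M, t) combinations, so it is only usable for small M.
    """
    sums = {c: sum(y[i] for i in c) for c in combinations(range(len(y)), t)}
    best = max(sums, key=sums.get)
    top = sums[best]
    # Subtract the largest sum before exponentiating for numerical stability;
    # the ratio is unchanged.
    denom = sum(exp(d_star * (s - top)) for s in sums.values())
    return 1.0 / denom, best


y_h = [9.1, 8.7, 2.0, 1.5, 1.2, 0.9]  # made-up Y_{i,h} for M = 6 populations
w, best = smdp_statistic(y_h, t=2, d_star=1.0)
stop = w >= 0.95                       # stopping rule with P* = 0.95
```

When every population has the same Y, all combinations are equally plausible and W equals 1/U; W approaches 1 only when one combination clearly separates from the rest.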
10
Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968)
The frequency/density function of a K-D population can be written in the form:
f(x)=exp{P(x)Q(θ)+R(x)+S(θ)}
A. The normal density function with unknown mean and known variance;
B. The normal density function with unknown variance and known mean;
C. The exponential density function with unknown scale parameter and known location parameter;
D. The Poisson distribution with unknown mean;
……
The distance between two K-D populations:
$$D_{i,j} = \left| Q(\theta_i) - Q(\theta_j) \right|$$
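As a worked example of the K-D form, case B (normal density with known mean μ and unknown variance σ²) can be rewritten as follows. This derivation is added for illustration and is not taken from the slides; the symbols μ and σ² are the usual normal parameters.

```latex
\begin{aligned}
f(x) &= \frac{1}{\sqrt{2\pi\sigma^{2}}}
        \exp\left\{-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\} \\
     &= \exp\left\{(x-\mu)^{2}\cdot\left(-\frac{1}{2\sigma^{2}}\right)
        + 0 - \tfrac{1}{2}\ln\left(2\pi\sigma^{2}\right)\right\},
\end{aligned}
\qquad
P(x) = (x-\mu)^{2},\quad
Q(\theta) = -\frac{1}{2\sigma^{2}},\quad
R(x) = 0,\quad
S(\theta) = -\tfrac{1}{2}\ln\left(2\pi\sigma^{2}\right).
```

Here Q(θ) increases with σ², so ranking populations by Q(θ) is the same as ranking them by variance.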
11
12
Combine SMDP With Regression Model (M.A. Province, 2000, page 319)
$$Z = \alpha + \beta X + \varepsilon$$
$$r_{h+1} = Z_{h+1} - \left(\hat{\alpha}^{(h)} + \hat{\beta}^{(h)} X_{h+1}\right)$$
$$r_{h+1} \rightarrow V_{h+1} \sim N(0, \sigma^{2})$$
Case B: the normal density function with unknown variance and known mean
$$Y_{i,h} = \sum_{j=1}^{h} V_{i,j}^{2}$$
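13
SMDP - Regression (M.A. Province, 2000)
Data pairs for a marker: (Z1, X1), (Z2, X2), (Z3, X3), …, (Zh, Xh), (Zh+1, Xh+1), …, (ZN, XN)
$$Z = \alpha + \beta X + \varepsilon$$
$$r_{h+1} = Z_{h+1} - \left(\hat{\alpha}^{(h)} + \hat{\beta}^{(h)} X_{h+1}\right)$$
$$V_{h+1} = r_{h+1} \sqrt{\frac{h \sum_{j=1}^{h} \left(X_j - \bar{X}^{(h)}\right)^{2}}{h \sum_{j=1}^{h} \left(X_j - \bar{X}^{(h)}\right)^{2} + \sum_{j=1}^{h} \left(X_j - X_{h+1}\right)^{2}}} \sim N(0, \sigma^{2})$$
Sequential sum of squares of regression residuals:
$$Y_{h+1} = \sum_{j=1}^{h+1} V_{j}^{2}$$
Yi,h denotes Y for marker i at stage h (see slide 7)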
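The sequential sum of squares for one marker can be sketched as below. It assumes the residual for pair h+1 is computed from a simple linear regression fitted on the first h pairs, then scaled by the usual prediction-error factor 1 + 1/h + (X_{h+1} − X̄)²/Σ(X_j − X̄)², so that its variance does not depend on h. The function name, the starting size `n0` and the skipping of stages with no variation in X are assumptions of this sketch, not details from Province (2000).

```python
from math import sqrt


def sequential_sum_of_squares(z, x, n0=3):
    """Return [Y at stage n0+1, n0+2, ...] built from scaled prediction residuals.

    At each stage h >= n0, fit Z = a + b*X on the first h pairs, predict pair
    h+1, and rescale the residual to have constant variance.
    """
    y_path, y = [], 0.0
    for h in range(n0, len(z)):
        xs, zs = x[:h], z[:h]
        x_bar, z_bar = sum(xs) / h, sum(zs) / h
        sxx = sum((xj - x_bar) ** 2 for xj in xs)
        if sxx == 0.0:
            continue  # slope not estimable yet (e.g. all genotypes identical)
        b = sum((xj - x_bar) * (zj - z_bar) for xj, zj in zip(xs, zs)) / sxx
        a = z_bar - b * x_bar
        r = z[h] - (a + b * x[h])                    # residual for pair h+1
        spread = sum((xj - x[h]) ** 2 for xj in xs)  # sum_j (X_j - X_{h+1})^2
        v = r * sqrt(h * sxx / (h * sxx + spread))   # scaled residual V_{h+1}
        y += v * v
        y_path.append(y)
    return y_path


# Made-up genotypes (0/1/2) and phenotypes for one marker.
genotype = [0, 1, 2, 0, 1, 2, 1, 0]
phenotype = [2.1, 4.8, 8.3, 1.7, 5.2, 7.9, 5.1, 2.2]
y_marker = sequential_sum_of_squares(phenotype, genotype)
```

Because each stage adds a squared term, the returned path never decreases; markers whose residual variance differs from the rest are the ones the multiple decision step tries to pick out.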
14
A Real Data Example (M.A. Province, 2000, page 308)
15
Simulation Results (M.A. Province, 2000, page 312)
16
SMDP: Computational Problem
Sequential stages: 1, 2, 3, …, h, h+1, …, N
Observations at stage h: Y1,h, Y2,h, …, Yk,h, Yk+1,h, Yk+2,h, …, YM,h
U sums of U possible combinations of t out of M; each sum contains t members of Yi,h:
$$U = \frac{M!}{t!\,(M-t)!}$$
$$Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h}$$
$$W_{[U],h} = \frac{\exp\left(D^{*} Y^{(t)}_{[U],h}\right)}{\sum_{j=1}^{U} \exp\left(D^{*} Y^{(t)}_{[j],h}\right)} \ge P^{*}$$
Computer time?
17
Simplified Stopping Rule
$$Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[S-1],h} \le Y^{(t)}_{[S],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h}$$
$$W^{[U-S]}_{[U],h} = \frac{\exp\left(D^{*} Y^{(t)}_{[U],h}\right)}{(S-1)\exp\left(D^{*} Y^{(t)}_{[S],h}\right) + \sum_{j=S}^{U} \exp\left(D^{*} Y^{(t)}_{[j],h}\right)} \ge P^{*}$$
$$W^{[1]}_{[U],h} \le W^{[2]}_{[U],h} \le \dots \le W^{[U-2]}_{[U],h} \le W^{[U-1]}_{[U],h} = W_{[U],h}$$
U-S+1 = Top Combination Number (TCN)
TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule:
$$Y_{[M-t+1],h} - Y_{[M-t],h} \ge \frac{1}{D^{*}} \ln\left\{\frac{(U-1)P^{*}}{1-P^{*}}\right\}$$
TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule
How to choose TCN? Balance between computational accuracy and computational time
Zhang & Province, 2005
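The simplest rule (TCN=2) needs only a sort of the individual Y values, not an enumeration of combinations. The sketch below assumes the rule is read as: stop when the gap between the t-th and (t+1)-th largest Y is at least ln{(U−1)P*/(1−P*)}/D*. The function name and the example values are made up.

```python
from math import comb, log


def simplest_stop(y, t, d_star, p_star):
    """TCN = 2 rule: compare the t-th and (t+1)-th largest Y_{i,h}.

    Returns (stop, gap, threshold). Requires 0 < t < len(y).
    """
    m = len(y)
    u = comb(m, t)
    ranked = sorted(y)                       # ascending: ranked[m-1] is the largest
    gap = ranked[m - t] - ranked[m - t - 1]  # Y_[M-t+1],h - Y_[M-t],h (1-based ranks)
    threshold = log((u - 1) * p_star / (1.0 - p_star)) / d_star
    return gap >= threshold, gap, threshold


y_h = [9.1, 8.7, 2.0, 1.5, 1.2, 0.9]  # made-up Y_{i,h} for M = 6 populations
stop, gap, threshold = simplest_stop(y_h, t=2, d_star=1.0, p_star=0.95)
```

The rule is conservative: it treats every combination except the best one as if it were as good as the runner-up, so it can only stop later than the original rule, never earlier.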
18
19
Application to Pharmacal Genetics Data
Sample size: 85 cell lines
Genotype: 5841 SNPs
Phenotype: ViabFu7
P*=0.95, D*=10, TCN=10000
72 SNPs, P<0.01
20
SMDP for GWAS
Some technical/programming problems:
1. Computer time (approximation & parallelization)
2. Missing data
3. Stability at early stage
4. Rare SNPs
SMDP can now be run on GWAS data (500K chip, 1000 subjects) within 10 hours on a cluster
21
Simulation 1
5000 SNPs
1 true signal
500 replications
22
Simulation 2: Multiple signals
Genotype data: GAW16 problem 3, 500K SNP data
Phenotype data: simulated LDL (measured at the first visit), ~6500 subjects, 200 replications
Analyses: For each replication, randomly draw 1000 SNPs without true effects and 10 SNPs with minor polygenic effects, and keep all 6 SNPs with relatively major effects, to create a subset of genotypes. Recode the genotypes as 0, 1 and 2 according to the copy number of minor alleles. Apply SMDP to the selected data and repeat the analysis over the 200 replications.
23
Modified SMDP (analysis procedure)
(1) Start the analysis (or experiment) from a small sample size;
(2) Perform multiple decision analysis to simultaneously test whether a group of markers is significant;
(3) Eliminate significant markers from the list (if identified);
(4) Add one or multiple new samples to the data;
(5) Repeat (2), (3), (4) …
(6) Stop the procedure when all samples have been used and no more markers are identified.
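The control flow of steps (1) to (6) can be sketched as a loop. The multiple decision analysis itself is not specified on this slide, so it is passed in as a hypothetical callback `select_markers(data, remaining)`; that callback, the function name and the parameters `n0` and `step` are assumptions for illustration only.

```python
def modified_smdp(samples, markers, select_markers, n0=10, step=1):
    """Run the loop of steps (1)-(6).

    select_markers(data, remaining) is a hypothetical stand-in for the
    multiple decision analysis; it returns the markers it flags.
    Returns {marker: sample size at which it was flagged}.
    """
    remaining = list(markers)
    found = {}
    n = min(n0, len(samples))                       # (1) start small
    while True:
        flagged = [mk for mk in select_markers(samples[:n], remaining)
                   if mk in remaining]              # (2) test the group
        for mk in flagged:                          # (3) eliminate hits
            found[mk] = n
            remaining.remove(mk)
        if n >= len(samples):
            if not flagged:                         # (6) nothing left to find
                break
        else:
            n = min(n + step, len(samples))         # (4) add new samples
    return found                                    # (5) is the loop itself
```

Recording the sample size at which each marker is flagged makes it easy to see how many samples remain unused, for example for validation.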
24
ROC Curves of SMDP and Regular Regression Analyses
Ar, Br: Regular regression using all samples
As, Bs: SMDP analyses
Ars, Brs: Regular regression using SMDP’s average sample sizes (ASN)
Ar, As and Ars: Analysis of SNPs with major effects;
Br, Bs and Brs: Analysis of SNPs with minor effects.
ASN: the average sample size used in SMDP, presented as a proportion of the entire sample size.
25
Power comparison of SMDP and regular regression (type I error rate = 0.0025)
SNP with true effect   Simulated h2   SMDP power   SMDP ASN   SMDP validation*   Power of regular regression using ASN
rs7672287              0.003          0.46         4432       0.26               0.40
rs1466535              0.002          0.80         4370       0.50               0.74
rs901824               0.001          0.00         NA         NA                 0.00
rs10910457             0.005          0.74         4509       0.44               0.73
rs4648068              0.007          0.04         5550       0.00               0.05
rs2294207              0.010          1.00         2077       1.00               0.47
*Validation: proportion of significant tests (P<0.05), based on regression using the rest of the samples after SMDP stops.
ASN: average sample number used in SMDP.
Conclusion: given the same sample size, SMDP-regression is more powerful than regular regression.
26
Application to Real Data
The NHLBI Family Heart Study
Illumina HumanMap550 array data
983 subjects
Coronary Artery Calcification (CAC)
SMDP identifies 69 SNPs using fewer than 811 samples
Traditional regression analysis of all 983 samples identifies:
46122 SNPs (p<0.05)
15 SNPs (FDR<0.05), 11 of them identified by SMDP
1 SNP (p<0.05/500K), also identified by SMDP
27
Summary of SMDP (advantages)
Efficient use of sample size; extra samples left after stopping can be used for validation
Simultaneously tests a group of signals, avoiding one-by-one tests and p-value adjustment
Increases power (or decreases false positives) given the same average sample size
Flexible experimental design. Extra N
28
Summary of SMDP (limitations)
Compute time (needs approximation & parallelization)
Requirement of Koopman-Darmois distribution family
29
SMDP: P*, t, D*
P*: arbitrary, 0.95
t: fixed or varied
D*: indifference zone
Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M
SMDP stopping rule:
$$W_{[U],h} = \frac{\exp\left(D^{*} Y^{(t)}_{[U],h}\right)}{\sum_{j=1}^{U} \exp\left(D^{*} Y^{(t)}_{[j],h}\right)} \ge P^{*}$$
Prob. of correct selection (PCS) > P*, whenever D > D*
Correct selection: populations with Q(θ) > Q(θt)+D* are selected
Diagram labels: Q(θt), Q(θt)+D*, D*
30
References
R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential Identification and Ranking Procedures. The University of Chicago Press, Chicago.
M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19:301-332.
Q. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. 2005 Proceedings of the American Statistical Association, Biometrics Section: 463-468.
31
Application to GWAS
slide 9
slide 10
33
Zhang & Province, 2005, page 467
P*=0.95, D*=10, TCN=10000
72 SNPs, P<0.01
34
Simplified Stopping Rule (M.A. Province, 2000, pages 321-322)
35
A Real Data Example (M.A. Province, 2000, page 310)
36
Simulation Results (2) (M.A. Province, 2000, page 313)
37
Simplified SMDP (Bechhofer et al., 1968)
$$Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[S-1],h} \le Y^{(t)}_{[S],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h}$$
$$W^{[U-S]}_{[U],h} = \frac{\exp\left(D^{*} Y^{(t)}_{[U],h}\right)}{(S-1)\exp\left(D^{*} Y^{(t)}_{[S],h}\right) + \sum_{j=S}^{U} \exp\left(D^{*} Y^{(t)}_{[j],h}\right)} \ge P^{*}$$
$$W^{[1]}_{[U],h} \le W^{[2]}_{[U],h} \le \dots \le W^{[U-2]}_{[U],h} \le W^{[U-1]}_{[U],h} = W_{[U],h}$$
U-S+1 = Top Combination Number (TCN)
How to choose TCN?
Balance between computational accuracy and computational time
38
Relation of W and t (h=50, D*=10)
Effective Top Combination Number (ETCN)
Zhang & Province, 2005, page 465
39
ETCN Curve
Zhang & Province, 2005, page 466
40
t = ?
Zhang & Province, 2005, page 466
41
SMDP Summary
Advantages:
Test and identify all signals simultaneously, no multiple comparisons
Use “minimal” N to find significant signals, efficient
Tight control of statistical errors (Type I, II), powerful
Save the rest of N for validation, reliable
Further studies:
Computer time
Extension to more methods/models
Extension to non-K-D distributions