Saving the Babies from the Bathwater in Genome-Wide Association Studies
Zhiwu Zhang, Washington State University
http://luckyrobot.com/wp-content/uploads/2013/04/nih-cost-genome.jpg
Human genome
2nd Generation Sequencing
Biomedical Innovations out-paced Computer Innovations
More Research on GWAS and GS
By May 31, 2013
∗ As fast as one season
∗ 50~300 kb resolution
∗ Computing difficulties: millions of markers, individuals, and traits
∗ False positives, e.g. “Amgen scientists tried to replicate 53 high-profile cancer research findings, but could only replicate 6”, Nature, 2012, 483: 531
∗ False negatives
Problems in GWAS
GWAS Stream
[Figure labels: GCTA, PC, PC+K, SELECT, EMMA, FaST-LMM, EMMAX, GEMMA, GenABEL, P3D, MLMM, Q, Q+K, CMLM, FarmCPU, BLINK, ECMLM]
[Figure: tree of maize inbred lines (scale bar 0.1), with groups labeled SWEET, POPCORN, TROPICAL-SUBTROPICAL, STIFF STALK, NON STIFF STALK and MIXED, and including ssp. parviglumis. Based on 89 SSR loci. Individual line labels omitted.]
Flint-Garcia et al. (2005) Plant J. 44: 1054
Population structure
Lines Q1 Q2 Q3
33-16 0.014 0.972 0.014
38-11 0.003 0.993 0.004
4226 0.071 0.917 0.012
4722 0.035 0.854 0.111
A188 0.013 0.982 0.005
A214N 0.762 0.017 0.221
A239 0.035 0.963 0.002
A272 0.019 0.122 0.859
A441-5 0.005 0.531 0.464
A554 0.019 0.979 0.002
A556 0.004 0.994 0.002
A6 0.003 0.03 0.967
A619 0.009 0.99 0.001
GLM for GWAS
Phenotype on individuals
Population structure
General Linear Model (GLM): Y = SNP + Q (or PCs) + e
SNP and Q (or PCs): fixed effects
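The GLM above can be sketched for a single marker as an ordinary least-squares fit. This is only an illustration on simulated data: the function name glm_marker_test, the 0/1/2 genotype coding, the two structure covariates and the simulated effect size are all assumptions, not taken from the slides.

```python
import numpy as np

def glm_marker_test(y, snp, Q):
    """Fit Y = intercept + Q + SNP + e by least squares.
    Returns the estimated SNP effect and its t statistic."""
    n = len(y)
    X = np.column_stack([np.ones(n), Q, snp])    # every term is a fixed effect
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = n - X.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / df                  # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the estimates
    return beta[-1], beta[-1] / np.sqrt(cov[-1, -1])

# Hypothetical simulated data: 200 individuals, 2 structure covariates, 1 SNP
rng = np.random.default_rng(0)
n = 200
Q = rng.normal(size=(n, 2))
snp = rng.integers(0, 3, size=n).astype(float)   # genotypes coded 0/1/2
y = 1.0 * snp + Q @ np.array([1.0, -1.0]) + rng.normal(size=n)

effect, t_stat = glm_marker_test(y, snp, Q)
print(effect, t_stat)
```

In a genome scan the same fit would be repeated for each marker, with Q (or PCs) held fixed as covariates.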
Additive numerator relationship
[Figure: pedigree diagram of individuals A, B, C, D and E]
Individual Father Mother
A
B
C A B
D A C
E D B
  A B C D E
A 1 0 0.5 0.75 0.375
B 0 1 0.5 0.25 0.625
C 0.5 0.5 1 0.75 0.625
D 0.75 0.25 0.75 1.25 0.75
E 0.375 0.625 0.625 0.75 1.125
Diagonals=1+F
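The matrix for this pedigree can be reproduced with the standard tabular method: an individual's relationship with anyone listed earlier is the average of its parents' relationships with that individual, and its diagonal is 1 + F, where F is half the relationship between its parents. The sketch below assumes parents are always listed before their offspring; the function name is illustrative.

```python
def numerator_relationship(pedigree):
    """pedigree: list of (individual, father, mother) with parents listed
    before offspring; None marks an unknown parent.
    Returns (ids, A) where A is the relationship matrix as a list of lists."""
    ids = [ind for ind, _, _ in pedigree]
    idx = {ind: i for i, ind in enumerate(ids)}
    n = len(ids)
    A = [[0.0] * n for _ in range(n)]
    for i, (ind, father, mother) in enumerate(pedigree):
        f = idx.get(father)
        m = idx.get(mother)
        # Off-diagonals: average of the parents' relationships with individual j
        for j in range(i):
            a = 0.0
            if f is not None:
                a += 0.5 * A[j][f]
            if m is not None:
                a += 0.5 * A[j][m]
            A[i][j] = A[j][i] = a
        # Diagonal: 1 + F, with F = half the relationship between the parents
        both_known = f is not None and m is not None
        A[i][i] = 1.0 + (0.5 * A[f][m] if both_known else 0.0)
    return ids, A

pedigree = [("A", None, None), ("B", None, None),
            ("C", "A", "B"), ("D", "A", "C"), ("E", "D", "B")]
ids, A = numerator_relationship(pedigree)
for name, row in zip(ids, A):
    print(name, row)
```

For this pedigree the relationship between B and E comes out as 0.625 in both directions, since the matrix is symmetric, and the inbred individuals D and E get diagonals of 1.25 and 1.125.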
Marker based kinship
∗ Proportion of shared alleles
∗ Average across markers
Marker 1 2 3 4 5 Average
Individual1 AA AA AA BB AB
Individual2 AA AB BB BB AB
Similarity 1 0.5 0 1 0.5 0.6
Maximum similarity: 1
Kinship matrix (re-scaled)
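The similarity row of the marker table can be reproduced with one possible definition of allele sharing: the probability that an allele drawn at random from each individual is the same. That definition is an assumption chosen because it matches the slide's numbers (for example AB vs AB gives 0.5); the slides do not state the formula.

```python
def allele_sharing(g1, g2):
    """Probability that one allele drawn at random from each genotype matches.
    Genotypes are two-letter strings over the alleles A and B."""
    p1 = g1.count("A") / 2    # frequency of allele A within individual 1
    p2 = g2.count("A") / 2    # frequency of allele A within individual 2
    return p1 * p2 + (1 - p1) * (1 - p2)

individual1 = ["AA", "AA", "AA", "BB", "AB"]
individual2 = ["AA", "AB", "BB", "BB", "AB"]

per_marker = [allele_sharing(a, b) for a, b in zip(individual1, individual2)]
similarity = sum(per_marker) / len(per_marker)   # average across markers
print(per_marker, similarity)
```

Repeating this for every pair of individuals fills the kinship matrix, which can then be re-scaled.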
MLM for GWAS
Phenotype
Population structure
Unequal relatedness
General Linear Model (GLM): Y = SNP + Q (or PCs) + e
Mixed Linear Model (MLM): Y = SNP + Q (or PCs) + Kinship + e
SNP and Q (or PCs): fixed effects; Kinship: random effect
(Yu et al. 2005, Nature Genetics)
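A rough sketch of how the MLM can be fitted for one marker: rotate the data by the eigenvectors of the kinship matrix so the observations become independent, pick the residual-to-genetic variance ratio (called delta here) by a grid search on the maximum-likelihood profile, then test the SNP by weighted least squares. This is a generic simplification written for illustration, not the implementation of Yu et al. or of any package named in these slides; the function name, the delta grid and the simulated data are assumptions.

```python
import numpy as np

def mlm_marker_test(y, snp, Q, K, deltas=np.logspace(-3, 3, 61)):
    """Test one SNP in Y = SNP + Q + Kinship + e.
    Returns (SNP effect, t statistic, chosen delta = var(e) / var(kinship))."""
    n = len(y)
    X = np.column_stack([np.ones(n), Q, snp])     # fixed effects
    s, U = np.linalg.eigh(K)                      # K = U diag(s) U'
    s = np.clip(s, 0.0, None)                     # guard against tiny negatives
    yr, Xr = U.T @ y, U.T @ X                     # rotated data
    best = None
    for delta in deltas:
        w = 1.0 / (s + delta)                     # weights after rotation
        XtWX = Xr.T @ (Xr * w[:, None])
        beta = np.linalg.solve(XtWX, Xr.T @ (w * yr))
        r = yr - Xr @ beta
        rss = r @ (w * r)
        # Profile log-likelihood up to an additive constant
        loglik = -0.5 * (n * np.log(rss / n) - np.sum(np.log(w)))
        if best is None or loglik > best[0]:
            best = (loglik, delta, beta, rss, XtWX)
    _, delta, beta, rss, XtWX = best
    sigma2 = rss / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(XtWX)[-1, -1])
    return beta[-1], beta[-1] / se, delta

# Hypothetical simulated data: 150 individuals, 300 markers
rng = np.random.default_rng(1)
n, m = 150, 300
G = rng.integers(0, 3, size=(n, m)).astype(float)
Z = (G - G.mean(axis=0)) / G.std(axis=0)
K = Z @ Z.T / m                                   # marker-based kinship
Q = np.linalg.eigh(K)[1][:, -2:]                  # top two PCs stand in for Q
snp = G[:, 0]
u = Z @ rng.normal(size=m) / np.sqrt(m)           # polygenic effect ~ N(0, K)
y = 1.0 * snp + u + rng.normal(size=n)

effect, t_stat, delta = mlm_marker_test(y, snp, Q, K)
print(effect, t_stat, delta)
```

Setting the kinship term aside (a very large delta) brings the fit back to the GLM, which is one way to see the two models as points on the same scale.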
Atwell et al Nature 2010
a, No correction test
b, Correction with MLM
GWAS does work well for traits associated with structure
Magnus Nordborg
Queen + King
MLM was more enriched on Flowering time genes
Group by kinship
Compressed MLM improves power
[Figure: y axis from 7100 to 7450 (label not recovered); x axis: average number of individuals per group, from 1 to 1024 in doubling steps]
Zhang et al, Nature Genetics, 2010
SA, GC, PCA and QTDT
Henderson’s MLM
GLM (1 group)
Full MLM (n groups)
Pedigree-based kinship
Marker-based kinship
Compressed MLM (s groups)
Sire model
n ≥ s ≥ 1
Unified MLM
Compressed MLM
Compressed MLM is more general
Zhang et al, Nature Genetics, 2010
Enriched Compressed MLM
Kinship: Among individuals -> among groups
1 .25 .125 .125
.25 1 .5 .5
.125 .5 1 .75
.125 .5 .75 1
1 .167
.167 .72
1 .25
.25 1
…
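The step from kinship among individuals to kinship among groups can be sketched by summarizing each block of the individual matrix. The grouping used here (individual 1 alone, individuals 2 to 4 together) and the choice of mean and max as summaries are inferred from the numbers on the slide, not stated in it: the mean reproduces the 1, .167, .72 matrix and the max reproduces the 1, .25, 1 matrix.

```python
import numpy as np

def group_kinship(K, groups, summary=np.mean):
    """Collapse an individual-level kinship matrix to group level.
    groups[i] is the group label of individual i; summary reduces each block."""
    labels = sorted(set(groups))
    groups = np.asarray(groups)
    out = np.empty((len(labels), len(labels)))
    for a, la in enumerate(labels):
        for b, lb in enumerate(labels):
            block = K[np.ix_(groups == la, groups == lb)]
            out[a, b] = summary(block)
    return out

K = np.array([[1.0,   0.25, 0.125, 0.125],
              [0.25,  1.0,  0.5,   0.5],
              [0.125, 0.5,  1.0,   0.75],
              [0.125, 0.5,  0.75,  1.0]])
groups = [0, 1, 1, 1]          # assumed grouping: {1} and {2, 3, 4}

K_mean = group_kinship(K, groups, np.mean)
K_max = group_kinship(K, groups, np.max)
print(np.round(K_mean, 3))
print(K_max)
```

Passing a different summary function (for example np.median or np.min) gives other candidate group kinships from the same grouping.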
Statistical power improvement
Method shift Human Dog Maize Arabidopsis
GLM to MLM 3.6% 13.8% 10.1% 29.6%
MLM to compression 4.0% 14.2% 7.6% 2.5%
Compression to group kinship 6.4% 13.3% 2.9% 2.6%
Meng Li
Li et al, BMC Biology, 2014
Kinship defined by single marker
S1 S2 S3 S4 R1 R2 R3 R4
S1 1 1 1 1 0 0 0 0
S2 1 1 1 1 0 0 0 0
S3 1 1 1 1 0 0 0 0
S4 1 1 1 1 0 0 0 0
R1 0 0 0 0 1 1 1 1
R2 0 0 0 0 1 1 1 1
R3 0 0 0 0 1 1 1 1
R4 0 0 0 0 1 1 1 1
S: Sensitive; R: Resistant
Adding additional markers blurs the picture
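A small sketch of why extra markers blur the picture: kinship built from the single causal marker is exactly the 1/0 block matrix above, while averaging in unrelated markers pulls the within-group values below 1 and the between-group values above 0. The matching-based kinship, the 0/1 coding and the 20 random markers are illustrative assumptions.

```python
import numpy as np

def matching_kinship(M):
    """Fraction of markers at which two individuals carry the same genotype.
    M has one row per individual and one column per marker."""
    same = M[:, None, :] == M[None, :, :]
    return same.mean(axis=2)

# One causal marker: S1..S4 coded 1, R1..R4 coded 0 (coding is arbitrary)
qtn = np.array([1, 1, 1, 1, 0, 0, 0, 0])
K_qtn = matching_kinship(qtn[:, None])

# Add 20 hypothetical markers unrelated to the trait
rng = np.random.default_rng(0)
noise = rng.integers(0, 2, size=(8, 20))
K_all = matching_kinship(np.column_stack([qtn, noise]))

print(K_qtn.astype(int))
print(np.round(K_all, 2))
```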
Derivation of kinship
All SNPs
QTNs
Non-QTN SNPs
Kinship
Statistical power of kinship from
Single trait
All traits
Remove QTN one at a time
Kinship evolution
Statistical power of kinship from
Bin approach
Mimic QTN-1
∗ 1. Choose t associated SNPs as QTNs, each representing an interval of size s.
∗ 2. Build kinship from the t QTNs
∗ 3. Optimization on t and s
∗ 4. For a SNP, remove the QTNs in LD with the SNP, e.g. R square > 1%
∗ 5. Use the remaining QTNs to build kinship for testing the SNP
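Steps 1, 2, 4 and 5 above can be sketched as follows. This is a toy illustration, not the SUPER implementation: the function names, the centered cross-product kinship, the per-bin selection rule and the tiny genotype matrix are assumptions, and step 3 (optimizing t and s) is left out.

```python
import numpy as np

def select_pseudo_qtns(pos, pvals, bin_size, t):
    """Step 1: keep the most significant SNP in each bin of width bin_size,
    then the t most significant of those bin representatives."""
    best = {}
    for i, (p, pv) in enumerate(zip(pos, pvals)):
        b = p // bin_size
        if b not in best or pv < pvals[best[b]]:
            best[b] = i
    reps = sorted(best.values(), key=lambda i: pvals[i])
    return reps[:t]

def qtns_for_snp(G, snp_idx, qtn_idx, r2_max=0.01):
    """Step 4: drop pseudo-QTNs whose squared correlation with the tested
    SNP exceeds r2_max (1% by default)."""
    keep = []
    for q in qtn_idx:
        r = np.corrcoef(G[:, snp_idx], G[:, q])[0, 1]
        if r * r <= r2_max:
            keep.append(q)
    return keep

def kinship_from(G, idx):
    """Steps 2 and 5: kinship from the chosen marker columns."""
    Z = G[:, idx] - G[:, idx].mean(axis=0)
    return Z @ Z.T / len(idx)

# Toy data: 6 individuals x 5 SNPs, with positions and p-values from a first scan
G = np.array([[0, 0, 0, 2, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 2, 0, 0],
              [1, 1, 0, 2, 2],
              [2, 2, 1, 1, 1],
              [2, 2, 2, 0, 1]], dtype=float)
pos = [10, 20, 110, 120, 210]
pvals = [0.5, 0.01, 0.2, 0.3, 0.001]

qtns = select_pseudo_qtns(pos, pvals, bin_size=100, t=2)   # SNPs 4 and 1
remaining = qtns_for_snp(G, snp_idx=0, qtn_idx=qtns)       # SNP 1 is in LD with SNP 0
K = kinship_from(G, remaining)
print(qtns, remaining, K.shape)
```

Because the kinship is rebuilt for each tested SNP from pseudo-QTNs that are not in LD with it, the SNP is not competing with a kinship that already contains its own signal.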
Statistical power of kinship from
Qishan Wang
PLoS One, 2014
SUPER (Settlement of kinship Under Progressively Exclusive Relationship)
Usage of Software Packages
Software  Leading authors                         Corresponding authors                   Language  Released  Citation
PUMA      Gabriel E. Hoffman                      Jason G. Mezey                          C++       2013      8
TATES     Sophie van der Sluis                    Sophie van der Sluis                    Fortran   2013      20
GAPIT     Lipka AE                                Zhang Z                                 R         2012      106
MLMM      Vincent S                               Nordborg M                              R/python  2012      69
GEMMA     Zhou X                                  Stephens M                              C++       2012      88
FastLMM   Christoph L, Listgarten J, Heckerman D  Christoph L, Listgarten J, Heckerman D  C++       2011      104
Qxpak     M. Pérez-Enciso                         M. Pérez-Enciso                         Fortran   2004      141
EMMAX     Kang HM                                 Sabatti C & Eskin E                     C++       2010      349
GCTA      Jian Y                                  Jian Y                                  C++       2011      380
GenABEL   Aulchenko YS                            Aulchenko YS                            R         2007      510
TASSEL    Bradbury PJ, Zhang Z, Kroon DE          Bradbury PJ                             Java      2006      660
PLINK     Purcell S                               Purcell S                               C++       2007      7037 (75%)
Why do human geneticists not go beyond PLINK?
Adjustment on marker
si: Testing marker
Q: Population structure
K: Kinship
S: Pseudo QTNs
Adjustment on covariates
GWAS Model Development
It is time for human geneticists to move forward
Xiaolei Liu
Re-analysis of Arabidopsis data
Xiaolei Liu
Associations on flowering time
Flowering time genes enriched
FarmCPU is computationally efficient
Testing 60K SNPs
Half a million individuals, half a million SNPs: three days
But the new version of PLINK is faster
ZZLab.Net
http://zzlab.net/GAPIT/data
BLINK is super fast
Testing half a million SNPs
BLINK is 5X faster than PLINK, 200X faster than FarmCPU
[Figure legend: FarmCPU, BLINK, PLINK]
[Figure: power (y axis) against fdr.mean[, 1] (x axis, 0.0 to 1.0) for PLINK, GAPIT, BLINK and FarmCPU]
http://zzlab.net/GAPIT/data/Workshop_Iowa.R
t, F, χ²…
GLM
MLM
CMLM
Uncorrelated or equally correlated
Ladder for high hanging fruits
ECMLM
Structure, EIGENSTRAT
PLINK
EMMA, EMMAX, GCAT, GenABEL
TASSEL, GAPIT
GAPIT
Acknowledgements
Meng Huang, Xiaolei Liu, Meng Li, Alex Lipka
ECMLM
Xiahui Yuan