introduction to mendelian randomization...zhiguang huo department of biostatistics university of...

37
Introduction to Mendelian Randomization Zhiguang Huo Department of Biostatistics University of Florida Febuary 26 th , 2020 1 / 34

Upload: others

Post on 13-Feb-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

  • Introduction to Mendelian Randomization

    Zhiguang Huo

    Department of BiostatisticsUniversity of Florida

    Febuary 26th, 2020

    1 / 34

  • What is Mendelian randomization (MR)

    I Use inherited genetic variants to infer causal relationship of anexposure and an outcome.

    2 / 34

  • Outline

    I Motivation

    I Concepts and goals

    I Assumption

    I Mathematical theory

    I Model diagnostic

    I Method

    I One sample MRI Two sample MR

    3 / 34

  • Motivating example

    I Goal: investigate the effect of alcohol consumption on bloodpressure.

    I Observational studies have shown higher alcohol consumptionwas associated with higher blood pressure (Marmot et al.,1994; Fuchs et al., 2001).

    I This association could not imply causal effect because ofconfounders.

    I Smoking increases alcohol assumption.I Smoking increases blood pressure.

    4 / 34

  • Motivation

    I Observational study is very popular in biomedical studies.

    I The association from observational study does not implycausality, because of confounding variables.

    I For known confounders, we can adjust them as covariates.I For unknown confounders:

    I Randomized clinical trials (RCT)

    5 / 34

  • Randomized clinical trials (RCT)

    I RCT will lead to causal relationships between alcoholconsumption and blood pressure.

    I Drawbacks:I Time consumingI Loss of follow-up participants

    6 / 34

  • Idea of Mendelian randomization

    I Alcohol is initially metabolised to acetaldehyde, which can befurther eliminated (Davies et al., 2018).

    I The major enzyme for this elimination is alcoholdehydrogenase 2 (ALDH2).

    I A variant in the ALDH2 gene (Chen et al., 2008) (rs671,reference allele G)

    I Alternative allele A was found in east Asian populationI Causes a facial flush response and slows the metabolism of

    acetaldehyde

    I In a study of 4,057 participants (Takagi et al., 2001)I Those with two copies of A drank an average of 1.1 g of

    alcohol.I Those with no copies of A allele drank 23.7 g.

    7 / 34

  • Ideas behind MR

    I The genetic variants are inherited from parentsI This genetic variant is not affected by confounding variables

    (smoking)I This genetic variant is not affected by blood pressure level.

    I The genetic variant can define groups of different level ofalcohol consumption

    I If allele A non-carriers drank heavy, and had higher bloodpressure

    I If allele A carriers drank light, and had lower blood pressureI Genetic variants can be thought of random allocation

    I Then we can conclude the effect of alcohol consumption onblood pressure is causal.

    I The relationship is not likely to be confounded.I It is not likely that blood pressure causes alcohol consumption

    (reverse caution)

    8 / 34

  • Conceptual analogy between MR and randomized clinicaltrials (RCT)

    9 / 34

  • Mendelian randomization (MR)

    I Idea: If we cannot randomize the exposure, we can find arandomized instrumental variable to disentangle.

    I Genetic variants are also referred as instrumental variables

    I The original idea was first proposed as economy field, whichalso called instrument variable (IV) regression.

    Goals of MR studies:

    1. Test the existence of the causal relationship between theexposure variable and the outcome variable.

    2. Estimate the magnitude of the causal effect of the exposurevariable on the outcome variable.

    10 / 34

  • Notations

    I G: genetic variant

    I Y: outcome variable

    I X: exposure variable

    I U: unknown confounders

    11 / 34

  • Three core assumptions for hypothesis testing

    G: genetic variant; Y: outcome variable; X: exposure variable; U:unknown confounders

    1. Independence between G and U

    G ⊥ U

    2. Established association between G and X

    P(X |G ) 6= P(X )

    3. No alternative pathway from G to Y , (exclusion restriction)

    G ⊥ Y |X ,U

    I Theorem: testing G − Y association is equivalent to testingcausal relationship Y − X .

    12 / 34

  • Testing causal relationship (Didelez et al., 2010)

    P(Y ,G ) =

    ∫U

    ∫XP(Y ,X ,U,G )

    =

    ∫U

    ∫XP(Y |X ,U)P(X |G ,U)P(U)P(G )

    = P(G )

    ∫UP(U)

    ∫XP(Y |X ,U)P(X |G ,U)

    If Y ⊥ X |U, i.e., P(Y |X ,U) = P(Y |U),

    P(Y ,G ) = P(G )

    ∫UP(U)P(Y |U)

    ∫XP(X |G ,U)

    = P(G )P(Y )

    I Therefore, Y ⊥ X |U → Y ⊥ GI Under the pre-mentioned assumptions, we only need to test

    whether Y and G are independent, in order to establish causalrelationship between X and Y

    13 / 34

  • Estimating causal effect in linear models

    I Two more assumptions for linear regression:I The effect of X on Y is linear.I No interaction between X and U.

    I Suppose data generating models are

    X = α0 + α1G + α2U + ε1

    Y = β0 + β1X + β2U + ε2

    I We can obtain the following relationship

    E(X |G ) = α0 + α1GE(Y |G ) = θ0 + θ1G

    14 / 34

  • IV estimators are essentially ratio estimators

    I Since

    θ1 = E[Y |G = g + 1]− E[Y |G = g ]= β1(E[X |g + 1]− E[X |g ]) + β2(E[U|g + 1]− E[U|g ])= β1α1

    I Therefore, β1 = θ1/α1I When X ∈ R, G ∈ R (one exposure variable and one

    instrument variable), the IV estimator can be written as theratio of two OLS estimator

    β̂IV =θ̂1α̂1

    I The se of β̂IV can be determined by delta method (Wald,1940).

    15 / 34

  • Instrumental Variable estimation in linear modelsI Suppose G ∈ Rn×l and X ∈ Rn×p have same dimension (i.e.,

    p = l , both may contain intercept), and confounder U isabsorbed in the error ε

    Y = Xβ + εI The usual OLS does not give unbiased estimation for

    unconfounded effect, because X and ε are correlated.

    X>Y = X>Xβ + X>εI If the instrument G is independent of error ε

    G>Y = G>Xβ + G>ε

    β̂IV = (G>X )−1G>Y

    √n(β̂IV − β) ∼ N(0, σ2Q−1GXQGGQ

    −1XG ),

    where QGX = limn→∞

    G>Xn , QGG = limn→∞

    G>Gn

    16 / 34

  • Connection with the ratio estimator

    Suppose X = (1,X ) ∈ Rn×2, G = (1, g) ∈ Rn×2

    β̂IV = (β̂0, β̂IV )>

    = (G>X )−1G>Y

    = (G>X )−1(G>G )(G>G )−1G>Y

    = {(G>G )−1(G>X )}−1{(G>G )−1G>Y }

    It can be verified that

    β̂IV =θ̂1α̂1,

    where θ1 is the slope of regressing Y on g , α1 is the slope ofregressing X on g .

    17 / 34

  • Generalized methods of momentWhat if G ∈ Rn×I has more dimension than X ∈ Rn×p (i.e.,I > p), more equations than the number of parameters.

    gn(β) =1

    nG>(Y − Xβ)

    I If I = p, we could obtain an estimate of β by settinggn(β) = 0

    I More generally, for some positive matrix W ∈ RI×I , let

    Jn(β) = ngn(β)>W ngn(β)

    I The goal is to set Jn(β) close to zero.

    βGMM = arg min Jn(β)

    = {(X>G )W n(G>X )}−1(X>G )W n(G>Y )

    I The scale of W n does not change βGMM18 / 34

  • Optimal W n

    I It can be proved that when W n = ( 1nG>G σ̂2)−1, βGMM is

    optimal.

    I

    βGMM = {(X>G )(G>G )>(G>X )}−1(X>G )(G>G )>(G>Y )

    I The asymptotic distribution

    √n(β̂GMM − β) ∼ N(0, σ2Q−1GXQGGQ

    −1XG ),

    I In the economics literature, this is also referred as two-stageleast squares (2SLS) estimator, or instrumental variableestimator (IV)

    βIV = {(X>G )(G>G )>(G>X )}−1(X>G )(G>G )>(G>Y )

    19 / 34

  • Two-stage least squares (2SLS) estimator

    I 2SLS estimator

    βIV = {(X>G )(G>G )>(G>X )}−1(X>G )(G>G )>(G>Y )

    √n(β̂IV − β) ∼ N(0, σ2Q−1GXQGGQ

    −1XG ),

    I computationally simple and stable

    1. Compute X̂ (i.e., regress X on G , obtain fitted value)

    X̂ = G (G>G )−1GX

    2. Then regress Y on X̂

    βIV = (X̂>X̂ )−1X̂

    >Y

    = {(X>G )(G>G )>(G>X )}−1(X>G )(G>G )>(G>Y )

    20 / 34

  • Intuition behind 2SLS

    I Use instrumental variables (genetic variants) to exact thevariation of the exposure variable (X) that is independent ofconfounding variables

    X̂ = G (G>G )−1GX

    I Use this part of variation to estimate the causal effect.

    βIV = (X̂>X̂ )−1X̂

    >Y

    21 / 34

  • Caution about assumptions1. Independence between G and U, (usually untestable)

    I This is the assumption that introduce the “randomization”2. Known association between G and X (testable)

    I Weak genetic instrument can lead to poor estimation of causaleffect

    I Intuition: large variability in α̂1 will lead to large variability inβ̂IV .

    β̂IV =θ̂1α̂1,

    3. No other pathway from G to Y other than through X(exclusion restriction)

    I Pleiotropy

    22 / 34

  • Relaxed assumptions: adjust for known confoundersSuppose there is a set of known confounders W (populationstratification, demographic/behavioral/socio-economical factor),denote U to be unknown confounders

    1. G ⊥ U|W2. G correlates with X |W3. G ⊥ Y |X ,U,W

    I Testing Y ⊥ X |W ,U is equivalent to testing Y ⊥ G |W .I In linear models, β1 = θ1/α1 still holds

    E(Y |X ,W ,U) = β0 + β1X + β2W + β3UE(X |G ,W ) = α0 + α1G + α2WE(Y |G ,W ) = θ0 + θ1G + θ2W

    All the previous math works!23 / 34

  • Model diagnostic

    I Independence between G and UI None

    I Validity of the instrumentsI F-statistics (> 10) is the rule of thumb

    I PleitropyI Sargan’s testI J-statistics

    I Equavelence between βIV and βOLS .I Durbin-Wu-Hausman test

    24 / 34

  • Weak instrument variable

    I Evaluate the validity of the instrument variable by fitting thefollowing:

    X̂ = G (G>G )−1GX

    I Goodness of modeling fitting is assessed by F-statisticsI F-statistics > 10 indicates strong instrument variableI F-statistics < 10 indicates weak instrument variable, which can

    cause biased causal effect

    25 / 34

  • Overidentifying restrictions and Sargan’s test

    We can detect pleiotropy and the validity of IV if

    I The number of IVs (I ) is more than the number of causaleffects (p) to be estimated; not all I equations can be exactlyzero

    I The null hypothesis is G ⊥ (Y − Xβ)I Instrument is orthogonal to the error termI There is no direct effect left once conditional on X

    I Sargan’s test (Sargan, 1958; Small, 2007) for 2SLS for Iinstrumental variables and p = 1 causal effect :

    {G (Y−θ̂2SLSX )}>{σ̂2G>G}−1{G (Y−θ̂2SLSX )} → χ2(l−1)

    under the null that all instruments are valid.

    26 / 34

  • J-statistics

    Hansen (1982) gave general results

    Jn(β) = ngn(β)>Ŵ ngn(β)→ χ2(l − p)

    as long as Ŵ n converges to the optimal W 0 and β is efficientGMM estimator.

    I Large J-statistic will reject null hypothesis so that at least oneinstrument might be invalid

    27 / 34

  • Test the equality of IV estimator and OLS estimator

    The null hypothesis is OLS is consistent and fully efficient

    I If there is no unmeasured confounders, OLS estimator will beconsistent and efficient; IV is consistent under null oralternative

    I Large discrepancy between β̂OLS and β̂IV suggests that thereis confounding and OLS cannot be trusted.

    I Durbin-Wu-Hausman test (Hausman, 1978)

    (β̂IV − β̂OLS)>D−1(β̂IV − β̂OLS)→ χ2(p),

    where D = Var(β̂IV )− Var(β̂OLS)

    28 / 34

  • One sample MR

    I If we have everything in the same study (so-called one sample)

    I Instrument variable (genetic variants)I Exposure variableI Outcome variable

    I We could apply 2SLS to examine the causal effect of theexposure variable on the outcome variable

    I 2SLS has been implemented in the ivreg function in Rpackage AER

    29 / 34

  • Two sample MR

    I One sample MR with 2SLS works great.I Sometime, it is hard to have everything within the same study

    I Instrument variable (genetic variants)I Exposure variableI Outcome variable

    I We could apply two sample MR method.

    β̂IV =θ̂1α̂1

    I As long as we know the association between

    1. Instrument variable and the Exposure variable α̂12. Instrument variable and the Outcome variable θ̂1

    I The causal effect could be estimated

    30 / 34

  • Resources for instrumental variables

    I Summary statistics of genome-wide association study(GWAS):

    I GWAS catalog: https://www.ebi.ac.uk/gwas/I UK Biobank: https://docs.google.com/spreadsheets/d/

    1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESU

    I Web application for two sample MR

    I MR-base http://app.mrbase.org/

    31 / 34

    https://www.ebi.ac.uk/gwas/https://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESUhttps://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESUhttp://app.mrbase.org/

  • Two sample MRMethods:

    I Single instrument variable: Wald method.

    β̂IV =β̂1α̂1

    I Single SNPI Ploygeneic risk score (PRS): summarize of multiple SNPs

    I Multiple instrument variables: inverse-variance weighted(IVW) linear regression

    See Hemani et al. (2018) for more details.32 / 34

  • Summary

    I MR is an effect way to establish causal relationship.I Goals:

    1. Test existence of causal effect.2. Estimate the strength of the causal effect.

    I Three core assumptions:

    1. Independence between G and U2. Established association between G and X3. No alternative pathway from G to Y , (exclusion restriction)

    I Methods:

    1. One sample MR2. Two sample MR

    I Model diagnostics.

    33 / 34

  • Reference

    I Major reference:

    I https://research.fhcrc.org/content/dam/stripe/hsu/

    files/IV_Mendelian_lecture_1.pdf

    I Other references:

    I See next page

    34 / 34

    https://research.fhcrc.org/content/dam/stripe/hsu/files/IV_Mendelian_lecture_1.pdfhttps://research.fhcrc.org/content/dam/stripe/hsu/files/IV_Mendelian_lecture_1.pdf

  • Chen, L., Smith, G. D., Harbord, R. M., and Lewis, S. J. (2008).Alcohol intake and blood pressure: a systematic reviewimplementing a mendelian randomization approach. PLoSmedicine, 5(3).

    Davies, N. M., Holmes, M. V., and Smith, G. D. (2018). Readingmendelian randomisation studies: a guide, glossary, and checklistfor clinicians. Bmj, 362:k601.

    Didelez, V., Meng, S., Sheehan, N. A., et al. (2010). Assumptionsof iv methods for observational epidemiology. Statistical Science,25(1):22–40.

    Fuchs, F. D., Chambless, L. E., Whelton, P. K., Nieto, F. J., andHeiss, G. (2001). Alcohol consumption and the incidence ofhypertension: The atherosclerosis risk in communities study.Hypertension, 37(5):1242–1250.

    Hansen, L. P. (1982). Large sample properties of generalizedmethod of moments estimators. Econometrica: Journal of theEconometric Society, pages 1029–1054.

    Hausman, J. A. (1978). Specification tests in econometrics.34 / 34

  • Econometrica: Journal of the econometric society, pages1251–1271.

    Hemani, G., Zheng, J., Elsworth, B., Wade, K. H., Haberland, V.,Baird, D., Laurin, C., Burgess, S., Bowden, J., Langdon, R.,et al. (2018). The mr-base platform supports systematic causalinference across the human phenome. Elife, 7:e34408.

    Marmot, M. G., Elliott, P., Shipley, M. J., Dyer, A. R., Ueshima,H., Beevers, D. G., Stamler, R., Kesteloot, H., Rose, G., andStamler, J. (1994). Alcohol and blood pressure: the intersaltstudy. Bmj, 308(6939):1263–1267.

    Sargan, J. D. (1958). The estimation of economic relationshipsusing instrumental variables. Econometrica: Journal of theEconometric Society, pages 393–415.

    Small, D. S. (2007). Sensitivity analysis for instrumental variablesregression with overidentifying restrictions. Journal of theAmerican Statistical Association, 102(479):1049–1058.

    Takagi, S., Baba, S., Iwai, N., Fukuda, M., Katsuya, T., Higaki, J.,Mannami, T., Ogata, J., Goto, Y., and Ogihara, T. (2001). The

    34 / 34

  • aldehyde dehydrogenase 2 gene is a risk factor for hypertensionin japanese but does not alter the sensitivity to pressor effects ofalcohol: the suita study. Hypertension Research, 24(4):365–370.

    Wald, A. (1940). The fitting of straight lines if both variables aresubject to error. The Annals of Mathematical Statistics,11(3):284–300.

    34 / 34