non-iterative, regression-based estimation of … regression-based estimation of haplotype...

Non-iterative, regression-based estimationof haplotype associations

Benjamin French, PhDDepartment of Biostatistics and Epidemiology

University of Pennsylvania

[email protected]

National Cancer CenterRepublic of Korea12 September 2012

G U L F O F M E X I C O

P A

C I F

I C O

C E A N

A T

L

A

N

T I

C

O

C

E A

N

L a k e S u p e r i o r

L a k

e

M i c

h i g

a n

L a k e H u r o n

L a k e

E r i e

O n t a r i o

L a k e

GULF OF ALASKAP A C I F I C O C E A N

BERING SEA

A R C T I C O C E A N

R U S S I A

CA

NA

DA

P A C I F I C O C E A N ME X

I CO

C A N A D A

C U B A

T HE B A H A M

A S

MAINE

VERMONT

NEW HAMPSHIRE

NEW YORKMASSACHUSETTS

CONNECTICUT

RHODE ISLAND

PENNSYLVANIA

NEW JERSEY

DELAWARE

MARYLAND

VIRGINIA

WESTVIRGINIA

OHIO

INDIANAILLINOIS

WISCONSIN

KENTUCKY

TENNESSEE NORTH CAROLINA

SOUTH

CAROLINA

GEORGIAALABAMA

FLORIDA

MISSISSIPPI

LOUISIANA

TEXAS

ARKANSAS

MISSOURI

IOWA

MINNESOTA

NORTH DAKOTA

SOUTH DAKOTA

NEBRASKA

KANSAS

OKLAHOMA

NEW MEXICO

COLORADO

WYOMING

MONTANA

IDAHO

UTAH

ARIZONA

NEVADA

CALIFORNIA

OREGON

WASHINGTON

DC

MI C

HI G

AN

HAWAII

ALASKA

The National Atlas of the United States of AmericaU.S. Department of the InteriorU.S. Geological Survey

Where We Arenationalatlas.gov TM

OR

states2.pdf INTERIOR-GEOLOGICAL SURVEY, RESTON, VIRGINIA-2003

PACIFIC OCEA N

AT

LA

NT

I CO

CE

AN

HAWAII

ALASKA

200 mi0

200 km0

100 mi0

100 km0

300 mi0

300 km2000 100

100 200

STATES

Collaborators

Nandita Mitra, PhDDepartment of Biostatistics and EpidemiologyUniversity of Pennsylvania

Thomas P Cappola, MDPenn Cardiovascular InstituteUniversity of Pennsylvania

Thomas Lumley, PhDDepartment of StatisticsUniversity of Auckland

B French ([email protected]) Estimating haplotype associations September 2012 3 / 31

Goals

• To integrate haplotypes into large association studies such thathaplotype imputation is done once as a data-processing step

I Case-control studies (binary outcome) [French et al., 2006]I Prospective studies (censored survival outcome) [French et al., 2012]

• To allow haplotype associations to be estimated in general-purposestatistical software (eg R) by researchers expert in the subject matter

“In world historical terms there is a lot to be saidfor keeping data analysis out of the hands of statisticians”

— Thomas Lumley


Phase ambiguity

• Observed data is composed of a set of unphased genotypes

• Diplotype (pair of haplotypes) may be ambiguous; may not knowwhich allele was transmitted from maternal or paternal chromosome

• Missing data problem; impute the unobserved diplotype

Father Mother

Child

A T

C G

A T

C G

AT/CG or

AG/CT


Haplotype imputation

Expectation-maximization (EM) algorithm

• E: calculate expected phase given haplotype frequencies

• M: calculate MLEs for haplotype frequencies given phase

• Software: haplo.stats [Sinnwell and Schaid, 2012]

Bayesian algorithm

• Observed genotype data combined with expected haplotype patterns

• Haplotypes estimated from posterior distribution

• Software: PHASE [Stephens and Donnelly, 2003]


Diplotype uncertainty

Angiotensin II receptor type 1 (AGTR1)

Haplotype Diplotype

Label Haplotype frequency probability

D TCCACGCATCTT 0.139 0.81

F TCTGTGCATCTC 0.290

C TCCACGCATCTC 0.034 0.19

G TCTGTGCATCTT 0.272

Rare TCCGCGCATCTC < 0.001 < 0.01

Rare TCTATGCATCTT < 0.001


Target of inference

If diplotypes were known, could fit a standard regression model includingdiplotypes Di and environmental exposures Zi (possibly with interaction)

• Logistic regression: Yi = {1 = case, 0 = control}

logit P[Yi = 1] = α + DiβD + Ziβ

Z + · · ·

• Cox regression: Yi = min(Ti ,Ci ); event indicator δi = 1[Yi = Ti ]

log λi (t) = log λ0(t) + DiβD + Ziβ

Z + · · ·

in which β = {βD , βZ} represent association parameters of interest

? We integrate the set of imputed haplotypes into these regression models


Estimating associations

Non-iterative weighted estimation [French et al., 2006]

1. Impute haplotypes and estimate population haplotype frequencies

2. Create multi-record data for each individual

I Design matrix: set of diplotypes consistent with observed genotype,possibly including environmental exposures

I Weights equal to conditional probability of each diplotype

Weight A B C D E F G H I Rare

0.81 0 0 0 1 0 1 0 0 0 0

0.19 0 0 1 0 0 0 1 0 0 0

0.01 0 0 0 0 0 0 0 0 0 2

3. Estimate associations using a weighted regression model

I Logistic regression for binary outcomesI Cox regression for censored survival outcomesI Robust or ‘sandwich’ standard error estimatorI Account for uncertainty in phase



Non-iterative weighted estimation

• Estimates are obtained from estimating functions

I Logistic regression for binary outcomes

U(β) =n∑

i=1

∑d∈d(Gi )

πid(pid , β)Xid [Yid − expitXidβ]

I Cox regression for censored survival outcomes

U(β) =n∑

i=1

∑d∈d(Gi )

∂lid(β)

∂β

lid(β) = δidπid(pid , β)

Xidβ − log∑

i ′∈R(ti )

expXi ′dβ

• Inference based on univariable or multivariable Wald tests

• Estimator is valid and efficient under the null hypothesis



Weighted haplotype combination [Rebbeck et al., 2009]

1. Impute haplotypes and estimate population haplotype frequencies

2. Create single-record data for each individual

I Design matrix: weighted combination of diplotypes consistentwith observed genotype, with weights equal to conditional probabilityof each diplotype, possibly including environmental exposures

A B C D E F G H I Rare

0 0 0.19 0.81 0 0.81 0.19 0 0 0.01

3. Estimate associations using an unweighted regression model

I Logistic regression for binary outcomesI Cox regression for censored survival outcomesI Account for uncertainty in phase



Weighted haplotype combination

• Estimates are obtained from estimating functions

I Logistic regression for binary outcomes

U(β) =n∑

i=1

Xi [Yi − expitXiβ]

I Cox regression for censored survival outcomes

U(β) =n∑

i=1

∂li (β)

∂β

li (β) = δi

Xiβ − log∑

i ′∈R(ti )

expXi ′β

• Inference based on univariable or multivariable Wald tests

• Estimator is exactly valid under the null hypothesis


Simulation study

In which situations do non-iterative methods provide reliable results?

Angiotensin II receptor type 1 (AGTR1)

Label Haplotype Frequency

A ATTATGCATCTC 0.029

B ATTATGTGATCC 0.051

C TCCACGCATCTC 0.027

D TCCACGCATCTT 0.090

E TCTGTGCAACTT 0.029

F* TCTGTGCATCTC 0.223

G TCTGTGCATCTT 0.188

H TTTACACATCTC 0.038

I TTTACACATCTT 0.032

* Referent

n = 500, 25% censoring (independent)


Simulation results: Moderate effects

HaplotypeHazard ratio

App

roxi

mat

e bi

as

−2

−1

01

2

A1/2.0 B1/1.75 C1.25 D1.5 E1/1.25 G1/1.5 H1.75 I2.0

Weighted estimationWeighted combinationKnown phase



% coverage of 95% confidence intervals

Hazard Weighted Weighted Known

Haplotype ratio estimation combination phase

A 1/2.0 94 95 95

B 1/1.75 94 95 95

C 1.25 80 95 95

D 1.5 93 94 95

E 1/1.25 93 95 94

G 1/1.5 89 94 95

H 1.75 86 96 95

I 2.0 92 95 96



Rejection rate (%) of two-sided hypothesis tests



A 1/2.0 90 91 92

B 1/1.75 91 93 94

C 1.25 62 19 24

D 1.5 92 90 93

E 1/1.25 15 15 18

G 1/1.5 95 96 98

H 1.75 99 82 96

I 2.0 96 93 98


Simulation results: Strong SNP effects


App

roxi

mat

e bi

as

−2

−1

01

2

A1.0 B1.0 C1.0 D4.0 E4.0 G4.0 H1.0 I4.0







A 1.0 93 94 94

B 1.0 94 94 94

C 1.0 92 93 95

D 4.0 94 95 95

E 4.0 92 94 95

G 4.0 94 94 95

H 1.0 94 94 95

I 4.0 92 93 92






A 1.0 7 6 6

B 1.0 6 6 6

C 1.0 8 7 5

D 4.0 100 100 100

E 4.0 100 100 100

G 4.0 100 100 100

H 1.0 6 6 5

I 4.0 100 100 100


Simulation results: Strong non-SNP effects


App

roxi

mat

e bi

as

−2

−1

01

2

A1.0 B1.0 C4.0 D1.0 E4.0 G4.0 H4.0 I1.0







A 1.0 93 95 95

B 1.0 90 95 95

C 4.0 0 96 94

D 1.0 94 95 94

E 4.0 65 96 95

G 4.0 0 91 95

H 4.0 0 88 94

I 1.0 81 89 95






A 1.0 7 5 5

B 1.0 10 5 5

C 4.0 37 100 100

D 1.0 6 5 6

E 4.0 100 100 100

G 4.0 100 100 100

H 4.0 83 100 100

I 1.0 19 11 5


Application

HSPB7-CLCNKA haplotypes and adverse events in chronic heart failure

• Regulate renal potassium channels to control blood pressure

• SNP associated with heart failure in a large case-control study[Cappola et al., 2011]

• Genotypes available for 1150 genetically inferred Caucasianswith heart failure enrolled in a prospective study

• 70% male; median age at study entry, 58 years

• 14 pre-selected SNPsInferred 10 common haplotypes (frequency > 0.02)

• 65% had an unambiguous diplotype90% had a highest posterior probability > 0.765


Application

HSPB7-CLCNKA haplotypes and adverse events in chronic heart failure

• Outcome: time to all-cause mortality or cardiac transplantation

I Median follow-up, 3 years; maximum, 5 years

I 22% experienced an adverse event

• Cox regression framework

I Stratified by 4-level classification for disease severity

I Adjusted for gender, age, heart failure etiology, clinical site

I Time-varying covariate for age (exhibited non-proportional hazards)


Application results: Weighted estimation

Label Haplotype Frequency HR (95% CI) P

Q AGAGCGAGACGAGG 0.036 1.19 (0.80, 1.77) 0.39

R AGAGCGAGGGAAGG 0.160 1.04 (0.80, 1.35) 0.79

S AGAGCGGAGCAAGA 0.036 1.20 (0.80, 1.77) 0.38

T AGCGAGAGGCAAGA 0.066 0.55 (0.34, 0.88) 0.01

U GACGCGGAGCGCGG 0.063 0.80 (0.53, 1.20) 0.28

V GGAACAAGGGAAGG 0.037 0.49 (0.26, 0.92) 0.03

W GGAACAGAGCAAGA 0.299 Referent

X GGAACAGAGCAAGG 0.048 1.36 (0.95, 1.95) 0.09

Y GGAGCAAGGCAAGG 0.050 1.15 (0.78, 1.69) 0.49

Z GGCGCGGAGCAAGG 0.031 1.09 (0.62, 1.92) 0.76

Overall 0.02


Application results: Weighted combination

Label Haplotype Frequency HR (95% CI) P

Q AGAGCGAGACGAGG 0.036 1.19 (0.76, 1.87) 0.44

R AGAGCGAGGGAAGG 0.160 1.05 (0.81, 1.37) 0.73

S AGAGCGGAGCAAGA 0.036 1.20 (0.74, 1.94) 0.47

T AGCGAGAGGCAAGA 0.066 0.53 (0.32, 0.90) 0.02

U GACGCGGAGCGCGG 0.063 0.80 (0.52, 1.21) 0.29

V GGAACAAGGGAAGG 0.037 0.43 (0.21, 0.88) 0.02

W GGAACAGAGCAAGA 0.299 Referent

X GGAACAGAGCAAGG 0.048 1.44 (0.95, 2.19) 0.09

Y GGAGCAAGGCAAGG 0.050 1.15 (0.77, 1.72) 0.49

Z GGCGCGGAGCAAGG 0.031 1.10 (0.64, 1.87) 0.73

Overall 0.03


R packages: Non-iterative weighted estimation

haplo.ccs [French and Lumley, 2011]

• Weighted logistic regression for binary outcomes

• Depends on haplo.stats package to impute haplotypes

• Calls glm(..., family=quasibinomial(link=logit))

• Includes GEE-type ‘sandwich’ standard error estimator

haplo.cph (in process)

• Weighted Cox regression for censored survival outcomes

• Will depend on haplo.stats package to impute haplotypes

• Will call cph(..., robust=TRUE) from Design package

• Allow stratification and time-varying exposures


Conclusions

• Non-iterative estimation methods

I Incorporate all diplotypes consistent with the observed genotype,either weighted (weighted estimation) or averaged (weightedcombination) by the conditional probability of each diplotype

I Provide valid tests for genetic associations and reliable estimatesof modest genetic effects of common haplotypes

• Convenient regression framework

I Adjustment for or interaction with environmental exposuresI Stratification and time-varying exposures in Cox regression

• Straightforward to implement in R

I haplo.ccs for binary outcomesI haplo.cph for censored survival outcomes

• See [French et al., 2006; 2012] for implementation in Stata


Target

Haplotype frequency

“Rel

ativ

e ris

k”

High

Common Rare

May not exist

Mostly known

Our focus

Probably can’t be found

Low


Limitations

Our methods and/or software may not be applicable to

• Related individuals

• Rare haplotypes

• Small studies

• Longitudinal outcomes


References

dbe.med.upenn.edu/biostat-research/bcfrench

1. Cappola TP, Matkovich SJ, et al. 2011. Loss-of-function DNA sequence variantin the CLCNKA chloride channel implicates the cardio-renal axis in interindividualheart failure risk variation. Proc Natl Acad Sci U S A 108:2456–61.

2. French B, Lumley T. 2011. haplo.ccs: Estimate haplotype relative risksin case-control data. R package 1.3.1.

3. French B, Lumley T, et al. 2006. Simple estimates of haplotype relative risksin case-control data. Genet Epidemiol 30:485–94.

4. French B, Lumley T, et al. 2012. Non-iterative, regression-based estimationof haplotype associations with censored survival outcomes. Stat App Genet MolBiol 11:4.

5. Rebbeck TR, Mitra N, et al. 2009. Modification of ovarian cancer riskby BRAC1/2-interacting genes in a multi center cohort of BRAC1/2 mutationcarriers. Cancer Res 71:5792–805.

6. Sinnwell JP, Schaid DJ. 2012. haplo.stats: Statistical analysis of haplotypes withtraits and covariates when linkage phase is ambiguous. R package version 1.5.5.

7. Stephens M, Donnelly P. 2003. A comparison of Bayesian methods for haplotypereconstruction from population genotype data. Am J Hum Genet 73:1162–69.


dbe.med.upenn.edu/biostat-research/bcfrench

non-iterative, regression-based estimation of … regression-based estimation of haplotype...

Documents