Identifying Biologically Relevant Amino Acids in Immunogenetic Studies
Richard M. Single
Department of Mathematics and Statistics
University of Vermont
• HLA background and nomenclature
• Asymmetric Linkage Disequilibrium (ALD)
– Motivation, Definition & Example
• Amino acid level analyses of HLA disease associations
– SFVT Analysis & Pairwise allele level analyses
– Conditional Haplotype analyses & ALD
• Identifying units of selection
– ALD as a tool
Outline
[Figure: HLA class I and HLA class II molecules, each presenting a peptide fragment to a T-cell receptor. TCR = T-cell receptor; β2-m = β2-microglobulin]
HLA molecules are cell-surface proteins that present peptide fragments to T-cells
• HLA molecules bind specific sets of peptides (based on structure)
• Any given HLA allele encodes a molecule that presents a subset of the available peptides to T-cells
Example: HLA-A*24:02:01:02L
Locus: HLA-A
Field 1 (2-Digit): Serological level (where possible)
Field 2 (4-Digit): Peptide level (amino acid difference)
Field 3 (6-Digit): Nucleotide level [silent] (synonymous substitutions)
Field 4 (8-Digit): Intron level (3’ or 5’ polymorphism)
Expression suffix: N = null, L = low, S = soluble, …
• For most analyses, we want to distinguish among unique peptide sequences, i.e., 2 fields (“4-digit”) level
• This level of resolution treats alleles with the same peptide sequence for exons 2 & 3 (class I) or exon 2 (class II) as being equivalent [“binning” alleles]
HLA Allele Nomenclature
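The field structure above can be illustrated with a small parser. This is a minimal sketch, assuming colon-delimited names of the form shown on the slide; the names `HLAName`, `parse_hla`, and `two_field` are hypothetical helpers, not part of any HLA tool. Truncating to two fields is one simple way to treat alleles with the same peptide sequence as equivalent.

```python
import re
from typing import NamedTuple, Optional, Tuple


class HLAName(NamedTuple):
    locus: str                  # e.g. "A" or "DRB1"
    fields: Tuple[str, ...]     # e.g. ("24", "02", "01", "02")
    expression: Optional[str]   # "N", "L", "S", ... or None


# Optional "HLA-" prefix, locus, "*", one to four colon-separated
# numeric fields, and an optional one-letter expression suffix.
_PATTERN = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+(?::\d+){0,3})([A-Z])?$")


def parse_hla(name: str) -> HLAName:
    match = _PATTERN.match(name.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a colon-delimited HLA allele name: {name!r}")
    locus, fields, suffix = match.groups()
    return HLAName(locus, tuple(fields.split(":")), suffix)


def two_field(name: str) -> str:
    """Reduce a name to the 2-field ("4-digit") peptide level.

    The expression suffix is dropped here; an analysis may prefer to
    handle null (N) alleles separately.
    """
    parsed = parse_hla(name)
    return f"{parsed.locus}*{':'.join(parsed.fields[:2])}"


print(parse_hla("HLA-A*24:02:01:02L"))
print(two_field("HLA-A*24:02:01:02L"))
```

Names written in the older digit-only style (e.g. A*0201) would need a separate conversion step before they could be parsed this way.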
HLA Nomenclature and why it matters
• Challenges for HLA data management and analysis
– The HLA genes are very polymorphic;
– HLA nomenclature is complicated;
– There are multiple ways to generate HLA data;
– All common typing systems generate ambiguous data;
– There are multiple ways to report alleles and ambiguities;
These issues make meta-analyses of HLA data from different sources very difficult.
Extending STREGA to Immunogenomic Studies
• The STrengthening the REporting of Genetic Association studies (STREGA) statement provides community-based data reporting and analysis standards for genomic disease association studies
• The IDAWG (immunogenomics.org) has proposed an extension of STREGA: STrengthening the REporting of Immunogenomic Studies (STREIS)
From STREGA to STREIS
Extensions to the STREGA guidelines for immunogenomic data include:
• Describing the system(s) used to store, manage, and validate genotype and allele data
• Documenting all methods applied to resolve ambiguity
• Defining any codes used to represent ambiguities
- e.g., NMDP codes
- A*0201/0209/0266 = A*02AJEY
- A*0201/0209/0266/0275/0289 = A*02BSFJ
• Describing any binning or combining of alleles into common categories
- e.g., G-codes
- A*0201/0209/0243N/0266/0275/0283N/0289 = “A020101g”
• Avoiding the use of subjective terms (e.g. high-resolution typing), that may change over time
• Immunology Database and Analysis Portal (www.ImmPort.org)
Developed under the Bioinformatics Integration Support Contract (BISC) for NIH, NIAID, & DAIT (Division of Allergy, Immunology, and Transplantation)
– Data validation pipeline
– Analysis tools
– Standardized ambiguity reduction tools
– Data from a large number of immunogenomic studies
• ImmunoGenomics Data Analysis Working Group (www.immunogenomics.org) (www.IgDAWG.org)
An international collaborative group working to …
– facilitate the sharing of immunogenomic data (HLA, KIR, etc.) and
– foster consistent analysis and interpretation of immunogenomic data
Resources for HLA Data Validation & Analysis
Asymmetric Linkage Disequilibrium (ALD)
- Standard LD measures give an incomplete description of the correlation of genetic variation at two loci when there are different numbers of alleles at the loci.
- We developed a pair of conditional asymmetric LD (ALD) measures that more accurately capture this information.
- For disease association studies, the ALD can help to identify when stratification analyses can be applied to detect primary disease predisposing genes.
- For evolutionary studies, the ALD can be informative for the study of forces such as selection acting on individual amino acids, or other loci in high LD.
- For SNP studies, ALD measures can be used for analyses of LD between haplotype blocks, for SNP–gene LD, and for haplotype block–gene LD.
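The two most common measures of the strength of LD are:
(1) the normalized measure of the individual LD values, namely $D'_{ij} = D_{ij} / D_{max}$ (Lewontin 1964); and
(2) the correlation coefficient r for bi-allelic data, which is most often reported as $r^2 = D^2 / (p_{A1} p_{A2} p_{B1} p_{B2})$.
$r^2 = 1$ only when the allelic variations at the two loci show 100% correlation.
Their multi-allelic extensions are:
$$D' = \sum_{i=1}^{I} \sum_{j=1}^{J} p_i q_j \, |D'_{ij}|$$
$$W_n = \left[ \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} D_{ij}^2 / (p_i q_j)}{\min(I-1,\, J-1)} \right]^{1/2} = \left[ \frac{X^2_{LD} / 2N}{\min(I-1,\, J-1)} \right]^{1/2}$$
where $p_i$ and $q_j$ are the allele frequencies at the two loci, which have $I$ and $J$ alleles respectively.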
Linkage Disequilibrium (LD) Measures
• When there are different numbers of alleles at two loci, the direct correlation property for the r measure is not retained.
• The asymmetric LD (ALD) measures more accurately reflect covariation at two loci.
- WA/B and WB/A describe variation observed at the 1st locus conditioned on the 2nd
• Example: (two and three alleles at the A and B loci)
f(A1B1) = 0.3, f(A2B2) = 0.5, f(A2B3) = 0.2
Wn = 1, WA/B = 1, and WB/A = 0.73
Because there is variation at the B locus on haplotypes containing the A2 allele, the correlation is not 100%.
- ALD measures indicate that, with appropriate sample size, stratification analyses could be carried out for some comparisons.
- Wn = 1 could result in passing over these data for conditional analyses.
Asymmetric LD measures: WA/B and WB/A
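The two-and-three-allele example can be checked numerically. The following is a minimal sketch that applies the homozygosity-based ALD definitions from Thomson and Single (2014), as given in Table 1 of this talk, to the three haplotype frequencies above; the function name `ald` and the dictionary layout are assumptions made for illustration, and both loci are assumed to be polymorphic.

```python
from math import sqrt


def ald(hap_freqs):
    """Return (W_A/B, W_B/A, Wn) from {(a_allele, b_allele): frequency}."""
    p_a, p_b = {}, {}
    for (a, b), f in hap_freqs.items():
        p_a[a] = p_a.get(a, 0.0) + f
        p_b[b] = p_b.get(b, 0.0) + f

    def conditional_w(p_x, p_y, x_is_first):
        # F_X: single-locus homozygosity at the locus being described.
        f_x = sum(p * p for p in p_x.values())
        # F_X/Y = sum_j p_Yj * sum_i (f_ij / p_Yj)^2 = sum_ij f_ij^2 / p_Yj
        f_x_given_y = 0.0
        for y, p_y_val in p_y.items():
            for x in p_x:
                key = (x, y) if x_is_first else (y, x)
                f_x_given_y += hap_freqs.get(key, 0.0) ** 2 / p_y_val
        # Guard against tiny negative values from floating-point rounding.
        return sqrt(max(0.0, (f_x_given_y - f_x) / (1.0 - f_x)))

    w_a_given_b = conditional_w(p_a, p_b, x_is_first=True)
    w_b_given_a = conditional_w(p_b, p_a, x_is_first=False)

    # Symmetric multi-allelic measure Wn.
    total = 0.0
    for a, pa in p_a.items():
        for b, pb in p_b.items():
            d = hap_freqs.get((a, b), 0.0) - pa * pb
            total += d * d / (pa * pb)
    wn = sqrt(total / min(len(p_a) - 1, len(p_b) - 1))
    return w_a_given_b, w_b_given_a, wn


example = {("A1", "B1"): 0.3, ("A2", "B2"): 0.5, ("A2", "B3"): 0.2}
w_ab, w_ba, wn = ald(example)
print(round(w_ab, 2), round(w_ba, 2), round(wn, 2))
```

For these frequencies the sketch reproduces the values on the slide: each B allele occurs with a single A allele, so WA/B = 1, while the A2 haplotypes carry both B2 and B3, which pulls WB/A below 1.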
Standard LD measures D’ and Wn
Standard LD measures (overall D’ & Wn) assume/force symmetry, even though with >2 alleles per locus that is not the case
Data Source: ImmPort Study SDY26: Identifying polymorphisms associated with risk for the development of myopericarditis following smallpox vaccine
Asymmetric Linkage Disequilibrium (ALD)
Interpretation:
ALD for HLA-DRB1 conditioning on HLA-DQA1: W_DRB1/DQA1 = 0.58
ALD for HLA-DQA1 conditioning on HLA-DRB1: W_DQA1/DRB1 = 0.95
The overall variation for DRB1 is relatively high given specific DQA1 alleles.
The overall variation for DQA1 is relatively low given specific DRB1 alleles.
[Figure: ALD matrix; row gene conditional on column gene]
Asymmetric Linkage Disequilibrium (ALD)
Table 1. Linkage disequilibrium and genetic diversity measures
Description: Definition of measure
1. Single locus homozygosity (F): $F_A = \sum_i p_{Ai}^2$
2. Haplotype specific homozygosity (HSF): $F_{A/Bj} = \sum_i (f_{ij} / p_{Bj})^2$
3. Overall weighted HSF values $F_{A/B}$ (and $F_{B/A}$): $F_{A/B} = \sum_j F_{A/Bj} \, p_{Bj} = F_A + \sum_i \sum_j D_{ij}^2 / p_{Bj}$
4. Multi-allelic ALD squared $W_{A/B}^2$ (and $W_{B/A}^2$): $W_{A/B}^2 = (F_{A/B} - F_A) / (1 - F_A)$
Thomson and Single(2014) Genetics
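The second equality in row 3 of Table 1 is stated without its intermediate steps. A short derivation, sketched here with the table's symbols and assuming the usual definition $D_{ij} = f_{ij} - p_{Ai} p_{Bj}$, runs as follows:

```latex
\begin{align*}
F_{A/B} &= \sum_j p_{Bj} \sum_i \left( \frac{f_{ij}}{p_{Bj}} \right)^2
         = \sum_i \sum_j \frac{f_{ij}^2}{p_{Bj}} \\
        &= \sum_i \sum_j \frac{(p_{Ai} p_{Bj} + D_{ij})^2}{p_{Bj}} \\
        &= \sum_i p_{Ai}^2 \sum_j p_{Bj}
           + 2 \sum_i p_{Ai} \sum_j D_{ij}
           + \sum_i \sum_j \frac{D_{ij}^2}{p_{Bj}} \\
        &= F_A + \sum_i \sum_j \frac{D_{ij}^2}{p_{Bj}},
\end{align*}
% using \sum_j p_{Bj} = 1 and \sum_j D_{ij} = 0 for every i.
```

Substituting this into row 4 gives $W_{A/B}^2 = \left[ \sum_i \sum_j D_{ij}^2 / p_{Bj} \right] / (1 - F_A)$, so the squared ALD is zero exactly when every $D_{ij}$ is zero.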
If both loci are bi-allelic: $W_{A/B}^2 = \left[ \sum_i \sum_j D_{ij}^2 / p_{Bj} \right] / (1 - F_A) = D^2 / (p_{A1} p_{A2} p_{B1} p_{B2}) = r^2$, since $D_{11} = -D_{12} = -D_{21} = D_{22} = D$.
Thomson and Single(2014) Genetics
Other Conditional Measures of LD
• Other conditional measures of LD have been proposed (Nei and Li, 1980; Chakravarti et al., 1984; Hudson, 1985; Kaplan and Weir, 1992; Guo, 1997).
- They measure association between alleles at a marker locus (locus B) and alleles at a disease locus (locus A).
- They were developed to account for study designs in which individuals are not randomly sampled from a single population, but where sampling intensity varies within disease categories.
- They are equivalent to Somers' D statistic defined on the contingency table relating two categorical variables.
• In contrast, our statistic is a population-based measure that does not depend on a specific patient sampling scheme.
ALD & tag-SNPs in the HLA region
• de Bakker et al. (2006) identified tag-SNPs based on r2 between SNPs and recoded HLA alleles (each HLA allele recoded as present/absent)
de Bakker et al. (2006) Nature Genetics
ALD & tag-SNPs in the HLA region
Thomson and Single(2014) Genetics
Risk Category  DRB1    patients  controls  OR
I              *08:01  102       13        6.9
I              *11:04  57        11        4.3
II             *13:01  90        38        1.9
II             *11:01  60        36        1.3
II             *01:01  74        50        1.2
II             *03:01  89        61        1.1
II             *13:02  28        23        0.9
III            *04:04  7         16        0.3
III            *15:01  38        80        0.3
III            *07:01  30        65        0.3
III            *04:01  21        47        0.3
               sum     596       440
               total   708       546
Overall p-value < 2.6E-27
Juvenile Idiopathic Arthritis oligoarticular persistent (JIA-OP) Common HLA-DRB1 alleles
AA 86 implicated via pairwise within serogroup analysis
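The OR column can be reproduced from the counts in the table. The sketch below assumes each OR compares one allele against all other alleles, using the totals of 708 patient and 546 control alleles; the slide does not state this explicitly, but under that assumption the rounded values match the table. The names `odds_ratio` and `COUNTS` are illustrative.

```python
TOTAL_PATIENTS, TOTAL_CONTROLS = 708, 546

# (patients, controls) allele counts for the common DRB1 alleles in the table.
COUNTS = {
    "*08:01": (102, 13), "*11:04": (57, 11), "*13:01": (90, 38),
    "*11:01": (60, 36), "*01:01": (74, 50), "*03:01": (89, 61),
    "*13:02": (28, 23), "*04:04": (7, 16), "*15:01": (38, 80),
    "*07:01": (30, 65), "*04:01": (21, 47),
}


def odds_ratio(patients_with, controls_with):
    """Odds ratio for one allele versus all other alleles (assumed design)."""
    patients_without = TOTAL_PATIENTS - patients_with
    controls_without = TOTAL_CONTROLS - controls_with
    return (patients_with * controls_without) / (controls_with * patients_without)


for allele, (patients, controls) in COUNTS.items():
    print(allele, round(odds_ratio(patients, controls), 1))
```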
Sequence Feature Variant Type (SFVT) Analysis - Overview
• An exploratory approach for genetic association studies that uses combinations of amino acid (AA) residues as the unit of analysis.
• Goal:
– To identify biologically relevant amino acid (AA) residues that account for the major disease risk attributable to HLA
• Genes/proteins are sub-divided into biologically relevant units affecting gene expression and/or protein function (i.e., Sequence Features)
– Polymorphic AAs (single AA sites)
– Structural features (e.g., beta 1 domain, alpha-helix 2, …)
– Functional features (e.g., peptide binding, T-cell interacting, …)
– Combinational (e.g., alpha-helix 2 & peptide binding, …)
www.immport.org
Summary of SFVT Analysis
1. HLA typing (Allele-level)
2. Group HLA alleles based on structural/functional sequence motifs (Sequence Features)
3. Perform disease association tests based on sequence motifs (Sequence Feature-level)
4. Choose the top Sequence Features associated with disease risk for further study
5. Identify individual AAs & combinations of AAs directly involved in disease risk
Tools: ORs & p-values; LD patterns; Conditional/Stratification analyses
Representative Sequence Features: HLA-DRB1
Table from Karp et al. (2010) Hum Molec Genet
Sequence Feature ID | Sequence Feature Name | Sequence Feature Type | Amino Acid Position(s) | # of Variant Types
HLA-DRB1_SF1 | allele | Standard Allele Designation | NA | 497
HLA-DRB1_SF4 | mature protein | Structural - Complete protein | 1..237 | 52
HLA-DRB1_SF5 | beta 1 domain | Structural - Domain | 1..95 | 69
HLA-DRB1_SF12 | loop between beta-strands 1 & 2 | Structural - Secondary structure motif | 19, 20, 21, 22 | 5
HLA-DRB1_SF13 | beta-strand 2 | Structural - Secondary structure motif | 23..32 | 28
HLA-DRB1_SF21 | alpha-helix 2 | Structural - Secondary structure motif | 65..72 | 29
HLA-DRB1_SF128 | T cell receptor binding | Functional | 60, 64, 65, 66, 67, 69, 70, 71, 73, 76, 77, 78, 80, 81, 82, 84, 85 | 81
HLA-DRB1_SF137 | peptide antigen binding pocket 7 | Functional | 28, 30, 47, 61, 67, 71 | 53
HLA-DRB1_SF163 | alpha-helix 2_peptide antigen binding | Structural_Functional Combination | 67, 70, 71 | 21
HLA-DRB1_SF164 | alpha-helix 2_T cell receptor binding | Structural_Functional Combination | 65, 66, 67, 69, 70, 71 | 24
Variant Types for HLA-DRB1_SF153 “beta-strand 2_peptide antigen binding”
… 5 of 11 Variant Types (VTs) for Sequence Feature 153 (SF153)
DRB1_SF153_VT1 (LEC): DRB1*0101, 0102, 0103, 0104, 0105, …
DRB1_SF153_VT2 (FEL): DRB1*0113, 0701, 0703, 0704, 0705, …
DRB1_SF153_VT3 (YDY): DRB1*0301, 0304, 0305, 0306, 0308, …
Karp et al 2010 Hum Mol Gen
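Recoding alleles as variant types is the step that turns allele-level typing into sequence-feature-level data. The following is a minimal sketch using only the alleles listed on the slide for three variant types of SF153; the slide's allele lists are truncated, so any allele not shown is reported as "unmapped" rather than guessed. The sample alleles and the name `recode` are hypothetical.

```python
from collections import Counter

# Variant types of SF153 as listed on the slide: (motif, listed alleles).
# The slide elides further alleles ("…"); they are deliberately not filled in.
SF153_VARIANT_TYPES = {
    "VT1": ("LEC", ["DRB1*0101", "DRB1*0102", "DRB1*0103", "DRB1*0104", "DRB1*0105"]),
    "VT2": ("FEL", ["DRB1*0113", "DRB1*0701", "DRB1*0703", "DRB1*0704", "DRB1*0705"]),
    "VT3": ("YDY", ["DRB1*0301", "DRB1*0304", "DRB1*0305", "DRB1*0306", "DRB1*0308"]),
}

# Invert the table: allele -> variant type label.
ALLELE_TO_VT = {
    allele: vt
    for vt, (_motif, alleles) in SF153_VARIANT_TYPES.items()
    for allele in alleles
}


def recode(alleles):
    """Count variant types in a list of allele calls."""
    return Counter(ALLELE_TO_VT.get(allele, "unmapped") for allele in alleles)


# Hypothetical allele calls, for illustration only.
sample = ["DRB1*0101", "DRB1*0103", "DRB1*0701", "DRB1*1501"]
print(recode(sample))
```

Case and control counts recoded this way can then be tested for association per variant type, in the same way as allele counts.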
DRB1: AAs 13, 67, 37, 57, 74, 86 in binding pockets 6, 4, 7, and 9
DRB1               Amino Acids                  p-value   ORmax  ORmin
AA position 13     13                           2.00E-28  4.9    0.33
Pocket 6           11, 13, 30                   4.00E-28  7.1    0.31
Pocket 4           13, 26, 28, 70, 71, 74, 78   6.00E-28  6.8    0.28
DRB1 allele        9 … 86                       1.00E-27  9.4    0.28
Pocket 7           28, 30, 47, 61, 67, 71       9.00E-27  9.4    0.28
AA positions X-LD  [11, 12, 10, 16]             9.00E-25  3.2    0.33
AA position 67     67                           3.00E-17  3.4    0.54
Pocket 9           9, 37, 57                    4.00E-16  3.9    0.33
AA position 74     74                           4.00E-16  6.8    0.33
AA position 37     37                           4.00E-13  1.8    0.34
AA position 57     57                           6.00E-13  3.9    0.44
…                  …                            …         …      …
AA position 86     86                           ns        1.1    0.9
AAs underlined have a potential effect on disease risk; the effect of those in italics may be explained by LD with AA 13. Note that AA 86 is not significant (ns) by SFVT analysis.
SFVT analysis DRB1 summary for JIA-OP
SFVT Analysis - Summary
• An exploratory approach for identifying biologically relevant AAs in HLA association studies
• Pros
– Utilizes information about the inter-relationships among HLA alleles
– Covers more extended protein regions than single amino acid-based analyses
• Cons
– Care is needed to address complex patterns of LD among AAs and SFs in order to identify AAs directly involved in disease
– Due to multiple comparisons with highly correlated SFs, appropriate p-value adjustments are necessary
– The effects of some amino acids (or combinations) may be missed, so complementary analyses are useful
DRB1 Amino Acids 13 and 67
13 - 67  patients  controls  OR
G - F    108       14        6.8
S - F    130       49        2.3
S - I    131       71        1.5
G - I    13        8         1.3
S - L    102       80        1.0
R - I    44        91        0.2
others   270       233
p < 8E-9: AA 13 involved, or an AA in LD
overall p < 2E-28
Conditional Haplotype Analysis of JIA-OP
p < 0.002: AA 67 involved, or an AA in LD
p < 0.001: AA 67 involved, or an AA in LD
An extensive set of CH analyses is required, as well as consideration of LD patterns
Conditional Haplotype Analysis of JIA-OP
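The logic of a conditional haplotype comparison can be sketched with a single 2×2 test: hold one amino acid fixed and ask whether the case/control distribution differs by the other. The transcript does not say which test or which haplotype pairs produced each p-value, so the example below is an assumption: an uncorrected Pearson chi-square on the G-F and G-I haplotypes, which share G at position 13 and differ at position 67. One expected cell count is small here, so an exact test may be preferable in practice.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p-value) with 1 degree of freedom."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 df.
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


# Rows: haplotypes G-F and G-I (AA 13 fixed at G, AA 67 = F or I).
# Columns: patients, controls (counts from the table above).
statistic, p_value = chi_square_2x2(108, 14, 13, 8)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

A significant difference between haplotypes that agree at position 13 points to position 67, or to another site in LD with it, which is why a full analysis repeats the comparison across many strata and checks the LD patterns.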
LD for DRB1 AAs
[Figure: LD among DRB1 AAs in JIA controls. Panels: Wn (symmetric) and Asymmetric LD (ALD), row gene conditional on column gene]
Conditional Haplotype Analysis of JIA-OP
11_13  Cases  Controls  OR
S-G    121    22        4.89   p<3.6E-06
S-S    363    200       1.81
D-F    9      6         1.15   ns
L-F    87     66        1.01
V-H    46     84        0.38
P-R    50     99        0.34
G-Y    30     65        0.33
Total  708    546

12_13  Cases  Controls  OR
T-G    121    22        4.91   p<3.6E-06
T-S    363    200       1.82
K-F    98     76        0.994
K-H    46     84        0.382  p<1.2E-05
K-R    50     99        0.343
K-Y    30     65        0.327
Total  708    546
                AA position
OR   Allele     13  67  74  86  37  57
6.9  DRB1*0801  G   F   L   G   Y   S
4.3  DRB1*1104  S   F   A   V   Y   D
1.9  DRB1*1301  S   I   A   V   N   D
1.3  DRB1*1101  S   F   A   G   Y   D
1.2  DRB1*0101  F   L   A   G   S   D
1.1  DRB1*0301  S   L   R   V   N   D
0.9  DRB1*1302  S   I   A   G   N   D
0.3  DRB1*0404  H   L   A   V   Y   D
0.3  DRB1*1501  R   I   A   V   S   D
0.3  DRB1*0701  Y   I   Q   G   F   V
0.3  DRB1*0401  H   L   A   G   Y   D
• These alleles show the strongest evidence for direct involvement in JIA-OP disease risk
• The 6 identified AA sites uniquely define each allele, preventing further stratification analyses
Common DRB1 Alleles & AAs in JIA-OP
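Two statements about this table can be checked directly: that the six sites give every listed allele a unique motif, and which allele pairs differ at exactly one of the six sites (the kind of pair a pairwise comparison relies on). The sketch below uses only the residues and ORs in the table; it is an illustration of the idea, not the analysis used in the talk, and the function name `single_site_pairs` is hypothetical.

```python
from itertools import combinations

POSITIONS = (13, 67, 74, 86, 37, 57)

# allele: (residues at POSITIONS, OR), copied from the table above.
ALLELES = {
    "DRB1*0801": ("GFLGYS", 6.9), "DRB1*1104": ("SFAVYD", 4.3),
    "DRB1*1301": ("SIAVND", 1.9), "DRB1*1101": ("SFAGYD", 1.3),
    "DRB1*0101": ("FLAGSD", 1.2), "DRB1*0301": ("SLRVND", 1.1),
    "DRB1*1302": ("SIAGND", 0.9), "DRB1*0404": ("HLAVYD", 0.3),
    "DRB1*1501": ("RIAVSD", 0.3), "DRB1*0701": ("YIQGFV", 0.3),
    "DRB1*0401": ("HLAGYD", 0.3),
}


def single_site_pairs():
    """Allele pairs whose motifs differ at exactly one of the six sites."""
    pairs = []
    for (a1, (res1, or1)), (a2, (res2, or2)) in combinations(ALLELES.items(), 2):
        differing = [pos for pos, x, y in zip(POSITIONS, res1, res2) if x != y]
        if len(differing) == 1:
            pairs.append((a1, a2, differing[0], or1, or2))
    return pairs


motifs = [residues for residues, _ in ALLELES.values()]
print("all motifs unique:", len(set(motifs)) == len(motifs))
for a1, a2, position, or1, or2 in single_site_pairs():
    print(f"{a1} vs {a2}: differ only at AA {position} (OR {or1} vs {or2})")
```

Among these eleven alleles, the only single-site pairs are the ones that differ at position 86, which fits the earlier note that AA 86 was implicated by pairwise analysis even though it is not significant in the SFVT analysis.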
• Balancing selection can result from:
- Overdominance/Heterozygote advantage
- Frequency-dependent selection
- Selective regimes that change over time/space
• For HLA, the common factor in these models is rare allele advantage, which is consistent with a pathogen-directed frequency-dependent selection model.
• At the Amino Acid (AA) level we see
- High AA variability at antigen recognition sites (ARS)
- Relatively even AA frequencies at ARS
- Higher rates of non-synonymous vs. synonymous changes at ARS
Balancing Selection Operates at Most HLA Loci
Homozygosity (F) and the Normalized Deviate (Fnd)
[Figure: three bar charts of allele frequency by allele, one per scenario]
Neutrality: FOBS ≈ FEQ, Fnd ≈ 0
Directional Selection: FOBS > FEQ, Fnd > 0
Balancing Selection: FOBS < FEQ, Fnd < 0
$F = \sum_{i=1}^{k} p_i^2$
$F_{nd} = (F_{OBS} - F_{EQ}) / SD(F_{EQ})$
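The two formulas translate directly into code. In this minimal sketch the expected homozygosity under neutrality (FEQ) and its standard deviation are passed in as arguments, since obtaining them requires a neutral-model calculation or simulation that the slide does not describe; the numeric values used below are hypothetical.

```python
def homozygosity(allele_freqs):
    """F = sum of squared allele frequencies."""
    return sum(p * p for p in allele_freqs)


def normalized_deviate(f_obs, f_eq, sd_f_eq):
    """Fnd = (F_OBS - F_EQ) / SD(F_EQ)."""
    return (f_obs - f_eq) / sd_f_eq


# Four equally frequent alleles: the most even distribution possible for k = 4.
f_obs = homozygosity([0.25, 0.25, 0.25, 0.25])

# Hypothetical neutral expectation and standard deviation, for illustration only.
fnd = normalized_deviate(f_obs, f_eq=0.40, sd_f_eq=0.10)
print(f_obs, fnd)
```

An even frequency distribution gives a low observed F and therefore a negative Fnd, the direction associated with balancing selection in the panels above.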
Fnd for DRB1 AA sites in JIA Controls
• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.
Fnd for DRB1 AA sites (Meta-Analysis)
Fnd for all polymorphic sites in a meta-analysis of 57 populations
• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.
LD for DRB1 AAs
[Figure: LD among DRB1 AAs in JIA controls. Panels: Wn (symmetric) and Asymmetric LD (ALD), row gene conditional on column gene]
Acknowledgements
University of Sao Paulo: Diogo Meyer
University of Graz: Wolfgang Helmberg
Cincinnati Children’s Hospital: Susan Thompson, David Glass
University of Texas: Nishanth Marthandan, Paula Guidry, David Karp, Richard Scheuermann
Children's Hospital Oakland Research Inst.: Steven J. Mack, Jill A. Hollenbach
Harvard Medical School: Alex Lancaster
UC Berkeley: Glenys Thomson
UC San Francisco: Owen Solberg
Roche Molecular Systems: Henry A. Erlich
Anthony Nolan Research Inst.: Steven G.E. Marsh, Matthew Waller
NCBI/NIH: Mike Feolo
NGIT: Jeff Wiser, Patrick Dunn, Tom Smith
Distributions of Fnd values
Results from a meta-analysis of 497 HLA population studies in ten geographic regions
Solberg et al., 2008
• Cano & Fernandez-Vina (2009) described two sequence dimorphisms that define the primary immunodominant serological epitopes for HLA-DPB1.
• All DPB1 alleles can be divided into four serologic categories (DP1, DP2, DP3, and DP4):
Evidence of Balancing Selection at HLA-DPB1
                      AA position
Serological Category  56  85  86  87
DP1                   A   E   A   V
DP2                   E   G   P   M
DP3                   E   E   A   V
DP4                   A   G   P   M
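The table amounts to a lookup from four residues to a serological category. A minimal sketch follows; the function name `dp_category` is hypothetical, and residue combinations not in the table return `None` rather than being assigned to a category.

```python
# (AA 56, AA 85, AA 86, AA 87) -> serological category, from the table above.
DP_CATEGORY = {
    ("A", "E", "A", "V"): "DP1",
    ("E", "G", "P", "M"): "DP2",
    ("E", "E", "A", "V"): "DP3",
    ("A", "G", "P", "M"): "DP4",
}


def dp_category(aa56, aa85, aa86, aa87):
    """Return the DP serological category, or None if the motif is not listed."""
    return DP_CATEGORY.get((aa56, aa85, aa86, aa87))


print(dp_category("A", "E", "A", "V"))
print(dp_category("E", "G", "P", "M"))
```

Summing allele frequencies within each category gives the category-level frequencies on which a homozygosity statistic can then be computed.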
Global Distribution of DP serological categories
Fnd for DPB1 Alleles & DP Serological Categories
Evidence of Balancing Selection at HLA-DPB1
• We constructed a randomization test (“random binning” to 4 categories) to ensure that the effect was not driven by differences in the observed number of variants at the allele-level vs. serotype-level.
• Randomization tests confirmed the result more consistently for European populations than for other geographic regions
- A possible ascertainment bias? (many common alleles were first identified in European populations)
- Could natural selection favoring DPB1 diversity at the serologic level be greater in Europe?
Evidence of Balancing Selection at HLA-DPB1
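The "random binning" idea can be sketched as follows: assign alleles to four categories at random many times, compute a statistic on the binned frequencies each time, and compare the observed serological binning to that null distribution. This is a simplified illustration under stated assumptions: it uses raw homozygosity F rather than Fnd, keeps every bin non-empty, and runs on hypothetical allele frequencies and a hypothetical observed binning. It is not the procedure used in the talk.

```python
import random


def binned_homozygosity(allele_freqs, bins, n_bins):
    """Homozygosity of category frequencies after summing alleles per bin."""
    totals = [0.0] * n_bins
    for freq, b in zip(allele_freqs, bins):
        totals[b] += freq
    return sum(t * t for t in totals)


def random_binning_null(allele_freqs, n_bins=4, n_reps=1000, seed=0):
    """Null distribution of binned homozygosity under random assignment."""
    rng = random.Random(seed)
    n = len(allele_freqs)
    null = []
    for _ in range(n_reps):
        # One allele per bin guarantees no empty bin; the rest are random.
        bins = list(range(n_bins)) + [rng.randrange(n_bins) for _ in range(n - n_bins)]
        rng.shuffle(bins)
        null.append(binned_homozygosity(allele_freqs, bins, n_bins))
    return null


# Hypothetical allele frequencies and observed category assignment.
freqs = [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]
observed_bins = [0, 1, 2, 3, 0, 1, 2, 3]

observed = binned_homozygosity(freqs, observed_bins, 4)
null = random_binning_null(freqs)
# Fraction of random binnings at least as even as the observed one.
p_low = sum(1 for value in null if value <= observed) / len(null)
print(observed, p_low)
```

A small fraction would suggest the observed categories are more evenly distributed than expected from the number of categories alone, which is the question the randomization test is designed to address.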
Supplementary Figure S1. Mean Fnd values for trios of variant DPB1 Exon 2 amino acid positions
[Plot: mean Fnd (y-axis) by Amino-Acid Position Trio (x-axis); mean Fnd values in variable sets of 3 amino-acid positions vs 36/56/85 paired trios]