MHC and the 1000 Genomes: Genotyping from Exome Data


Introduction

Genotyping HLA genes is particularly challenging because these genes are exceptionally polymorphic: the IMGT/HLA database already contains more than 10,000 reference alleles for only 34 genes. NGS reads therefore require a dedicated HLA typing algorithm, since the genes are highly homologous and their alleles can be very similar.

Our HLA typing algorithm first maps and aligns the reads to all references in the database and then chooses the best-matching allele pairs based on read coverage statistics. Since we examine only the coding regions of the genes, the most important measure during the typing phase is the extent of coverage for each exon. For correct typing we expect the key exons (exons 2 and 3 for MHC-I genes and exon 2 for MHC-II genes) to be almost completely covered by reads.
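The coverage criterion can be illustrated with a minimal sketch. This is not our production implementation: the interval representation of reads, the function names and the 0.95 threshold standing in for "almost completely covered" are all assumptions made for the example.

```python
def covered_fraction(exon, reads):
    """Fraction of the half-open exon interval [start, end) covered by >= 1 read."""
    start, end = exon
    # Clip every overlapping read to the exon and sort by start position.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in reads if s < end and e > start
    )
    covered, cursor = 0, start
    for s, e in clipped:
        if e > cursor:
            # Count only the part of the read that extends past what is already covered.
            covered += e - max(s, cursor)
            cursor = e
    return covered / (end - start)


def key_exons_covered(exons, reads, threshold=0.95):
    """True if every key exon reaches the (assumed) coverage threshold."""
    return all(covered_fraction(exon, reads) >= threshold for exon in exons)


# Toy coordinates on a hypothetical allele reference, not real exon positions.
exon2, exon3 = (100, 200), (300, 400)
reads = [(90, 150), (140, 180), (290, 410)]
print(covered_fraction(exon2, reads))           # exon 2 is only partly covered
print(key_exons_covered([exon2, exon3], reads)) # so the allele fails the check
```

In this toy case exon 3 is fully covered but exon 2 is not, which is the situation that leads to mistyping or allele dropout.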

Earlier we demonstrated for HapMap samples, using Illumina reads from the 1000 Genomes data repository, that HLA types for MHC-I genes can be estimated with more than 90% concordance at four-digit resolution. Here we present results for a further 67 samples validated by Sanger sequencing at higher (six-digit) resolution, and also for some MHC-II genes (HLA-DRB1 and HLA-DQB1). Technically, all genes in the IMGT/HLA database can be typed, as we show for MICA, MICB, TAP1 and TAP2.

Methods and Results

Data from the 1000 Genomes repository (paired Illumina reads) were first filtered to find the subset of reads that map to the IMGT/HLA database references. Only this filtered subset was used for typing. Details of the typing algorithm can be found in our previous publication (Major et al. 2013).
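As an illustration of the filtering idea only, the sketch below keeps reads that share at least one exact k-mer with any reference sequence. Exact k-mer matching is a simplified stand-in chosen for the example, not the mapper used in this work; the function names, the value of k and the toy sequences are all hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Union of the k-mers of all reference allele sequences."""
    index = set()
    for seq in references:
        index |= kmers(seq, k)
    return index


def filter_reads(reads, index, k, min_hits=1):
    """Keep reads that share >= min_hits k-mers with the index on either strand."""
    kept = []
    for read in reads:
        forward = len(kmers(read, k) & index)
        reverse = len(kmers(reverse_complement(read), k) & index)
        if max(forward, reverse) >= min_hits:
            kept.append(read)
    return kept


# Toy data: one made-up "reference" and three made-up reads.
index = build_index(["ACGTACGTTAGC"], k=5)
print(filter_reads(["CGTACGT", "GGGGGGG", "GCTAAC"], index, k=5))
```

A filter of this kind only reduces the data volume; the surviving reads still have to be aligned to the references before typing.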

Concordance values (see Table 1) for the HLA-A, -B and -C genes are similar to, though slightly higher than, those in our previous study. As the previous study involved lower-quality data and the typing algorithm has since been improved, it is reassuring that the results are concordant at six-digit resolution. Furthermore, MHC-II genes (HLA-DRB1 and HLA-DQB1) could also be typed, although many of their alleles are defined only for exon 2 in the database.

In a larger study we are likely to run into new alleles. A good example is NA19159, which apparently carries a previously unknown allele: our assignment, DRB1*13:01:01, differs from the validation value DRB1*13:35, and the alignment suggests that a novel allele is the more likely explanation (Fig. 1).
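To make the resolution levels concrete, the sketch below compares two allele names field by field. In the colon-delimited nomenclature, two-, four- and six-digit resolution correspond to the first one, two and three fields. The helper names are hypothetical, and the choice to compare only the fields that both names provide is an assumption made for the example.

```python
def parse_allele(name):
    """Split 'DRB1*13:01:01' into ('DRB1', ['13', '01', '01'])."""
    gene, _, fields = name.partition("*")
    return gene, fields.split(":")


def concordant(call, validation, n_fields):
    """Compare two allele names on their first n_fields fields.

    If one name has fewer fields than requested, only the fields
    present in both names are compared (an assumption of this sketch).
    """
    gene_a, fields_a = parse_allele(call)
    gene_b, fields_b = parse_allele(validation)
    if gene_a != gene_b:
        return False
    depth = min(n_fields, len(fields_a), len(fields_b))
    return fields_a[:depth] == fields_b[:depth]


call, validation = "DRB1*13:01:01", "DRB1*13:35"
print(concordant(call, validation, 1))  # two-digit level: both are DRB1*13
print(concordant(call, validation, 2))  # four-digit level: 13:01 vs 13:35
```

For NA19159 the two assignments agree at the two-digit level and diverge at the four-digit level.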

Since MICA, MICB, TAP1 and TAP2 are less polymorphic than the classical HLA genes, coverage is good, often even for intronic regions (Fig. 2). Although we do not have validation values for these genes, the genotyping results are robust (no quality control issues) and show little ambiguity.

Some of the 1000 Genomes whole exome samples targeted only a subset of the CCDS. Although the filtering procedure selected a number of reads from these samples, reliable HLA typing was possible only from whole-exome sequencing experiments in which the whole CCDS was targeted.

Conclusions

• It is possible to get high resolution (six digits) HLA types from whole exome experiments

• The main reason for mistyping or allele dropout is missing reads from exon 2 or exon 3

• Besides HLA-A, HLA-B and HLA-C, other important genes are available for typing such as HLA-DRB1 and HLA-DQB1 from the MHC-II region

• It is possible to find novel alleles using whole exome sequencing reads

• Typing non-classical MHC genes such as MICA, MICB, TAP1 and TAP2 from NGS reads also appears feasible, though validation by other methods is needed

• The targeting kit has to cover the MHC region: reads that can be mapped and aligned to the IMGT/HLA database can still be filtered from other experiments, but they are not usable for real typing since they mostly cover conserved regions

References

• 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73. doi: 10.1038/nature09534.

• Major E, Rigó K, Hague T, Bérces A, Juhos S: HLA typing from 1000 genomes whole genome and whole exome Illumina data. PLoS One. 2013 Nov 6;8(11):e78410. doi: 10.1371/journal.pone.0078410. eCollection 2013.

MHC AND THE 1000 GENOMES: GENOTYPING FROM EXOME DATA

Sz. Juhos1, K. Rigo1, P.-A. Gourraud2, Gy. Horvath1

1 Omixon Biocomputing Ltd, Budapest, Hungary

2 University of California San Francisco, CA, United States

Corresponding author:

[email protected]

Coverage of MICB alleles

Fig 2: Coverage of MICB alleles. Reads are filtered from whole exome sequencing experiments. Exons are evenly covered, together with some intronic parts

Concordance values for MHC-I and MHC-II genes

Gene        Mistypings    Concordance
HLA-A       3             98.00%
HLA-B       5             95.90%
HLA-C       8             93.55%
HLA-DQB1    3             98.00%
HLA-DRB1    2             98.51%

Mistypings are counted over ~67 samples; not all the samples passed QC.

Table 1: Concordance values for MHC-I and MHC-II genes. The validation values were obtained by Sanger sequencing based HLA typing
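The poster does not spell out how the concordance percentages are computed. One reading that reproduces the HLA-DRB1 row is a per-allele count: two alleles per typed sample, with each mistyped allele counted once. The sketch below works through that case; the function name and the two-alleles-per-sample denominator are assumptions. The per-gene numbers of samples passing QC are not listed, so the other rows are not recomputed here.

```python
def concordance_percent(mistyped_alleles, typed_samples, alleles_per_sample=2):
    """Percentage of allele calls that agree with the validation typing."""
    total_alleles = typed_samples * alleles_per_sample
    return 100.0 * (1 - mistyped_alleles / total_alleles)


# HLA-DRB1: 2 mistypings, assuming all 67 samples passed QC for this gene.
print(f"{concordance_percent(2, 67):.2f}%")
```

With 2 mistypings out of 134 allele calls this gives 98.51%, matching the HLA-DRB1 value in Table 1.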

A putative new HLA-DRB1 allele

Fig 1: A putative new HLA-DRB1 allele typed as DRB1*13:01:01 by our method and as DRB1*13:35 by capillary sequencing
