Characterising the human immunoglobulin heavy chain locus by ultra-deep sequencing
of rearranged immunoglobulin genes
Bruno Gaëta, School of Computer Science and Engineering, UNSW
Immunoglobulin Rearrangement
Katherine Jackson
So what do we know about the process?
• Combinatorial diversity (VDJ)
• P-addition
• N-addition
• Exonuclease action
• Somatic hypermutation
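The processes listed above can be illustrated with a toy simulation. This is a minimal sketch, not a biologically calibrated model: the gene segments are made-up placeholder strings, the trimming limit, N-region length and mutation rate are arbitrary assumptions, and P-addition is left out to keep the example short.

```python
import random

BASES = "ACGT"

def trim(segment, end, rng, max_trim=3):
    """Exonuclease action: remove 0..max_trim bases from one end."""
    n = rng.randint(0, min(max_trim, len(segment)))
    if n == 0:
        return segment
    return segment[n:] if end == "5prime" else segment[:-n]

def n_addition(rng, max_len=4):
    """N-addition: 0..max_len random, non-templated bases."""
    return "".join(rng.choice(BASES) for _ in range(rng.randint(0, max_len)))

def hypermutate(seq, rng, rate=0.02):
    """Somatic hypermutation: substitute each base independently with probability `rate`."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
        for base in seq
    )

def rearrange(v_genes, d_genes, j_genes, rng):
    """Combinatorial diversity: pick one V, one D and one J, then diversify the junctions."""
    v = trim(rng.choice(v_genes), "3prime", rng)
    d = trim(trim(rng.choice(d_genes), "5prime", rng), "3prime", rng)
    j = trim(rng.choice(j_genes), "5prime", rng)
    return hypermutate(v + n_addition(rng) + d + n_addition(rng) + j, rng)

# Hypothetical placeholder segments, NOT real IGH germline sequences.
TOY_V = ["ACGTACGTACGTAC", "ACGTTCGTACGAAC"]
TOY_D = ["GGGTTTCCC", "GGATTACCC"]
TOY_J = ["TTGACTACTGG", "TTGATTACTGG"]

rng = random.Random(0)
print(rearrange(TOY_V, TOY_D, TOY_J, rng))
```

Each call produces a different junction even when the same V, D and J are chosen, which is why identifying the germline segments in a rearranged sequence is not a simple lookup.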
Human heavy chain immunoglobulin variable region genes (IGH)
• IGHV
About 46 functional genes (7 families)
Up to 20 (?) reported alleles per gene
• IGHD
About 24 genes (7 families), 1-3 known alleles/gene
• IGHJ
6 functional genes, 1-4 known alleles/gene
Still very controversial!
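As a rough worked example, the approximate gene counts above give the number of distinct V-D-J gene combinations. This is only a back-of-the-envelope figure: it assumes one choice per gene, ignores allelic variants and D reading frames, and takes no account of junctional diversity or somatic hypermutation, all of which increase the real repertoire.

```python
def vdj_combinations(n_v, n_d, n_j):
    """Number of distinct V-D-J gene combinations (gene level only)."""
    return n_v * n_d * n_j

# Approximate functional gene counts quoted on this slide.
print(vdj_combinations(n_v=46, n_d=24, n_j=6))  # 6624
```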
The Immunoglobulin FactsBook (Lefranc and Lefranc 2001)
Characterizing variation in the IGH locus
• The Human Genome Project, HapMap and the 1000 Genomes Project have all ignored the IGH locus
• Conventional methods are difficult to apply to this locus
• Our approach focuses on mass sequencing of rearranged sequences
Data generation
Blood sample → Multiplex PCR → Sequencing (454) → Rearranged IGH gene sequences (VDJ)
Data from Stanford University (Lyndon Zhang, Katherine Jackson, Scott D. Boyd, Andrew Z. Fire)
Bioinformatics analysis
Rearranged IGH gene sequences (VDJ) → Identify germline genes → Draft genotype → Genotype → Haplotype
iHMMune-align model
iHMMune-align
• Hidden Markov model of immunoglobulin rearrangement and diversification processes
• Designed to identify the most likely germline gene segments in a rearranged Ig gene sequence and partition the sequence
• Can also be used to calculate the probability of a sequence originating from a specific germline gene
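To make the last point concrete, here is a deliberately simplified stand-in for "the probability of a sequence originating from a specific germline gene". It is not the iHMMune-align HMM: it assumes the read is already aligned to the germline with no insertions or deletions, and that every base mutates independently at a fixed, arbitrary rate. The allele names and sequences are hypothetical.

```python
import math

def log_prob_given_germline(read, germline, mutation_rate=0.05):
    """log P(read | germline) under an independent per-base substitution model."""
    if len(read) != len(germline):
        raise ValueError("this toy model needs pre-aligned, equal-length sequences")
    log_match = math.log(1.0 - mutation_rate)
    # A mutated base is assumed equally likely to become any of the other three.
    log_mismatch = math.log(mutation_rate / 3.0)
    return sum(log_match if r == g else log_mismatch for r, g in zip(read, germline))

def most_likely_germline(read, germlines):
    """Return the name of the germline sequence that best explains the read."""
    return max(germlines, key=lambda name: log_prob_given_germline(read, germlines[name]))

# Hypothetical toy alleles, NOT real IGHV sequences.
toy_germlines = {"toyV*01": "ACGTACGT", "toyV*02": "ACGTTCGT"}
print(most_likely_germline("ACGTTCGT", toy_germlines))  # toyV*02
```

A full HMM additionally models where each segment ends and where N-regions begin, which this per-base score cannot do.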
iHMMune-align HMM topology
Gaëta et al (2007) Bioinformatics 23:1580
Genotyping
• Find the combination of alleles most likely to generate the observed data
IGHV Genotyping
• Pre-align sequences (Vmatch) with the IGHV repertoire to filter out unlikely alleles (draft genotype)
• Calculate P(s_i | g_n), the probability of sequence s_i originating from germline allele g_n, using the iHMMune-align model
• Calculate likelihood of sequence set for each combination of alleles in the draft genotype
• Select most likely genotype
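The last two steps can be sketched as follows. This is an illustration under stated assumptions, not the method used in the talk: each read is treated as drawn from an equal-weight mixture of the alleles in a candidate genotype, reads are independent, a simple per-base substitution score stands in for the iHMMune-align probability, and candidates are limited to at most two alleles of a single gene. All names and sequences are hypothetical.

```python
import math
from itertools import combinations

def log_p_read(read, allele_seq, mu=0.05):
    """Stand-in for log P(s_i | g_n): independent per-base substitutions at rate mu."""
    return sum(
        math.log(1.0 - mu) if r == g else math.log(mu / 3.0)
        for r, g in zip(read, allele_seq)
    )

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def genotype_log_likelihood(reads, genotype, alleles):
    """Sum over reads of log P(read | equal-weight mixture of the genotype's alleles)."""
    return sum(
        log_mean_exp([log_p_read(read, alleles[name]) for name in genotype])
        for read in reads
    )

def best_genotype(reads, alleles, max_alleles=2):
    """Score every combination of draft-genotype alleles and keep the most likely one."""
    candidates = [
        combo
        for k in range(1, max_alleles + 1)
        for combo in combinations(sorted(alleles), k)
    ]
    return max(candidates, key=lambda g: genotype_log_likelihood(reads, g, alleles))

# Hypothetical draft genotype and reads.
draft = {"toyV*01": "ACGTACGT", "toyV*02": "ACGTTCGT", "toyV*03": "TTGTACGA"}
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"] * 5
print(best_genotype(reads, draft))  # ('toyV*01', 'toyV*02')
```

Because the mixture weight is shared among the alleles in a genotype, adding an allele that explains no reads lowers the likelihood, so the search does not simply favour the largest combination.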
IGHD Genotyping
• IGHD genes are very short and difficult to identify unambiguously: use a combination of iHMMune-align (with only the IGHV alleles present in the genotype) and specific pattern searches
• Calculate P(s_i | g_n) using a simplified iHMMune-align model
IGHJ Genotyping
• Similar to IGHV genotyping but with a simplified iHMMune-align model
Genotyping - evaluation
Once the genotype is determined…
• Re-identify germline genes in the sequence set, using only germline genes present in the genotype (iHMMune-align)
Determination of phased haplotypes
Only possible for subjects heterozygous at the IGHJ4 or IGHJ6 loci
IGHV1-2*01 IGHJ6*02
IGHV1-2*01 IGHJ6*02
IGHV1-2*04 IGHJ6*03
IGHV1-2*04 IGHJ6*03
IGHV1-2*04 IGHJ6*03
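The idea behind the pairs on this slide is that a rearrangement joins V, D and J segments from the same chromosome, so in a subject heterozygous at an IGHJ locus the IGHJ allele tags the chromosome each IGHV allele came from. A minimal sketch of that bookkeeping follows; the function name, the `min_fraction` threshold and the "ambiguous" label are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def phase_by_j(pairs, min_fraction=0.8):
    """Assign each V allele to the chromosome tagged by the J allele it mostly pairs with.

    pairs: iterable of (v_allele, j_allele), one per rearranged read.
    """
    counts = defaultdict(Counter)
    for v_allele, j_allele in pairs:
        counts[v_allele][j_allele] += 1

    haplotypes = defaultdict(list)
    for v_allele, j_counts in counts.items():
        top_j, n = j_counts.most_common(1)[0]
        if n / sum(j_counts.values()) >= min_fraction:
            haplotypes[top_j].append(v_allele)
        else:
            # Seen with both J alleles: present on both chromosomes, or noise.
            haplotypes["ambiguous"].append(v_allele)
    return dict(haplotypes)

# The five rearrangements shown on this slide.
observed = [("IGHV1-2*01", "IGHJ6*02")] * 2 + [("IGHV1-2*04", "IGHJ6*03")] * 3
print(phase_by_j(observed))
# {'IGHJ6*02': ['IGHV1-2*01'], 'IGHJ6*03': ['IGHV1-2*04']}
```

With real data, read counts per allele vary widely and sequencing or assignment errors blur the pairing, so a fixed threshold is only a starting point.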
Automated classification
5. Multinomial Logistic Regression for the Identification of Immunoglobulin Haplotypes
Figure 5.4: Classification error rates of different algorithms for IGHD haplotyping. The error rates of using 'Counts of Sequences' (CoS), 'Binomial Probabilities' (BP) and 'Counts of Sequences Plus Binomial Probabilities' (CoSBP) as attributes in the classifications were also compared.
[…] respectively. The classification correctness given by different algorithms using different attributes is shown in Table 5.5.
Figure 5.4 compares the performance of Logistic Regression, Linear Regression, SVM and Decision Tree. The logistic regression using 'Binomial Probabilities' (BP) as classification attributes gave the best classification.
Table 5.6 shows the difference of haplotypes identified by manual and automatic haplotyping. Excellent agreement was observed between manually and […]
Figure 5.3: Classification error rates of different algorithms for IGHV haplotyping. Logistic regression, linear regression and J48 decision tree's performances were compared. The error rates of using 'Counts of Sequences' (CoS), 'Binomial Probabilities' (BP) and 'Counts of Sequences Plus Binomial Probabilities' (CoSBP) as attributes in the classifications were also compared.
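The excerpt does not say how the 'Binomial Probabilities' attributes were computed. One plausible reading, offered purely as an assumption, is that for each allele the number of reads paired with one IGHJ allele is scored under a few binomial hypotheses, and those scores are then passed to a classifier. The hypothesis names and the `error_rate` value below are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def haplotype_attributes(k_with_j_a, n_total, error_rate=0.05):
    """Score the observed pairing counts under three hypothetical scenarios.

    k_with_j_a: reads of this allele paired with IGHJ allele A.
    n_total:    all reads of this allele.
    """
    return {
        "on_chromosome_a_only": binomial_pmf(k_with_j_a, n_total, 1 - error_rate),
        "on_chromosome_b_only": binomial_pmf(k_with_j_a, n_total, error_rate),
        "on_both_chromosomes": binomial_pmf(k_with_j_a, n_total, 0.5),
    }

attrs = haplotype_attributes(k_with_j_a=10, n_total=10)
print(max(attrs, key=attrs.get))  # on_chromosome_a_only
```

Feature vectors of this kind could then be fed to a logistic regression or any of the other classifiers compared in the figures.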
IGHD Haplotypes
IGHV Haplotypes
Ambiguity
Duplication
D Deletion
The team…
• BABS, UNSW
– Marie Kidd
– Yan Wang
– Mark Tanaka
– Andrew Collins
• CSE, UNSW
– Zhiliang Chen
– Bruno Gaëta
• Pathology, Stanford
– Lyndon Zhang
– Katherine Jackson
– Scott Boyd
– Andrew Fire