Identification of human-to-human transmissibility factors
in PB2 proteins of influenza A by large-scale mutual information analysis
Sixth International Conference on Bioinformatics (InCoB2007) Hong Kong, 28th August 2007
Olivo Miotto
Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore
AT Heiny, Tan Tin Wee: Yong Loo Lin School of Medicine, National University of Singapore
J Thomas August: Johns Hopkins University School of Medicine
Vladimir Brusic: Cancer Vaccine Center, Dana-Farber Cancer Inst.
Page 2
Outline
Background
Mutual Information Analysis
Materials and Methods
Results
Discussion and conclusions
Page 3
Section 1: Background
Page 4
Avian Flu: is the pandemic coming?
Can H5N1 viruses spread amongst humans?
Page 5
1918-1919
Page 6
The Influenza A Virus
[Figure: influenza A virion, with labels for viral RNA and matrix protein]
Serological subtyping:
- HA (Haemagglutinin): 16 subtypes
- NA (Neuraminidase): 9 subtypes
Source: http://www.roche.com/pages/facets/10/viruse.htm
Page 7
Avian vs Human Influenza
Wild Waterfowl
- Natural pool
- Over 100 subtypes observed
- Affects the digestive tract
- Often asymptomatic
Humans
- Only 4 subtypes transmitted human-to-human (H2H)
- Avian-to-human (A2H) infection in a small number of subtypes
- Affects the respiratory tract
Page 8
Influenza Circulation
[Figure: influenza circulation between host groups]
- Wild waterfowl: avian-to-avian (A2A) circulation, > 100 subtypes
- Humans: human-to-human (H2H) circulation, only 4 subtypes
- Domestic poultry
- Swine
cf. Webster RG et al. (1992). Microbiol Rev. 56(1), 152-179.
Page 9
Avian origins of pandemic strains
From: Belshe RB (2005) N Engl J Med. 2005;353:2209-2211.
Page 10
Pressing Questions
- What are the mechanisms of adaptation to human hosts?
- Which genes/products are involved?
- Can we identify mutations responsible for the capability to infect humans?
- Can we identify mutations responsible for adaptation to human-to-human transmission?
- Can we elucidate the role of such mutations?
- Can we assess the pandemic potential of current H5N1 (and/or other strains)?
Page 11
Study goals
- Analyze all influenza protein sequence data available
  - Historical data
  - Whole genome
- Use statistical approaches to identify amino acid sites that characterize H2H transmissibility
  - Compare H2H with non-H2H (A2A)
  - Create an "adaptation map"
- Use the information acquired to characterize individual isolates and strain evolution
  - Map out the emergence of characteristic mutations
  - Assess a strain's potential for H2H transmissibility
Page 12
Why PB2?
- Initial study performed on the PB2 protein
  - Internal protein, component of RNP
  - Some experimentally determined functional regions
- Well-known E627K mutation is involved in mammalian and cold-temperature adaptation
From: http://www.omedon.co.uk/influenza/influenza/
Subbarao EK, London W, Murphy BR (1993) J Virol, 67(4), 1761-1764.
Page 13
Section 2: Mutual Information Analysis
Page 14
Information Theory
Information entropy H is a measure of uncertainty:
$$H = -\sum_{e \in E} p_e \log_2 p_e$$
where e is an event from a possible set E, and p_e is the probability of e occurring
- Lower entropy -> more predictable outcome
- Entropy is affected by
  - the number of outcomes
  - their relative probabilities
Shannon CE (1948) Bell System Tech J, 27: 379-423,623-656.
Page 15
Entropy in multiple alignments
In a multiple sequence alignment, we can treat
- each alignment site as a separate "variable"
- each observed residue at that site as a separate "event"
- the "event probability" as the percentage of sequences in the alignment that contain the residue
H = 0 at fully conserved positions
- Single, 100% predictable outcome
H increases when
- several residues are observed at the same position, and/or
- their probability is evenly distributed
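A minimal sketch of per-site entropy, assuming base-2 logarithms and equal-length aligned sequences held as plain strings. The names `site_entropy` and `alignment_entropy` and the four-sequence toy alignment are illustrative only.

```python
from collections import Counter
from math import log2


def site_entropy(residues):
    """Shannon entropy (bits) of the residues observed at one alignment site."""
    counts = Counter(residues)
    total = sum(counts.values())
    # p * log2(1/p) summed over observed residues; equals -sum(p * log2(p))
    return sum((n / total) * log2(total / n) for n in counts.values())


def alignment_entropy(sequences):
    """Entropy of every column of a list of equal-length aligned sequences."""
    return [site_entropy(column) for column in zip(*sequences)]


# Toy alignment (hypothetical): two conserved sites, then a site with four
# equally frequent residues, then a site split evenly between two residues.
toy_alignment = ["GETN", "GEKN", "GEQS", "GEAS"]
print(alignment_entropy(toy_alignment))  # [0.0, 0.0, 2.0, 1.0]
```

The conserved sites score 0, and the site with four evenly distributed residues scores highest, matching the two factors listed above.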
Page 16
Entropy is a measure of diversity
Both full sequences and sequence fragments can be used in entropy computation
[Figure: entropy of the Influenza A PB2 protein, based on an alignment of 3132 sequences]
Page 17
Entropy in Sequence Alignments
S G W K E E L A V N Q P V Q E F E T F E I E
W E E K E E F A V Y I P L Q P F L T F G R L
G E S P E E N F V N V P H Q Y F Y T V E P M
G E S L E E A S V N G P F Q Y F Y T V E C L
W E S K E E N A V N V P H Q K F F T V L T M
T E N P E E E L F K V P F R V F F S L S H Y
K E T N E E P W F K K P M R E F Y S A W G L
G E T N E E E A F N V P R R V F F S V S N L
G E K N E E E A F K L P F R E F Y S V Q R V
E E Q S E S A E S Q Q P E E P F Y Q I L E L
G E Q V E S S E S Q E P H E E F Y Q I R T L
G E K Q E S S S S Y E P K E E F A Q C V L L
R E A Q E S Q A S N V P M E T F Y Q V R T L
H E R V E S A A S N V P M E T F Y Q I A E L
R E C H E V K A Q Y V P M L E F Y Q V K P W
G E S S E V A A Q N V P M L W F Y Q R H V M
G E A S E V E H Q N V P H L K F Y Q E G P P
Column markers: M M Z Z Z H H
Z = zero entropy, M = medium entropy, H = high entropy
Page 18
Comparing Alignments
AVIAN sequences
G E T N E E E A F N V P R R V F F S V S N L
G E T N E E E A F N V P R R V F F S V S N I
G E T N E E E A W N V P R R V F F S I S N L
G E T N E E E A F N V P R R V F F S V S N L
G E T N E E E A F N V P R R V F F S V S N I
G E T N E E E A F N V P R R V F F S I S N L
G E T N E E E A W N V P R R V F F S V S S L
G E T N E E E A F N V P R R V F F S V S S L
HUMAN sequences
G E V N E D E A F N V P R R V F F S A S N L
G E V N E D E A F N V P R R V F F S A S S I
G E V N E D E A F N V P R R V F F S A S N L
G E V N E D E A F N V P R R V F F S A S N I
G E G N E D E A F S V P R R V F F S A S N I
G E G N E D E A F S V P R R V F F S A S S I
G E G N E D E A F S V P R R V F F S A S N I
G E G N E D E A F S V P R R V F F S A S N L
Column markers: C C Z Z N N
C = characteristic sites, Z = zero entropy, N = non-characteristic
Page 19
Mutual Information
Mutual Information (MI) uses information entropy to measure the relationship between two variables
- The higher the MI, the more information about variable A can be obtained by knowing the value of variable B
$$MI(A,B) = H(A) + H(B) - H(A,B)$$
where H(A) and H(B) are the entropies of A and B, and H(A,B) is the joint entropy of A and B
- Joint entropy is computed by considering each combination of the two variables as a separate outcome
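A short sketch of MI computed from entropies as MI(A,B) = H(A) + H(B) - H(A,B), assuming base-2 logarithms and paired observations given as two equal-length sequences. The function names and the toy inputs are illustrative, not taken from the AVANA tool.

```python
from collections import Counter
from math import log2


def entropy(outcomes):
    """Shannon entropy (bits) of a sequence of hashable outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return sum((n / total) * log2(total / n) for n in counts.values())


def mutual_information(a, b):
    """MI(A,B) = H(A) + H(B) - H(A,B) for paired observations a[i], b[i]."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of observations")
    # Joint entropy: each (a, b) combination is treated as a separate outcome.
    joint = list(zip(a, b))
    return entropy(a) + entropy(b) - entropy(joint)


# Fully dependent variables: knowing b determines a.
print(mutual_information("RRRRTTTT", "AAAAHHHH"))  # 1.0
# Independent variables: b says nothing about a.
print(mutual_information("RTRT", "AAHH"))  # 0.0
```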
Page 20
Using MI to detect Characteristic Sites
At a characteristic site, the residue observed is strongly associated with a set of sequences
- E.g.: Arg -> Avian, Thr -> Human
This association is explored by measuring the MI of
- the residue observed at a site
- the label of the set in which it is observed
MI is in the range 0 to 1.0
- MI = 0.0 -> no statistical association between the residue observed and the set in which it occurs
- MI = 1.0 -> residues observed in one set are never observed in the other, and vice versa
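A sketch of the site-versus-label MI scan, run on the two eight-sequence toy alignments from the "Comparing Alignments" slide. Assumptions: base-2 logarithms, the first toy block taken as the avian set, and the 0.4 cut-off borrowed from the Methods slide. With two equally sized sets, MI tops out at 1.0. Function and variable names are illustrative.

```python
from collections import Counter
from math import log2

AVIAN = [
    "GETNEEEAFNVPRRVFFSVSNL", "GETNEEEAFNVPRRVFFSVSNI",
    "GETNEEEAWNVPRRVFFSISNL", "GETNEEEAFNVPRRVFFSVSNL",
    "GETNEEEAFNVPRRVFFSVSNI", "GETNEEEAFNVPRRVFFSISNL",
    "GETNEEEAWNVPRRVFFSVSSL", "GETNEEEAFNVPRRVFFSVSSL",
]
HUMAN = [
    "GEVNEDEAFNVPRRVFFSASNL", "GEVNEDEAFNVPRRVFFSASSI",
    "GEVNEDEAFNVPRRVFFSASNL", "GEVNEDEAFNVPRRVFFSASNI",
    "GEGNEDEAFSVPRRVFFSASNI", "GEGNEDEAFSVPRRVFFSASSI",
    "GEGNEDEAFSVPRRVFFSASNI", "GEGNEDEAFSVPRRVFFSASNL",
]


def entropy(outcomes):
    """Shannon entropy (bits) of a sequence of hashable outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return sum((n / total) * log2(total / n) for n in counts.values())


def site_mi(set_a, set_b):
    """MI between residue and set label, for every site of two aligned sets."""
    labels = ["A"] * len(set_a) + ["B"] * len(set_b)
    scores = []
    for column in zip(*(set_a + set_b)):
        joint = list(zip(column, labels))
        scores.append(entropy(column) + entropy(labels) - entropy(joint))
    return scores


scores = site_mi(AVIAN, HUMAN)
# 1-based positions whose MI exceeds the cut-off
characteristic = [i + 1 for i, mi in enumerate(scores) if mi > 0.4]
print(characteristic)  # [3, 6, 19]
```

Sites 3, 6 and 19 never share a residue between the two sets, so they reach the maximum MI. Site 10 (N in one set, N or S in the other) scores about 0.31 and falls below the cut-off.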
Page 21
[Figure: entropy and MI plotted along the PB2 protein for the A2A (719 sequences) and H2H (1650 sequences) sets]
Spikes indicate characteristic sites
Page 22
Section 3: Materials and Methods
Page 23
The Antigenic Variability Analyzer (AVANA)
Page 24
Source Sequences
Comprehensive set of PB2 proteins
- 3,132 protein sequences with accompanying metadata:
  - Host
  - Subtype
  - Country of isolation
  - Year of isolation
- Extracted from NCBI Protein and Nucleotide databases (all proteins: > 40,000 sequences)
- Automated aggregation, metadata extraction and metadata cleaning using the ABK software
- Multiple sequence alignment (MSA) using MUSCLE 3.6
- Manually verified and corrected metadata and MSA
Page 25
Datasets
Three subsets produced for comparison
- A2A: avian sequences for all subtypes, except H5N1 and the subtypes that circulate amongst humans (H1N1, H2N2, H3N2, H1N2)
- H1N1H: human sequences for H1N1
- HxN2H: human sequences for H2N2, H3N2, H1N2
To retain alignment, subsets are extracted from a single MSA
H1N1 and HxN2 are separate co-circulating lineages
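A sketch of how the three subsets could be pulled from one MSA using host and subtype metadata. The record layout (dicts with `host`, `subtype` and `aligned` keys) and the host labels "avian" and "human" are assumptions made for illustration; the actual ABK/AVANA data model is not shown in these slides.

```python
HUMAN_CIRCULATING = {"H1N1", "H2N2", "H3N2", "H1N2"}
HXN2 = {"H2N2", "H3N2", "H1N2"}


def assign_subset(host, subtype):
    """Return 'A2A', 'H1N1H', 'HxN2H', or None if the record is in no subset."""
    host = host.lower()
    if host == "avian":
        # A2A excludes H5N1 and the subtypes that circulate amongst humans.
        if subtype != "H5N1" and subtype not in HUMAN_CIRCULATING:
            return "A2A"
        return None
    if host == "human":
        if subtype == "H1N1":
            return "H1N1H"
        if subtype in HXN2:
            return "HxN2H"
    return None


def split_msa(records):
    """Group already-aligned sequences by subset, keeping alignment columns."""
    subsets = {"A2A": [], "H1N1H": [], "HxN2H": []}
    for record in records:
        name = assign_subset(record["host"], record["subtype"])
        if name is not None:
            subsets[name].append(record["aligned"])
    return subsets
```

Because every subset is cut from the same MSA, a given column index refers to the same site in all three, which is what allows site-by-site comparison.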
Webster RG et al. (1992). Microbiol Rev. 56(1), 152-179.
Page 26
Identification of characteristic sites
Compare each of H1N1H, HxN2H against A2A
1. Pick sites with high MI (> 0.4)
2. Identify characteristic variants of human transmission:
   - At least 4x more frequent in human than in avian set
   - Appear in at least 2% of human sequences
3. Identify avian characteristic variants
4. Discard site if > 5% of human sequences contain avian variants
   - All sites with > 2% avian variants were verified by hand
Merge catalogues of sites for H1N1H and HxN2H
- Keep only sites that are present in both catalogues
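The selection steps above can be sketched as a per-site filter. Assumptions: the MI value is computed elsewhere and passed in; avian characteristic variants (step 3) are found with the mirror image of the human criteria, since the slide does not state them; the manual verification in step 4 is not modelled. All names are hypothetical.

```python
from collections import Counter

MI_THRESHOLD = 0.4         # step 1
FOLD_ENRICHMENT = 4.0      # step 2: at least 4x more frequent
MIN_FREQUENCY = 0.02       # step 2: in at least 2% of sequences
MAX_CROSS_PRESENCE = 0.05  # step 4


def frequencies(column):
    """Fraction of sequences carrying each residue at one site."""
    counts = Counter(column)
    return {res: n / len(column) for res, n in counts.items()}


def enriched_variants(target_col, other_col):
    """Residues frequent in target_col and at least 4x rarer in other_col."""
    target, other = frequencies(target_col), frequencies(other_col)
    return {
        res for res, freq in target.items()
        if freq >= MIN_FREQUENCY and freq >= FOLD_ENRICHMENT * other.get(res, 0.0)
    }


def is_characteristic(mi, human_col, avian_col):
    """Apply steps 1-4 to a single alignment site."""
    if mi <= MI_THRESHOLD:
        return False
    human_variants = enriched_variants(human_col, avian_col)
    avian_variants = enriched_variants(avian_col, human_col)  # assumption: mirrored
    if not human_variants or not avian_variants:
        return False
    cross = sum(1 for res in human_col if res in avian_variants) / len(human_col)
    return cross <= MAX_CROSS_PRESENCE


def merge_catalogues(h1n1h_sites, hxn2h_sites):
    """Keep only the sites present in both catalogues."""
    return sorted(set(h1n1h_sites) & set(hxn2h_sites))
```

For example, a site where 99% of human sequences carry K and 98% of avian sequences carry E passes, while the same site with 10% of human sequences carrying E is discarded at step 4.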
Page 27
Section 4: Results
Page 28
Results: 17 characteristic sites
Position  Char. variants  1st human  Conservation      X-presence
          A2A     H2H     isolate    A2A      H2H      of A2A
9         DE      NT      1933       98.57%   99.33%   0.49%
44        A       S       1940       96.82%   99.27%   0.61%
64        M       T       1933       97.29%   99.58%   0.30%
81        T       MV      1933       97.93%   99.27%   0.30%
105       TA      VM      1933       98.41%   99.45%   0.36%
199       A       S       1918       99.47%   99.76%   0.24%
271       TI      A       1940       98.59%   99.51%   0.37%
292       IV      T       1940       95.54%   99.15%   0.67%
368       R       K       1940       98.12%   99.33%   0.67%
475       L       M       1918       99.66%   99.76%   0.24%
567       DE      N       1918       98.28%   99.39%   0.55%
588       AV      I       1940       98.45%   99.63%   0.31%
613       VA      TI      1940       98.28%   99.32%   0.61%
627       E       K       1918       99.31%   99.76%   0.12%
661       A       T       1933       86.72%   99.39%   0.43%
674       AS      T       1933       95.69%   99.63%   0.18%
702       K       R       1918       89.70%   99.39%   0.49%
Further columns on the slide: Naffakh 2000, Chen 2006
Chen GW et al. (2006) Emerg Infect Dis 12(9), 1353-1360.
Naffakh N et al. (2000) J Gen Virol, 81, 1283-1291.
Page 29
Functional Atlas of PB2 Adaptations
[Figure: the 17 characteristic sites (9, 44, 64, 81, 105, 199, 271, 292, 368, 475, 567, 588, 613, 627, 661, 674, 702) mapped along the PB2 sequence, with the A2A and H2H variants at each site, against the functional regions: PB1 binding, NP binding, RNA cap binding, Nuclear Localization Signal]
Source: http://www-micro.msb.le.ac.uk/3035/Orthomyxoviruses.html
Page 30
Reconstructing adaptation timelines
Characteristic sites can show an "adaptation signature"
- A summary of mutations necessary for H2H adaptation
We can then characterize any PB2 sequence at these sites
Spanish Flu - H1N1 A/Brevig Mission/1/1918
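A sketch of reading a signature off one sequence, using the 17 sites and variants from the results table. Assumptions: the sequence is ungapped and uses the same position numbering as the table, and residues matching neither set are marked `?`. The demo sequence is a synthetic placeholder, not real isolate data.

```python
# position: (A2A variants, H2H variants), from the results table
SITES = {
    9: ("DE", "NT"), 44: ("A", "S"), 64: ("M", "T"), 81: ("T", "MV"),
    105: ("TA", "VM"), 199: ("A", "S"), 271: ("TI", "A"), 292: ("IV", "T"),
    368: ("R", "K"), 475: ("L", "M"), 567: ("DE", "N"), 588: ("AV", "I"),
    613: ("VA", "TI"), 627: ("E", "K"), 661: ("A", "T"), 674: ("AS", "T"),
    702: ("K", "R"),
}


def signature(sequence):
    """One mark per characteristic site: 'H' (H2H), 'A' (A2A) or '?'."""
    marks = []
    for position in sorted(SITES):
        a2a, h2h = SITES[position]
        residue = sequence[position - 1]  # table positions are 1-based
        if residue in h2h:
            marks.append("H")
        elif residue in a2a:
            marks.append("A")
        else:
            marks.append("?")
    return "".join(marks)


# Synthetic placeholder: avian variants at every site, then E627K applied.
demo = ["X"] * 759
for position, (a2a, _h2h) in SITES.items():
    demo[position - 1] = a2a[0]
demo[627 - 1] = "K"
print(signature("".join(demo)))  # AAAAAAAAAAAAAHAAA
```

Counting the `H` marks gives the number of H2H mutations a sequence carries, which is how isolates are compared in the timelines that follow.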
Page 31
Human Timeline over 3 pandemics
[Figure: PB2 signatures of human isolates across H1N1 (1918-1957), H2N2 (1957-1968) and H3N2 (1968-now)]
- 1918: Mostly avian signature
- 1940s: Fully H2H signature
- 1957, 1968: No disruption by pandemics: no introduction of avian PB2 protein
- Remarkable stability, to present day
- Sporadic avian/swine infections
Page 32
Swine Influenza Timeline
- Evidence of avian and human mutations
- Supports role of swine as “mixing vessel”
Page 33
H5N1: Timeline 1997-2006
- Presents H2H mutations more frequently than other avian strains
- H2H mutations usually do not persist
- H5N1 not “becoming” H2H
Page 34
Section 5: Discussion and conclusions
Page 35
Discussion: Methodology
- Detection of characteristic sites by MI has greater resolving power than previous approaches
  - Allows multiple characteristic variants at a site
- MI method allows large-scale analysis
  - Thousands of sequences, strong support for findings
  - Fragments can also be used
- Sequence signatures are effective for recapitulating strain characteristics and understanding trends
- Good metadata is necessary for quality analysis
  - Luckily, this is largely available for influenza
  - Other viruses have poorer coverage
Page 36
Discussion: Human Sequences
- H2H variants show remarkable historical stability
  - Resilience to HA and NA changes suggests limited interplay in adaptation between internal and external proteins
- Location of characteristic sites in binding domains suggests complex interactions are involved in adaptation to H2H transmission
  - Cataloguing characteristic sites in other RNP proteins may shed new light on their roles
- Both current lineages of PB2 (H1N1, HxN2) have evolved from the same source (1918 Spanish Flu)
  - No evidence of PB2 interchange between the two lineages
Page 37
Discussion: Avian Sequences
- Avian strains rarely show any H2H mutation
  - 77% contain none (H5N1 excluded)
  - Only one sequence had 3 out of 17 mutations
- Spanish Flu had 5 H2H mutations
  - Could be the minimum set, probably not optimal
- H5N1 repeatedly exhibits H2H mutations, but they do not “stick”
  - May account for its ability to jump the species barrier
  - May indicate that H5N1 PB2 is far from suited for H2H
  - Even the E627K mutation was not conserved
- Reassortment is still possible, but how pathogenic?
Page 38
Future Developments
- Full catalogue of influenza characteristic sites
  - Preliminary results (characteristic site count):
      NP      18    M1   3
      PA      19    M2   10
      PB1     1     NS1  9
      PB1-F2  3     NS2  3
- Characterization of subgroups of influenza
- Application of the method to other viruses
- Release of AVANA tool
Page 39
Acknowledgements and Thanks
Institute of Systems Science, NUS
- Funding support for this conference
Asif M Khan, KN Srinivasan
- Testing and feedback on AVANA tool
Partial Grant Support:
- National Institute of Allergy and Infectious Diseases, NIH: Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C
- ImmunoGrid Project: EC Contract FP6-2004-IST-4, No. 028069