Post on 20-Dec-2015
GENOME SIGNATURES OF MICROBIAL ORGANISMS IDENTIFIED BY AMINO ACID N-GRAM ANALYSIS
B. Suman Bharathi
Advisor: Judith Klein-Seetharaman
Forschungszentrum, Juelich, Germany
Genome Signatures
• Short peptide sequences that occur with unusually high frequency in one particular organism or pathogen but not in others
• Potential applications:
– Drug development: synthesize drugs that target the genome signature in a pathogen
– Sensor development: use the genome signature to identify an organism quickly with an antibody
[Figure: the peptide MPSE shown repeatedly; panels labeled Neisseria meningitidis and Homo sapiens]
Approach
• Linguistic approach
• N-gram analysis using toolkit
• What the BLMT toolkit provides
• N-gram statistical analysis
• Definition of signature sequences
• Use of toolkit on Neisseria meningitidis
Neisseria meningitidis versus other species (n = 4)
n-gram = sequence of length n
[Figure: occurrence of n-gram (%), axis from 0 to 0.09, for the 4-grams AAAA, AAAL, SDGI, LAAA, MPSE, ALAA, LAAL, AALA, AVAA, AALL, AAAV, GRLK, LLAA, EAAA, AEAA, AAEA, ALLA, AAVA, AVLA, AAAE]
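The occurrence percentage plotted on this slide can be illustrated with a minimal sketch. It assumes overlapping n-grams are counted within each protein sequence and that the percentage is taken over all n-grams in the organism; the function name `ngram_occurrence` is hypothetical and is not part of the BLMT toolkit.

```python
from collections import Counter

def ngram_occurrence(proteins, n=4):
    """Return {n-gram: percentage of all n-grams} for a list of protein sequences.

    Assumption: n-grams overlap and never span two different proteins.
    """
    counts = Counter()
    for seq in proteins:
        # Slide a window of width n along the sequence, one residue at a time.
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: 100.0 * c / total for gram, c in counts.items()}

# Toy input only: the repeated peptide shown on the "Genome Signatures" slide.
occurrence = ngram_occurrence(["MPSEMPSE"], n=4)
print(occurrence["MPSE"])
```

On this toy string, MPSE accounts for two of the five overlapping 4-grams.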
Use of BLMT
• N-gram statistical analysis gives detailed statistics: the frequency of each n-gram together with its mean and standard deviation.
• We have taken 45 organisms into consideration: bacteria, archaea, mycoplasmas and human.
• Search for n-grams whose frequencies lie many standard deviations away from the mean.
• This indicates the difference between expected and observed n-gram frequencies.
• It shows how unusual the n-gram is in one organism compared with the others.
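The slides do not spell out how the mean and standard deviation are computed, so the following is only a sketch under a stated assumption: for one n-gram, the mean and (population) standard deviation are taken over its frequencies in all the other organisms, and the target organism is scored against them. The function name `std_devs_from_mean` is hypothetical.

```python
from statistics import mean, pstdev

def std_devs_from_mean(freqs, target):
    """Score one n-gram in `target` against all other organisms.

    freqs: {organism name: frequency of the n-gram in that organism}.
    Assumption: mean and standard deviation exclude the target organism.
    Returns None when the other organisms all share the same frequency.
    """
    others = [f for org, f in freqs.items() if org != target]
    mu = mean(others)
    sigma = pstdev(others)
    if sigma == 0:
        return None  # the score is undefined without any spread
    return (freqs[target] - mu) / sigma
```

A large positive score marks an n-gram that is far more frequent in the target organism than in the rest of the set; a large negative score marks one that is far rarer.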
Difference Between Expected and Observed Frequencies
Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)
Positive values indicate over-represented n-grams; negative values indicate under-represented n-grams.
[Figure: difference between expected and observed frequency; axis label: n-gram]
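The sign convention can be made concrete with a small sketch. The slides do not say how the expected frequency is obtained; this example assumes an independence model in which the expected frequency of an n-gram is the product of the frequencies of its individual amino acids. The function name `observed_minus_expected` is hypothetical.

```python
from collections import Counter
from math import prod

def observed_minus_expected(seq, n=4):
    """Return {n-gram: observed frequency - expected frequency}.

    Assumption: expected frequency = product of single-residue frequencies.
    Positive values mean over-represented, negative values under-represented.
    """
    grams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    # Frequency of each amino acid in the whole sequence.
    aa_freq = {a: c / len(seq) for a, c in Counter(seq).items()}
    return {
        g: c / total - prod(aa_freq[a] for a in g)
        for g, c in grams.items()
    }

# Toy input only: every residue has frequency 0.25, yet MPSE is common.
diff = observed_minus_expected("MPSEMPSE")
print(diff["MPSE"] > 0)
```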
Initial points of the difference between expected and observed frequency graph
Ureaplasma shows high difference values (approx. 0.00021), indicating that its n-grams are over-represented relative to their expected probability of occurrence in the organism.
Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)
Standard deviations away from the mean
• Mycoplasma genitalium (black)
• M. tuberculosis (red)
• M. leprae (green)
• Mesorhizobium (blue)
• Lactococcus (yellow)
The distribution of n-gram standard deviations includes both high and low difference values, indicating over-represented and under-represented n-grams.
Highest standard deviations away from the mean
Shows the initial (highest) values of standard deviations away from the mean. The n-grams of M. tuberculosis have much higher values than those of M. leprae.
Mycoplasma genitalium (black), M. tuberculosis (red), M. leprae (green), Mesorhizobium (blue), Lactococcus (yellow)
Comparison of genome size with varying standard deviations
• Examine the relationship between genome size and the distribution of n-gram standard deviations for each organism.
• Human genome taken as reference.
• Compare genome size and standard deviations within the same genus but across different species.
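One possible way to quantify such a relationship, which the slides themselves do not use, is a rank correlation between genome size and each organism's highest n-gram standard deviation. The sketch below assumes no tied values; the names `ranks` and `spearman` are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumption: no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(sizes, peak_devs):
    """Spearman rank correlation between two equally long lists.

    sizes: genome size per organism.
    peak_devs: highest n-gram standard deviation per organism, same order.
    """
    n = len(sizes)
    if n != len(peak_devs) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(sizes), ranks(peak_devs)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value near +1 or -1 would suggest that the standard deviations track genome size; a value near 0 would suggest no monotonic relationship.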
Size Distribution of Genomes
1. Human 22889476
2. Bacteria_Mesorhizobium_loti 4080256
3. Bacteria_Pseudomonas_aeruginosaPAO1 3730192
4. Bacteria_Escherichia_coliO157H7 3229098
5. Bacteria_Escherichia_coliO157H7EDL933 3228100
6. Bacteria_Escherichia_coliK12 2726558
7. Bacteria_Mycobacterium_tuberculosisH37Rv 2666338
8. Bacteria_Bacillus_subtilis 2442200
9. Bacteria_Bacillus_halodurans_C125 2384352
10. Bacteria_SynechocystisPCC6803 2072748
11. Bacteria_Vibrio_cholerae_chr1 1725852
12. Bacteria_Deinococcus_radioduransR1_chr1 1559376
13. Bacteria_Xylella_fastidiosa 1490262
14. Archaea_Archaeoglobus_fulgidus 1343990
15. Bacteria_Pasteurella_multocida 1340102
16. Bacteria_Lactococcus_lactis_subsp_lactis 1335222
17. Archaea_Aeropyrum_pernix 1280062
18. B_Neisseria_meningitidis_serogroupBstrainMC58 1178096
19. Archaea_Halobacterium_spNRC1 1178038
20. B_Neisseria_meningitidis_serogroupAstrainZ2491 1176104
21. Bacteria_Thermotoga_maritima 1167344
22. Bacteria_Pyrococcus_horikoshiiOT3 1141216
23. Bacteria_Mycobacterium_leprae_strainTN 1080756
24. A_Methanobacterium_thermoautotrophicum_deltaH 1054752
25. Bacteria_Haemophilus_influenzaeRd 1045572
26. Bacteria_Campylobacter_jejuni 1020944
27. Bacteria_Helicobacter_pylori_strainJ99 990942
28. Bacteria_Helicobacter_pylori26695 986258
29. Archaea_Methanococcus_jannaschii 970558
30. Bacteria_Aquifex_aeolicus 968068
31. Archaea_Thermoplasma_acidophilum 909164
32. Archaea_Thermoplasma_volcanium 903228
33. Bacteria_Chlamydophila_pneumoniaeJ138 735350
34. Bacteria_Chlamydophila_pneumoniaeCWL029 725492
35. Bacteria_Chlamydophila_pneumoniaeAR39 729896
36. Bacteria_Treponema_pallidum 703414
37. Bacteria_Chlamydia_muridarum 646712
38. Bacteria_Chlamydia_trachomatis 626142
39. Bacteria_Rickettsia_prowazekii_strain_MadridE 559828
40. Bacteria_Mycoplasma_pneumoniae 480870
41. Bacteria_Ureaplasma_urealyticum 457608
42. Bacteria_Buchnera_sp_APS 371470
43. Mycoplasma genitalium 352826
44. Bacteria_Borrelia_burgdorferi 300106
Genome size graph and varying standard deviation values
The organisms are listed in descending order of genome size. The relationship between the distribution of n-gram standard deviations and genome size is compared.
• Human (black, 22889476)
• Mesorhizobium (red, 4080256)
• P. aeruginosa (green, 3730192)
• E_coliO157H7 (blue, 3229098)
• E_coliO157H7EDL933 (yellow, 3228100)
Tail end of genome size and n-gram distribution of standard deviations
The human genome, though the largest, has n-grams that lie fewer standard deviations from the mean than those of smaller genomes.
Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E_coliO157H7 (blue, 3229098), E_coliO157H7EDL933 (yellow, 3228100)
Initial points: genome size and n-gram distribution of standard deviations
Human n-gram standard deviation values are almost equal to those of Mesorhizobium, although Mesorhizobium has a much smaller genome.
Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E_coliO157H7 (blue, 3229098), E_coliO157H7EDL933 (yellow, 3228100)
Genome size and n-gram distribution of standard deviations
M. tuberculosis has very high n-gram standard deviation values. They exceed those of human, despite its smaller genome size.
• Human (black, 22889476)
• E_coliK12 (red, 2726558)
• M. tuberculosis (green, 2666338)
• B. subtilis (blue, 2442200)
• B. halodurans (yellow, 2384352)
• Synechocystis (brown, 2072748)
Initial points of genome size and n-gram distribution of standard deviations
The thickness of each line indicates genome size. The thinnest line represents E_coliK12.
Mycobacterium tuberculosis shows the highest values.
Human (black, 22889476), E_coliK12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)
Final points of genome size and n-gram distribution of standard deviations
M. tuberculosis and all other organisms shown here have n-grams with higher difference values than human.
Human (black, 22889476), E_coliK12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)
Same genus / different species
• 4-grams in M. tuberculosis lie many more standard deviations from the mean than those in M. leprae
Mycobacterium
M. tuberculosis: GAGG 179, GGAG 175, GNGG 102, GGNG 79, AAAA 68, AGGA 65, GTGG 58, GGTG 55, GDGG 46, GGDG 42, LAAA 37, GSGG 32, GGSG 31, NGGA 30, ALAA 29, NGGN 29, AGGN 26, GVGG 25, GGVG 25, AALA 24, VAAA 23
M. leprae: AAAA 47, LAAA 39, AALA 32, AVAA 31, AAAV 29, ALAA 28, VAAA 27, VAAL 26, AELA 26, AAVA 25, LAAL 25, ELAA 24, LAGL 22, AAAL 22, TAAA 22, LAEL 21
Other Organisms
Human: EEEE 107, PPPP 95, AAAA 89, SSSS 86, GGGG 63, LLLL 59, QQQQ 55, HTGE 47, GPPG 46, GEKP 40, TGEK 39, EKPY 34, ECGK 32, PPGP 32, PGPP 31, KKKK 30, AMAA 26, RSRS 25, CGKA 25, EEED 24, GKAF 23, EDEE 22, IHTG 22, PPAP 22, DEEE 21
Thermotoga maritima: AMKK 31, EAMK 28, LKEK 28, LEEI 26, EILK 25, GKTT 24, LEEL 24, EILE 24, EKLK 23, EELK 23, LEKL 23, EALK 22, KALE 22, EEIE 22, LKKL 22, LLEK 21
Synechocystis sp.: QAIA 64, LAIA 63, TAIA 61, GDRL 59, AIAA 49, EAIA 47, GDRQ 46, AIAV 44, AAIA 42, AIAK 39, AIAL 39, GAIA 36, VAIA 36, AIAD 30, AIAS 29, EPEP 27, AIAG 27, PEPE 27, AIAE 26, AIAI 26, KAIA 23, AIAR 23, LGDR 22, MAIA 22
Neisseria meningitidis: SDGI 55, MPSE 50, AAAA 49, GRLK 34, AAAL 32, LAAA 26, AVAA 24, ALAA 24, AAAV 23, FQTA 23, AAEA 23, EAAA 22, QTAL 21, AVAM 21
Haemophilus influenzae: LTAL 75, KSAV 45, TALL 40, AMKK 37, TALS 32, SAVK 31, KAMK 30, ESAV 30, STAL 28, SAVE 27, KKAM 27, TALF 26, LSGG 22, QSAV 21, KLTA 21, GKST 21
Conclusions
• n-grams that are at least 30 standard deviations away from the mean are significant candidates for genome signatures.
• Difference graphs: estimate how likely an n-gram is to be observed in an organism.
• Genome size graphs: there is no specific relationship between the size of a genome and its standard deviation values.
• Same genus and different species, where genome size is specified: there is a noticeable difference between the Mycobacterium species (M. leprae and M. tuberculosis).
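The 30-standard-deviation cut-off from the first conclusion can be sketched as a simple filter. The example assumes that the numbers listed on the Neisseria meningitidis slide are standard deviations from the mean, which the slides imply but do not state; the function name `signature_candidates` is hypothetical.

```python
def signature_candidates(scores, threshold=30.0):
    """Return (n-gram, score) pairs at least `threshold` away from the mean.

    scores: {n-gram: standard deviations from the mean (may be negative)}.
    Results are sorted by distance from the mean, largest first.
    """
    hits = [(g, s) for g, s in scores.items() if abs(s) >= threshold]
    return sorted(hits, key=lambda pair: abs(pair[1]), reverse=True)

# First seven entries of the Neisseria meningitidis list from the slides.
# Assumption: these values are standard deviations from the mean.
neisseria = {"SDGI": 55, "MPSE": 50, "AAAA": 49, "GRLK": 34,
             "AAAL": 32, "LAAA": 26, "AVAA": 24}
print(signature_candidates(neisseria))
```

A threshold alone does not guarantee specificity: AAAA, for example, also appears high in the human and M. leprae lists, so a candidate would still need to be checked against the other organisms.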
Current and future work
• Find n-gram signatures in E. coli.
• Explore the relationship between genome size and the distribution of n-gram standard deviations across different species of the same genus.
• Find more specific targets to differentiate species in terms of signature peptides for all 44 organisms taken for study.