GENOME SIGNATURES OF MICROBIAL ORGANISMS IDENTIFIED BY AMINO ACID N-GRAM ANALYSIS B. Suman Bharathi Advisor: Judith Klein-Seetharaman Forschungszentrum, Juelich, Germany

Post on 20-Dec-2015

Genome Signatures

• Peptide sequences that occur with unusually high frequency in a particular organism or pathogen, unlike in other organisms

• Potential applications:

– Drug development: synthesize drugs that target the genome signature in a pathogen

– Sensor development: use the genome signature to identify an organism quickly using an antibody

[Figure: the peptide MPSE shown repeatedly; labels: Neisseria meningitidis, Homo sapiens]

Approach

• Linguistic approach

• N-gram analysis using toolkit

• What the BLMT toolkit provides

• N-gram statistical analysis

• Definition of signature sequences

• Use of toolkit on Neisseria meningitidis

Neisseria meningitidis versus other species

[Figure: occurrence of n-gram (%) for n = 4; y-axis from 0 to 0.09; 4-grams on the x-axis: AAAA, AAAL, SDGI, LAAA, MPSE, ALAA, LAAL, AALA, AVAA, AALL, AAAV, GRLK, LLAA, EAAA, AEAA, AAEA, ALLA, AAVA, AVLA, AAAE]

n-gram = sequence of length n
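As a rough illustration of what "occurrence of n-gram (%)" could mean, the sketch below counts overlapping n-grams in a protein sequence and converts the counts to percentages. It is not the BLMT implementation: the function name `ngram_percentages`, the use of overlapping windows, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter


def ngram_percentages(sequence: str, n: int = 4) -> dict[str, float]:
    """Return each overlapping n-gram's share of all n-gram windows, in percent.

    Assumption: windows overlap (step of one residue). The slides do not say
    how the toolkit slides its window.
    """
    windows = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {gram: 100.0 * count / total for gram, count in counts.items()}


# Hypothetical toy sequence, not taken from any genome in the study.
toy = "MPSEAAAAMPSE"
percentages = ngram_percentages(toy, n=4)

# Print the n-grams from most to least frequent.
for gram, pct in sorted(percentages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gram}\t{pct:.2f}%")
```

On a whole proteome the same function would be applied to each protein and the counts pooled before converting to percentages.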

Use of BLMT

• N-gram statistical analysis gives detailed statistics: the frequency of each n-gram, together with the mean and standard deviation of those frequencies.

• We have taken 45 organisms into consideration: bacteria, archaea, mycoplasmas and human.

• Search for n-grams whose frequencies lie many standard deviations away from the mean.

• This indicates the difference between the expected and observed frequencies of the n-grams.

• Ultimately, this shows how unusual an n-gram is in one organism compared with the others.
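The "standard deviations away from the mean" measure can be sketched as a simple z-score. The example assumes that the mean and standard deviation are taken over the frequencies of one n-gram in the other organisms. The slides do not state this, so treat it as one possible reading. The function name and the frequency values are hypothetical.

```python
from statistics import mean, stdev


def standard_deviations_from_mean(frequencies: dict[str, float], organism: str) -> float:
    """Number of standard deviations by which one organism's n-gram frequency
    differs from the mean frequency in the other organisms.

    Assumption: the reference mean and (sample) standard deviation exclude the
    organism under test. Needs at least two other organisms and a non-zero spread.
    """
    others = [freq for name, freq in frequencies.items() if name != organism]
    return (frequencies[organism] - mean(others)) / stdev(others)


# Hypothetical frequencies (%) of a single 4-gram in four organisms.
frequencies = {
    "organism_A": 0.080,
    "organism_B": 0.010,
    "organism_C": 0.012,
    "organism_D": 0.008,
}
z = standard_deviations_from_mean(frequencies, "organism_A")
print(f"organism_A is {z:.1f} standard deviations above the mean")
```

A large positive value marks an n-gram that is unusually frequent in the chosen organism, and a large negative value marks one that is unusually rare.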

Difference Between Expected and Observed Frequencies

[Figure: difference between expected and observed frequency, plotted per n-gram. Legend: Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]

Positive values indicate over-represented n-grams, while negative values indicate under-represented n-grams.
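The slides do not say how the expected frequency is computed. One common choice, assumed here purely for illustration, is to multiply the individual amino acid frequencies as if residues were independent. The sketch below computes observed minus expected frequency under that assumption. The function name and the toy sequence are hypothetical.

```python
from collections import Counter
from math import prod


def observed_minus_expected(sequence: str, n: int = 4) -> dict[str, float]:
    """Observed n-gram frequency minus the frequency expected if residues were
    independent (assumed null model, not confirmed by the slides).
    Only n-grams that occur at least once are reported."""
    residue_freq = {aa: count / len(sequence) for aa, count in Counter(sequence).items()}
    windows = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    total = len(windows)
    differences = {}
    for gram, count in Counter(windows).items():
        observed = count / total
        expected = prod(residue_freq[aa] for aa in gram)
        differences[gram] = observed - expected
    return differences


# Hypothetical toy sequence; n = 2 keeps the arithmetic easy to follow.
differences = observed_minus_expected("AGAGAGAGAA", n=2)
for gram, diff in sorted(differences.items(), key=lambda kv: kv[1], reverse=True):
    label = "over-represented" if diff > 0 else "under-represented"
    print(f"{gram}\t{diff:+.3f}\t{label}")
```

In this toy case AG and GA come out positive and AA comes out negative, matching the sign convention in the caption above.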

Initial points of the graph of difference between expected and observed frequency

Ureaplasma shows high difference values (approx. 0.00021), indicating that its n-grams are over-represented compared to their expected probability of occurrence in the organism.

[Figure legend: Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]

Standard deviation away from the mean

• Mycoplasma genitalium (black)

• M. tuberculosis (red)

• M. leprae (green)

• Mesorhizobium (blue)

• Lactococcus (yellow)

Shows the distribution of n-gram standard deviations, with both high and low difference values, indicating the over-represented and under-represented n-grams.

Highest standard deviations away from the mean

Shows the initial (highest) values of standard deviations away from the mean. The n-grams of M. tuberculosis reach much higher values than those of M. leprae.

[Figure legend: Mycoplasma genitalium (black), M. tuberculosis (red), M. leprae (green), Mesorhizobium (blue), Lactococcus (yellow)]

Comparison of genome size with varying standard deviations

• Examine the relationship between genome size and the distribution of n-gram standard deviations for each organism.

• Human genome taken as reference.

• Compare genome size and standard deviations within the same genus but across different species.

Size Distribution of Genomes

1. Human 22889476
2. Bacteria_Mesorhizobium_loti 4080256
3. Bacteria_Pseudomonas_aeruginosaPA01 3730192
4. Bacteria_Escherichia_coliO157H7 3229098
5. Bacteria_Escherichia_coliO157H7EDL933 3228100
6. Bacteria_Escherichia_coliK12 2726558
7. Bacteria_Mycobacterium_tuberculosisH37Rv 2666338
8. Bacteria_Bacillus_subtilis 2442200
9. Bacteria_Bacillus_halodurans_C125 2384352
10. Bacteria_SynechocystisPCC6803 2072748
11. Bacteria_Vibrio_cholerae_chr1 1725852
12. Bacteria_Deinococcus_radioduransR1_chr1 1559376
13. Bacteria_Xylella_fastidiosa 1490262
14. Archaea_Archaeoglobus_fulgidus 1343990
15. Bacteria_Pasteurella_multocida 1340102
16. Bacteria_Lactococcus_lactis_subsp_lactis 1335222
17. Archaea_Aeropyrum_pernix 1280062
18. B_Neisseria_meningitidis_serogroupBstrainMC58 1178096
19. Archaea_Halobacterium_spNRC1 1178038
20. B_Neisseria_meningitidis_serogroupAstrainZ2491 1176104
21. Bacteria_Thermotoga_maritima 1167344
22. Bacteria_Pyrococcus_horikoshiiOT3 1141216
23. Bacteria_Mycobacterium_leprae_strainTN 1080756
24. A_Methanobacterium_thermoautotrophicum_deltaH 1054752
25. Bacteria_Haemophilus_influenzaeRd 1045572
26. Bacteria_Campylobacter_jejuni 1020944
27. Bacteria_Helicobacter_pylori_strainJ99 990942
28. Bacteria_Helicobacter_pylori26695 986258
29. Archaea_Methanococcus_jannaschii 970558
30. Bacteria_Aquifex_aeolicus 968068
31. Archaea_Thermoplasma_acidophilum 909164
32. Archaea_Thermoplasma_volcanium 903228
33. Bacteria_Chlamydophila_pneumoniaeJ138 735350
34. Bacteria_Chlamydophila_pneumoniaeCWL029 725492
35. Bacteria_Chlamydophila_pneumoniaeAR39 729896
36. Bacteria_Treponema_pallidum 703414
37. Bacteria_Chlamydia_muridarum 646712
38. Bacteria_Chlamydia_trachomatis 626142
39. Bacteria_Rickettsia_prowazekii_strain_MadridE 559828
40. Bacteria_Mycoplasma_pneumoniae 480870
41. Bacteria_Ureaplasma_urealyticum 457608
42. Bacteria_Buchnera_sp_APS 371470
43. Mycoplasma genitalium 352826
44. Bacteria_Borrelia_burgdorferi 300106

Genome size graph and varying standard deviation values

The organisms are listed in descending order of genome size. The relation between the distribution of n-gram standard deviations and genome size is compared.

• Human (black, 22889476)

• Mesorhizobium (red, 4080256)

• P. aeruginosa (green, 3730192)

• E. coli O157:H7 (blue, 3229098)

• E. coli O157:H7 EDL933 (yellow, 3228100)

Tail end of genome size and n-gram distribution of standard deviations

Although the human genome is the largest, its n-grams lie fewer standard deviations away from the mean than those of the smaller genomes.

[Figure legend: Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E. coli O157:H7 (blue, 3229098), E. coli O157:H7 EDL933 (yellow, 3228100)]

Initial points: genome size and n-gram distribution of standard deviations

The human n-gram standard deviation values are almost equal to those of Mesorhizobium, although Mesorhizobium has a much smaller genome.

[Figure legend: Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E. coli O157:H7 (blue, 3229098), E. coli O157:H7 EDL933 (yellow, 3228100)]

Genome size and n-gram distribution of standard deviations

M. tuberculosis has very high n-gram standard deviation values. They exceed those of human, despite its smaller genome size.

• Human (black, 22889476)

• E. coli K12 (red, 2726558)

• M. tuberculosis (green, 2666338)

• B. subtilis (blue, 2442200)

• B. halodurans (yellow, 2384352)

• Synechocystis (brown, 2072748)

Initial points of genome size and n-gram distribution of standard deviations

The thickness of the lines indicates the genome size. The thinnest line represents E. coli K12.

Mycobacterium tuberculosis shows the highest values.

[Figure legend: Human (black, 22889476), E. coli K12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)]

Final points of genome size and n-gram distribution of standard deviations

M. tuberculosis and all the other organisms shown here have n-grams with higher difference values than human.

[Figure legend: Human (black, 22889476), E. coli K12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)]

Same genus / different species

• 4-grams in M. tuberculosis lie many more standard deviations from the mean than 4-grams in M. leprae

Mycobacterium

M. tuberculosis:
GAGG 179, GGAG 175, GNGG 102, GGNG 79, AAAA 68, AGGA 65, GTGG 58, GGTG 55, GDGG 46, GGDG 42, LAAA 37, GSGG 32, GGSG 31, NGGA 30, ALAA 29, NGGN 29, AGGN 26, GVGG 25, GGVG 25, AALA 24, VAAA 23

M. leprae:
AAAA 47, LAAA 39, AALA 32, AVAA 31, AAAV 29, ALAA 28, VAAA 27, VAAL 26, AELA 26, AAVA 25, LAAL 25, ELAA 24, LAGL 22, AAAL 22, TAAA 22, LAEL 21

Other Organisms

H:
EEEE 107, PPPP 95, AAAA 89, SSSS 86, GGGG 63, LLLL 59, QQQQ 55, HTGE 47, GPPG 46, GEKP 40, TGEK 39, EKPY 34, ECGK 32, PPGP 32, PGPP 31, KKKK 30, AMAA 26, RSRS 25, CGKA 25, EEED 24, GKAF 23, EDEE 22, IHTG 22, PPAP 22, DEEE 21

Thermotoga maritima:
AMKK 31, EAMK 28, LKEK 28, LEEI 26, EILK 25, GKTT 24, LEEL 24, EILE 24, EKLK 23, EELK 23, LEKL 23, EALK 22, KALE 22, EEIE 22, LKKL 22, LLEK 21

Synechocystis spec.:
QAIA 64, LAIA 63, TAIA 61, GDRL 59, AIAA 49, EAIA 47, GDRQ 46, AIAV 44, AAIA 42, AIAK 39, AIAL 39, GAIA 36, VAIA 36, AIAD 30, AIAS 29, EPEP 27, AIAG 27, PEPE 27, AIAE 26, AIAI 26, KAIA 23, AIAR 23, LGDR 22, MAIA 22

Neisseria meningitidis:
SDGI 55, MPSE 50, AAAA 49, GRLK 34, AAAL 32, LAAA 26, AVAA 24, ALAA 24, AAAV 23, FQTA 23, AAEA 23, EAAA 22, QTAL 21, AVAM 21

Haemophilus influenzae:
LTAL 75, KSAV 45, TALL 40, AMKK 37, TALS 32, SAVK 31, KAMK 30, ESAV 30, STAL 28, SAVE 27, KKAM 27, TALF 26, LSGG 22, QSAV 21, KLTA 21, GKST 21
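To make the same-genus comparison concrete, the sketch below takes the M. tuberculosis and M. leprae 4-gram lists from this slide and separates the 4-grams that both species share from those that appear in only one list. The variable names are chosen for this example, and the set comparison is only one simple way to look for species-specific candidates, not a method described in the slides.

```python
# Values transcribed from the Mycobacterium lists on this slide (4-gram -> listed value).
m_tuberculosis = {
    "GAGG": 179, "GGAG": 175, "GNGG": 102, "GGNG": 79, "AAAA": 68,
    "AGGA": 65, "GTGG": 58, "GGTG": 55, "GDGG": 46, "GGDG": 42,
    "LAAA": 37, "GSGG": 32, "GGSG": 31, "NGGA": 30, "ALAA": 29,
    "NGGN": 29, "AGGN": 26, "GVGG": 25, "GGVG": 25, "AALA": 24,
    "VAAA": 23,
}
m_leprae = {
    "AAAA": 47, "LAAA": 39, "AALA": 32, "AVAA": 31, "AAAV": 29,
    "ALAA": 28, "VAAA": 27, "VAAL": 26, "AELA": 26, "AAVA": 25,
    "LAAL": 25, "ELAA": 24, "LAGL": 22, "AAAL": 22, "TAAA": 22,
    "LAEL": 21,
}

# 4-grams present in both lists cannot tell the two species apart.
shared = set(m_tuberculosis) & set(m_leprae)

# 4-grams present in only one list, ordered by their listed value.
only_tuberculosis = sorted(
    set(m_tuberculosis) - set(m_leprae), key=m_tuberculosis.get, reverse=True
)
only_leprae = sorted(set(m_leprae) - set(m_tuberculosis), key=m_leprae.get, reverse=True)

print("shared:", sorted(shared))
print("only M. tuberculosis:", only_tuberculosis)
print("only M. leprae:", only_leprae)
```

The alanine-rich 4-grams fall into the shared set, while the glycine-rich ones such as GAGG and GGAG appear only in the M. tuberculosis list.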

Conclusions

• n-grams that are at least 30 standard deviations away from the mean are significant candidates for genome signatures.

• Difference graphs: estimate how likely an n-gram is to be observed in an organism.

• Genome size graphs: there is no specific relationship between the size of a genome and its standard deviation values.

• Same genus, different species (with genome size specified): there is a noticeable difference between the Mycobacterium species M. leprae and M. tuberculosis.
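The 30-standard-deviation criterion from the first conclusion can be sketched as a simple filter. The example uses the Neisseria meningitidis 4-gram list from the slides and assumes that the listed numbers are standard deviations from the mean, which the slides imply but do not label explicitly. The function name `signature_candidates` is hypothetical.

```python
def signature_candidates(values: dict[str, int], threshold: int = 30) -> list[str]:
    """Return the n-grams whose value meets the threshold, highest first."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    return [gram for gram, value in ranked if value >= threshold]


# Values transcribed from the Neisseria meningitidis list in the slides.
# Assumption: each number is the count of standard deviations from the mean.
neisseria = {
    "SDGI": 55, "MPSE": 50, "AAAA": 49, "GRLK": 34, "AAAL": 32,
    "LAAA": 26, "AVAA": 24, "ALAA": 24, "AAAV": 23, "FQTA": 23,
    "AAEA": 23, "EAAA": 22, "QTAL": 21, "AVAM": 21,
}
candidates = signature_candidates(neisseria)
print(candidates)
```

A 4-gram such as AAAA passes the threshold here but also ranks highly in other organisms in the slides, so a practical filter would likely also check that a candidate is not prominent elsewhere.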

Current and future work

• Find n-gram signatures in E. coli.

• Explore the relationship between genome size and the distribution of n-gram standard deviations across different species of the same genus.

• Find more specific targets to differentiate species, in terms of signature peptides, for all 44 organisms taken for study.