Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin

Upload: julian-reed

Post on 21-Jan-2016



Welcome to Introduction to BioinformaticS

• Intro to Scenario 8 Identification of genes of foreign origin


Scenario 1: Comparison of genomes of pathogenic and nonpathogenic E. coli

E. coli K12 (nonpathogenic) vs. E. coli O157:H7 (pathogenic)


E. coli: What makes it kill?

Gene finder

TCTACTTATA TTCAATCCAC AGGGCTACACAAGAGTCTGT TGAATGAACA CATACATGGTTTCTGTCTGC TCTGACCTCT GGCAGCTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCGTAAAC CTCTAACATG ATGTCAGCAA TGAATAAACT TTGTTAAAGG TACAAATGAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT AAACCTGTAT GGTTACATGA ACTGCCTAAA TTATATATTT TAAGAAATTA ATTGCAATTA CCCCAGCTGT CATTAAAAAG AGGCAAATAC GACAGCACTG ACCCTCAAGA AGGCACCGGC GCTGAAATTC CGCTGAGAGC AGAGTGGTAC CCCTGCACCA GGTCTTTCCT GTGGGCACTG ATGAATGACT GAACGAACGA TTGAATGAAA


Blast + parser

Pathogen-specific (~1000!)

?

Pathogenesis-specific


E. coli: What makes it kill?

DNA: pathogen vs. nonpathogen

• Virus-related genes

• Disease-related genes

• Pathogenicity islands (PAI)

E. coli: What makes it kill?

Gene finder, then Blast + parser

Pathogen-specific (~1000!)

?

Pathogenesis-specific

Foreign genes


How to find foreign genes?

Current methods

• % GC

• Dinucleotide frequencies

• Codon bias

• Amino acid bias


How to find foreign genes?

Current methods: % GC

%GC of genes of Prochlorococcus (values given as fractions; list truncated in the source)

0.29274613 0.27824858 0.32350427 0.321013   0.28214577
0.24678363 0.3089947  0.32791328 0.2984127  0.32879817
0.2797619  0.3152174  0.46096095 0.31343284 0.34343433
0.32916668 0.37955555 0.26495728 0.28431374 0.32193732
0.29405162 0.26300985 0.3646139  0.29989868 0.33516484
0.36720142 0.3510848  0.31604227 0.2984234  0.23024055
0.34567901 0.34285715 0.38206628 0.33838382 0.32380953
0.28865057 0.33333334 0.2717087  0.33004925 0.31546623
0.29333332 0.3200222  0.3412162  0.32882375 0.29943502
0.35947713 0.32882884 0.34351662 0.26504064 0.32506204
0.39583334 0.29283488 0.47674417 0.296      0.30749354
0.3116883  0.2754821  0.30488145 0.36578172 0.28992248
0.6219512  0.37037036 0.29738563 0.42553192 0.36120042
0.26923078 0.3197176  0.27430555 0.32748538 0.32178217
0.28615862 0.30438185 0.272578   0.32188296 0.35273737
0.28374657 0.3477868  0.31501833 0.31860465 0.2952816
0.2905983  0.3215859  0.3195021  0.29113925 0.3432836
0.30982906 0.34751773 0.35598704 0.30678466 0.32900432
0.29574862 0.30287206 0.3068182  0.3878205  0.3469388
0.25382262 0.34519956 0.3283208  0.26345214 0.2958245...
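One way to use a list like this is to flag genes whose GC fraction sits far from the genome-wide average. The sketch below is only an illustration: the function names, the z-score rule, and the cutoff of 2 are assumptions, not something the slides specify. SAMPLE holds 15 of the values listed above.

```python
from statistics import mean, stdev

def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T characters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T")
    return (seq.count("G") + seq.count("C")) / acgt

def gc_outliers(values, z_cutoff=2.0):
    """Return (index, value, z-score) for values far from the mean.

    The z-score rule and the default cutoff are illustrative assumptions.
    """
    mu, sigma = mean(values), stdev(values)
    return [(i, v, (v - mu) / sigma)
            for i, v in enumerate(values)
            if abs(v - mu) > z_cutoff * sigma]

# Fifteen of the per-gene GC fractions from the listing above.
SAMPLE = [
    0.39583334, 0.29283488, 0.47674417, 0.296, 0.30749354,
    0.3116883, 0.2754821, 0.30488145, 0.36578172, 0.28992248,
    0.6219512, 0.37037036, 0.29738563, 0.42553192, 0.36120042,
]

for index, value, z in gc_outliers(SAMPLE):
    print(f"gene {index}: GC = {value:.3f}, z = {z:.2f}")
```

With this sample only the gene at 0.6219512 exceeds the cutoff. A single number per gene is a weak signal, which is one reason the slides go on to other measures.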


How to find foreign genes?

Dinucleotide frequencies

Gene from Prochlorococcus marinus MED4:

ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATTCAATTAGTTAGTAAAGCAGTTGCTTCAAGACCTACGCATCCTATCCTTGCAAATTTACTTCTAACAGCTGATCAAGGTACTAATAAAATTAGTTTAACTGGATTTGATTTGAATCTAGGAATACAAACTTCATTTGATGCAACTGTAAACAAAAGTGGAGCAATTACAATTCCATCTAAACTTTTATCTGAAATAGTTAATAAACTACCAAGCGAAACTCCTGTCTCTCTTGATGTTGATGAGAGTTCTGACAATATTTTAATTAAAAGTGATAGGGGTTCTTTTAATATTAAAGGTATTCCATCAGACGATTACCCAAGCTTACCGTTTGTAGAAAGTGGTACATCTTTGAATATTGATCCAAGTTCTTTTTTAAAAGCTTTAAAATTAACTATATTCGCTAGTAGTAGTGATGATTCAAAGCAATTACTCACAGGAGTAAATTTTACATTTAATTTAAAATATTTGGAGTCAGCTGCAACAGATGGGCATAGATTGGCTGTTGTTTTGGTTGATAACAAAGAAAATTTTGATGAAAAAGAAGATTTTGCTTCAAATGAAGAAAACTTATCAGTTACTATACCAACAAGATCTTTAAGAGAAATTGAAAAGCTTGTTAGCCTTAGAAGTTCTGAAAATTCAATTAAACTTTTCTATGACAAAGGTCAAGTAGTATTTATTTCCTCTAATCAAATAATTACTACTAGAACCCTTGAAGGTTCTTATCCAAATTATTCTCAATTAATACCTGATAATTTTACTAAAATTTTTACATTTAATACAAAAAAAATAATCGAATCACTTGAAAGAATAGCAGTTTTAGCAGACCAACAAAGTAGTGTCGTTAAGATTAAACTTAATGAAAAGGATTTAGCATTAGTCAGTGCTGATGCTCAAGACATAGGGAATGCCAGCGAATTAGTTCCTGTATCTTTTTATTTTGATCAATTTGATATAGCTTTTAATGTAAGGTATTTATTAGAAGGTTTAAAAGTTATATCAAGTGAAAATGTAATTTTTAAATGTAATCTTCCAACTACTCCAGCTGTTTTAGTTCCAGAAGATAATATTAATTCTTTTACGTATTTAGTCATGCCTGTTCAAGTCCGTTCTTAA

Dinucleotide counts (rows: first nucleotide; columns: second nucleotide)

      A    C    G    T
A   167   47   80  116
C    61   26   10   65
G    65   30   23   59
T   117   59   64  168

ρ*XY = f*XY / (f*X · f*Y)
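As a worked illustration of the formula, the sketch below turns the count table above into odds ratios. Two simplifications are assumptions of this sketch: single-nucleotide frequencies are approximated by the row sums of the table, and the frequencies are taken from the one strand shown. If the asterisk denotes strand-symmetrized frequencies (the usual formulation of this measure combines the sequence with its reverse complement), that step is not done here.

```python
# Dinucleotide counts copied from the table above.
# COUNTS[x][y] is read as "x immediately followed by y".
COUNTS = {
    "A": {"A": 167, "C": 47, "G": 80, "T": 116},
    "C": {"A": 61, "C": 26, "G": 10, "T": 65},
    "G": {"A": 65, "C": 30, "G": 23, "T": 59},
    "T": {"A": 117, "C": 59, "G": 64, "T": 168},
}

def odds_ratios(counts):
    """Return rho_XY = f_XY / (f_X * f_Y) for all 16 dinucleotides."""
    total = sum(sum(row.values()) for row in counts.values())
    # Approximation: base frequencies taken from the row sums of the table.
    base_freq = {x: sum(row.values()) / total for x, row in counts.items()}
    return {
        x + y: (counts[x][y] / total) / (base_freq[x] * base_freq[y])
        for x in counts
        for y in counts[x]
    }

rho = odds_ratios(COUNTS)
for pair in ("AA", "CG", "GC", "TA"):
    print(pair, round(rho[pair], 3))
```

Under these assumptions ρCG comes out near 0.40, meaning CG occurs far less often than the base composition alone would predict, while ρAA is about 1.15.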


How to find foreign genes?

Dinucleotide frequencies

Study Question 3: Calculate ρ*AA in the following 50-nt sequence:

TGATGACAGTCGATTTTTCGGTAGGATAACTGCCATGCCTCTCAAAGTAC

ρ*XY = f*XY / (f*X · f*Y)
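A hand calculation can be checked with a few lines of code. This is a minimal sketch under one stated assumption: frequencies are taken from the single strand given, with f_XY computed over the 49 overlapping dinucleotide windows. If the course intends strand-symmetrized frequencies for the starred quantities, the number will differ. The function name rho is hypothetical.

```python
from collections import Counter

def rho(seq: str, pair: str) -> float:
    """Odds ratio f_XY / (f_X * f_Y) for one dinucleotide, single strand."""
    seq = seq.upper()
    x, y = pair
    mono = Counter(seq)
    # Overlapping windows, so "AAA" contributes two "AA" dinucleotides.
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    f_x = mono[x] / len(seq)
    f_y = mono[y] / len(seq)
    f_xy = di[pair] / (len(seq) - 1)
    return f_xy / (f_x * f_y)

SEQ = "TGATGACAGTCGATTTTTCGGTAGGATAACTGCCATGCCTCTCAAAGTAC"

print(len(SEQ))         # sequence length
print(SEQ.count("A"))   # number of A
print(rho(SEQ, "AA"))   # odds ratio for AA
```

The sequence has 13 A and 3 AA dinucleotides, so under these assumptions the ratio is (3/49) / (13/50)², roughly 0.91.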


How to find foreign genes?

Dinucleotide frequencies

δ*(f, g) = (1/16) Σ | ρ*XY(f) - ρ*XY(g) |, summed over all 16 dinucleotides XY


Calculate δ*(human,mouse)?
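The distance is simple to compute once each sequence has a 16-value ρ profile. The sketch below assumes single-strand frequencies (no strand symmetrization) and sets ρ to 0 when a base is absent; both are choices of this sketch. The two input strings are hypothetical toy sequences, not human or mouse data, so the printed number only demonstrates the mechanics.

```python
from collections import Counter
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def rho_profile(seq: str) -> dict:
    """Map each of the 16 dinucleotides to f_XY / (f_X * f_Y)."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n, windows = len(seq), len(seq) - 1
    profile = {}
    for d in DINUCS:
        expected = (mono[d[0]] / n) * (mono[d[1]] / n)
        # Assumption of this sketch: rho is 0 when a base never occurs.
        profile[d] = (di[d] / windows) / expected if expected else 0.0
    return profile

def delta(seq_f: str, seq_g: str) -> float:
    """Mean absolute difference between the two rho profiles."""
    pf, pg = rho_profile(seq_f), rho_profile(seq_g)
    return sum(abs(pf[d] - pg[d]) for d in DINUCS) / 16

# Hypothetical toy inputs standing in for two genomes.
TOY_F = "ATATATATGCGCGC"
TOY_G = "AAAACCCCGGGGTTTT"
print(delta(TOY_F, TOY_G))
```

A real comparison would use long stretches of each genome, since short sequences give noisy dinucleotide frequencies.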


How to find foreign genes?

[Figure: Cholera toxin locus, 100 kbases]


How to find foreign genes?

Current methods

• % GC

• Dinucleotide frequencies

• Codon bias

• Amino acid bias

Good for large blocks of nucleotides

Not as good for individual genes

Need a more information-dense indicator


How to find foreign genes?

Markov Models

Building the model

Training set: tally what follows each triplet (AAA, AAC, AAG, AAT, ACA, . . ., TTG, TTT)

AAAA: 10%

AAAC: 15%

AAAG: 40%

AAAT: 35%


How to find foreign genes?

Markov Models

Building the model

Training set: tally what follows each triplet (AAA, AAC, AAG, AAT, ACA, . . ., TTG, TTT)

AACA: 25%

AACC: 45%

AACG: 25%

AACT: 5%


How to find foreign genes?

Markov Models

Using the model

       A     C     G     T
AAA  0.10  0.15  0.40  0.35
AAC  0.25  0.45  0.25  0.05
AAG  0.25  0.20  0.30  0.25
AAT  0.25  0.20  0.30  0.25
ACA  0.15  0.20  0.25  0.40
. . .
TTG  0.20  0.50  0.05  0.25
TTT  0.10  0.55  0.25  0.10

Candidate gene: AAAACAA…

First lookup: AAA followed by A = 0.10

3rd order Markov model
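The two steps on these slides, building the table from a training set and then walking a candidate gene through it, can be sketched as follows. The function names, the pseudocount, and the uniform 0.25 fallback for unseen contexts are assumptions of this sketch. SLIDE_MODEL copies only the three table rows needed to score the candidate AAAACAA.

```python
import math
from collections import Counter, defaultdict

ORDER = 3

def train(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values()) + 4 * pseudocount
        model[context] = {b: (nxt[b] + pseudocount) / total for b in "ACGT"}
    return model

def log_likelihood(seq, model, order=ORDER):
    """Sum of log P(base | context) along seq.

    Contexts or bases missing from the model fall back to 0.25
    (an assumption of this sketch).
    """
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs.get(seq[i], 0.25) if probs else 0.25
        total += math.log(p)
    return total

# Rows copied from the table above (only the contexts this candidate visits).
SLIDE_MODEL = {
    "AAA": {"A": 0.10, "C": 0.15, "G": 0.40, "T": 0.35},
    "AAC": {"A": 0.25, "C": 0.45, "G": 0.25, "T": 0.05},
    "ACA": {"A": 0.15, "C": 0.20, "G": 0.25, "T": 0.40},
}

# AAAACAA visits AAA->A, AAA->C, AAC->A, ACA->A.
print(math.exp(log_likelihood("AAAACAA", SLIDE_MODEL)))
```

For the candidate this multiplies 0.10 × 0.15 × 0.25 × 0.15. Logs are summed rather than multiplying raw probabilities so that long genes do not underflow. A gene that scores poorly under a model trained on the host's own genes is a candidate for foreign origin.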


How to find foreign genes?

Markov Models

• Analyze sequence to build the model

• Compare test sequence to model

• Produce new sequence per model


How to find foreign genes?

Markov Models

• Analyze sequence to build the model

• Produce new sequence per model

Take a test run through Hamlet.pl
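Hamlet.pl itself is not reproduced in these slides. The Python sketch below is not that script; it only illustrates the "produce new sequence per model" step on ordinary text, using a 3rd order character model. The training string, function names, and seed are all hypothetical stand-ins.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each `order`-character context to the list of characters seen after it."""
    followers = defaultdict(list)
    for i in range(len(text) - order):
        followers[text[i:i + order]].append(text[i + order])
    return followers

def generate(model, seed, length, order=3, rng=None):
    """Extend `seed` one character at a time by sampling from the model."""
    rng = rng or random.Random()
    out = seed
    while len(out) < length:
        choices = model.get(out[-order:])
        if not choices:  # context has no recorded follower
            break
        # Repeated entries in the list make frequent followers more likely.
        out += rng.choice(choices)
    return out

# Hypothetical stand-in for a training text.
TRAINING_TEXT = "to be or not to be that is the question whether tis nobler in the mind to suffer"
MODEL = build_model(TRAINING_TEXT, order=3)

print(generate(MODEL, "to ", 60, order=3, rng=random.Random(1)))
```

Every four-character window of the output also occurs in the training text, which is why higher-order models produce text (or DNA) that looks locally like the training set.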


Scenario 8: Gene Identification

How do you tell if an ORF is real?

Genetic Code

UUU Phe   UCU Ser   UAU Tyr     UGU Cys
UUC Phe   UCC Ser   UAC Tyr     UGC Cys
UUA Leu   UCA Ser   UAA ochre   UGA opal
UUG Leu   UCG Ser   UAG amber   UGG Trp
CUU Leu   CCU Pro   CAU His     CGU Arg
CUC Leu   CCC Pro   CAC His     CGC Arg
CUA Leu   CCA Pro   CAA Gln     CGA Arg
CUG Leu   CCG Pro   CAG Gln     CGG Arg
AUU Ile   ACU Thr   AAU Asn     AGU Ser
AUC Ile   ACC Thr   AAC Asn     AGC Ser
AUA Ile   ACA Thr   AAA Lys     AGA Arg
AUG Met   ACG Thr   AAG Lys     AGG Arg
GUU Val   GCU Ala   GAU Asp     GGU Gly
GUC Val   GCC Ala   GAC Asp     GGC Gly
GUA Val   GCA Ala   GAA Glu     GGA Gly
GUG Val   GCG Ala   GAG Glu     GGG Gly

The code is degenerate

Are codons equally used?
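Whether codons are equally used can be checked by counting. The sketch below builds the standard genetic code table and reports, for each amino acid seen in an ORF, the share of each synonymous codon. The function name is hypothetical, and the example ORF is the short sequence that appears elsewhere in these slides. Only codons that actually occur are listed.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(orf: str) -> dict:
    """For each amino acid in the ORF, return {codon: share among its synonyms}."""
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    by_aa = defaultdict(dict)
    for codon, n in counts.items():
        by_aa[CODON_TABLE[codon]][codon] = n
    return {
        aa: {codon: n / sum(cs.values()) for codon, n in cs.items()}
        for aa, cs in by_aa.items()
    }

ORF = "ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT"
for aa, shares in sorted(codon_usage(ORF).items()):
    print(aa, shares)
```

A 12-codon fragment is far too short to show real bias; the same function applied to all genes of a genome gives the usage table that a candidate ORF can be compared against.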


Genetic Code (human)

The third position is biased

Most frequently used codons


ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT


PSSM for third position?

If AT: G

If CG: G, C, A, T

Third order Markov Chain
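The "if AT, then G" idea amounts to tabulating the third base of each codon conditional on the first two. A minimal sketch follows, with a hypothetical function name and the short sequence from this slide as the only input. A usable table would be built from many known genes.

```python
from collections import Counter, defaultdict

def third_position_table(orfs):
    """Return {first two bases: {third base: relative frequency}} over in-frame codons."""
    counts = defaultdict(Counter)
    for orf in orfs:
        orf = orf.upper()
        for i in range(0, len(orf) - 2, 3):
            codon = orf[i:i + 3]
            counts[codon[:2]][codon[2]] += 1
    return {
        prefix: {base: n / sum(c.values()) for base, n in c.items()}
        for prefix, c in counts.items()
    }

SEQ = "ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT"
TABLE = third_position_table([SEQ])
for prefix in sorted(TABLE):
    print(prefix, TABLE[prefix])
```

Read in frame, each row of this table is the conditional distribution of the next base given the two bases before it, which is how codon bias connects to the Markov models used earlier.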