![Page 1: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/1.jpg)
Welcome to Introduction to BioinformaticS
• Intro to Scenario 8: Identification of genes of foreign origin
![Page 2: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/2.jpg)
Scenario 1: Comparison of genomes of pathogenic and nonpathogenic E. coli
E. coli K12 (nonpathogenic) vs. E. coli O157:H7 (pathogenic)
![Page 3: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/3.jpg)
E. coli: What makes it kill?
Gene finder
TCTACTTATA TTCAATCCAC AGGGCTACAC AAGAGTCTGT TGAATGAACA CATACATGGT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCGTAAAC CTCTAACATG ATGTCAGCAA TGAATAAACT TTGTTAAAGG TACAAATGAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT AAACCTGTAT GGTTACATGA ACTGCCTAAA TTATATATTT TAAGAAATTA ATTGCAATTA CCCCAGCTGT CATTAAAAAG AGGCAAATAC GACAGCACTG ACCCTCAAGA AGGCACCGGC GCTGAAATTC CGCTGAGAGC AGAGTGGTAC CCCTGCACCA GGTCTTTCCT GTGGGCACTG ATGAATGACT GAACGAACGA TTGAATGAAA
BLAST + parser
Pathogen-specific (~1000!)
?
Pathogenesis-specific
![Page 4: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/4.jpg)
E. coli: What makes it kill?
DNA Pathogen
Nonpathogen
Virus-related genes
Disease-related genes
Pathogenicity Islands (PAI)
![Page 5: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/5.jpg)
E. coli: What makes it kill?
Gene finder
BLAST + parser
Pathogen-specific (~1000!)
?
Pathogenesis-specific
Foreign genes
![Page 6: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/6.jpg)
How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias
![Page 7: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/7.jpg)
How to find foreign genes?
Current methods
• % GC
0.29274613 0.27824858 0.32350427 0.321013 0.28214577
0.24678363 0.3089947 0.32791328 0.2984127 0.32879817
0.2797619 0.3152174 0.46096095 0.31343284 0.34343433
0.32916668 0.37955555 0.26495728 0.28431374 0.32193732
0.29405162 0.26300985 0.3646139 0.29989868 0.33516484
0.36720142 0.3510848 0.31604227 0.2984234 0.23024055
0.34567901 0.34285715 0.38206628 0.33838382 0.32380953
0.28865057 0.33333334 0.2717087 0.33004925 0.31546623
0.29333332 0.3200222 0.3412162 0.32882375 0.29943502
0.35947713 0.32882884 0.34351662 0.26504064 0.32506204
0.39583334 0.29283488 0.47674417 0.296 0.30749354
0.3116883 0.2754821 0.30488145 0.36578172 0.28992248
0.6219512 0.37037036 0.29738563 0.42553192 0.36120042
0.26923078 0.3197176 0.27430555 0.32748538 0.32178217
0.28615862 0.30438185 0.272578 0.32188296 0.35273737
0.28374657 0.3477868 0.31501833 0.31860465 0.2952816
0.2905983 0.3215859 0.3195021 0.29113925 0.3432836
0.30982906 0.34751773 0.35598704 0.30678466 0.32900432
0.29574862 0.30287206 0.3068182 0.3878205 0.3469388
0.25382262 0.34519956 0.3283208 0.26345214 0.2958245...
%GC of genes of Prochlorococcus (expressed as fractions, one value per gene)
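A per-gene GC screen of the kind listed above could be sketched as follows. The function names and the z-score cutoff of 2 are assumptions made for illustration; the slides do not say how unusual a gene's GC content must be before it is considered foreign.

```python
from statistics import mean, stdev

def gc_fraction(seq):
    """Return the fraction of G and C in a nucleotide sequence (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

def flag_outliers(values, z_cutoff=2.0):
    """Return indices of values lying more than z_cutoff sample standard
    deviations from the mean (hypothetical criterion, needs >= 2 values)."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sd > z_cutoff]

# Toy usage: many genes near 30% GC and one near 60% GC.
gc_values = [0.3] * 20 + [0.6]
print(flag_outliers(gc_values))
```

A cutoff based on the genome-wide mean is only one possible choice; it will miss foreign genes whose donor genome happens to have a similar GC content.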
![Page 11: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/11.jpg)
How to find foreign genes? Dinucleotide frequencies
Gene from Prochlorococcus marinus MED4
ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATTCAATTAGTTAGTAAAGCAGTTGCTTCAAGACCTACGCATCCTATCCTTGCAAATTTACTTCTAACAGCTGATCAAGGTACTAATAAAATTAGTTTAACTGGATTTGATTTGAATCTAGGAATACAAACTTCATTTGATGCAACTGTAAACAAAAGTGGAGCAATTACAATTCCATCTAAACTTTTATCTGAAATAGTTAATAAACTACCAAGCGAAACTCCTGTCTCTCTTGATGTTGATGAGAGTTCTGACAATATTTTAATTAAAAGTGATAGGGGTTCTTTTAATATTAAAGGTATTCCATCAGACGATTACCCAAGCTTACCGTTTGTAGAAAGTGGTACATCTTTGAATATTGATCCAAGTTCTTTTTTAAAAGCTTTAAAATTAACTATATTCGCTAGTAGTAGTGATGATTCAAAGCAATTACTCACAGGAGTAAATTTTACATTTAATTTAAAATATTTGGAGTCAGCTGCAACAGATGGGCATAGATTGGCTGTTGTTTTGGTTGATAACAAAGAAAATTTTGATGAAAAAGAAGATTTTGCTTCAAATGAAGAAAACTTATCAGTTACTATACCAACAAGATCTTTAAGAGAAATTGAAAAGCTTGTTAGCCTTAGAAGTTCTGAAAATTCAATTAAACTTTTCTATGACAAAGGTCAAGTAGTATTTATTTCCTCTAATCAAATAATTACTACTAGAACCCTTGAAGGTTCTTATCCAAATTATTCTCAATTAATACCTGATAATTTTACTAAAATTTTTACATTTAATACAAAAAAAATAATCGAATCACTTGAAAGAATAGCAGTTTTAGCAGACCAACAAAGTAGTGTCGTTAAGATTAAACTTAATGAAAAGGATTTAGCATTAGTCAGTGCTGATGCTCAAGACATAGGGAATGCCAGCGAATTAGTTCCTGTATCTTTTTATTTTGATCAATTTGATATAGCTTTTAATGTAAGGTATTTATTAGAAGGTTTAAAAGTTATATCAAGTGAAAATGTAATTTTTAAATGTAATCTTCCAACTACTCCAGCTGTTTTAGTTCCAGAAGATAATATTAATTCTTTTACGTATTTAGTCATGCCTGTTCAAGTCCGTTCTTAA
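Dinucleotide counts (rows: first nucleotide; columns: second nucleotide)

| | A | C | G | T |
|---|---|---|---|---|
| A | 167 | 47 | 80 | 116 |
| C | 61 | 26 | 10 | 65 |
| G | 65 | 30 | 23 | 59 |
| T | 117 | 59 | 64 | 168 |

$\rho^*_{XY} = \dfrac{f^*_{XY}}{f^*_X \, f^*_Y}$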
![Page 12: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/12.jpg)
How to find foreign genes? Dinucleotide frequencies
Study Question 3: Calculate $\rho^*_{AA}$ in the following 50-nt sequence:
TGATGACAGTCGATTTTTCGGTAGGATAACTGCCATGCCTCTCAAAGTAC
$\rho^*_{XY} = \dfrac{f^*_{XY}}{f^*_X \, f^*_Y}$
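The ratio above can be checked with a short script. This is a minimal sketch: the function names are made up for illustration, and the `both_strands` option reflects the common convention that the starred frequencies are computed over the sequence together with its reverse complement. The slides do not state which convention the course uses, so treat that option as an assumption.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def rho(seq, both_strands=False):
    """Return {dinucleotide: f_XY / (f_X * f_Y)} for all 16 dinucleotides.

    With both_strands=True the counts are pooled over the sequence and its
    reverse complement (assumed meaning of the starred quantities).
    Entries are NaN when X or Y does not occur at all.
    """
    seq = seq.upper()
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    mono, di = Counter(), Counter()
    for s in strands:
        mono.update(s)
        di.update(s[i:i + 2] for i in range(len(s) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    result = {}
    for x in "ACGT":
        for y in "ACGT":
            fx, fy = mono[x] / n_mono, mono[y] / n_mono
            fxy = di[x + y] / n_di
            result[x + y] = fxy / (fx * fy) if fx and fy else float("nan")
    return result

# The 50-nt sequence from Study Question 3.
question_seq = "TGATGACAGTCGATTTTTCGGTAGGATAACTGCCATGCCTCTCAAAGTAC"
print(rho(question_seq)["AA"])
print(rho(question_seq, both_strands=True)["AA"])
```

A value near 1 means the dinucleotide occurs about as often as expected from the single-nucleotide frequencies; values well below or above 1 indicate under- or over-representation.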
![Page 14: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/14.jpg)
How to find foreign genes? Dinucleotide frequencies
$\delta^*(f, g) = \dfrac{1}{16} \sum_{XY} \left| \rho^*_{XY}(f) - \rho^*_{XY}(g) \right|$
Calculate $\delta^*(\text{human}, \text{mouse})$?
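Given the 16 $\rho^*_{XY}$ values for two sequences, the distance above is just a mean absolute difference. The sketch below assumes each profile is a dictionary keyed by dinucleotide; the two profiles in the usage lines are invented numbers, not human or mouse data.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def delta_star(rho_f, rho_g):
    """Mean absolute difference between two dinucleotide profiles.

    rho_f and rho_g map each of the 16 dinucleotides to its rho value.
    """
    return sum(abs(rho_f[d] - rho_g[d]) for d in DINUCLEOTIDES) / 16

# Hypothetical profiles: one with no dinucleotide bias at all,
# one that differs only in a depleted CG dinucleotide.
unbiased = {d: 1.0 for d in DINUCLEOTIDES}
cg_depleted = dict(unbiased, CG=0.2)
print(delta_star(unbiased, cg_depleted))
```

To answer the human/mouse question one would first compute a profile for each genome (or a large sample of each) and then pass both profiles to `delta_star`.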
![Page 16: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/16.jpg)
How to find foreign genes?
[Figure: cholera toxin locus, 100 kbases]
![Page 18: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/18.jpg)
How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias
Good for large blocks of nucleotides
Not as good for individual genes
Need a more information-dense indicator
![Page 19: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/19.jpg)
How to find foreign genes? Markov Models
Building the model
Training set
AAA AAC AAG AAT ACA . . . TTG TTT
AAAA: 10%
AAAC: 15%
AAAG: 40%
AAAT: 35%
![Page 20: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/20.jpg)
How to find foreign genes? Markov Models
Building the model
Training set
AAA AAC AAG AAT ACA . . . TTG TTT
AACA: 25%
AACC: 45%
AACG: 25%
AACT: 5%
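Building such a table amounts to counting, for every 3-nucleotide context in the training set, which nucleotide comes next. The following is a minimal sketch; the function name and the choice to report zero for unseen transitions (no pseudocounts) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_markov(training_seqs, order=3):
    """Estimate P(next nucleotide | preceding `order` nucleotides).

    Returns {context: {"A": p, "C": p, "G": p, "T": p}} for every
    context that occurs in the training sequences.
    """
    counts = defaultdict(Counter)
    for seq in training_seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            counts[context][seq[i + order]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {b: following[b] / total for b in "ACGT"}
    return model

# Toy training set: AAA is followed once by A and once by C.
model = train_markov(["AAAAC"])
print(model["AAA"])
```

With a real training set (for example, genes believed to be native to the genome), the row for `AAA` would play the role of the 10% / 15% / 40% / 35% figures on the slide.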
![Page 21: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/21.jpg)
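How to find foreign genes? Markov Models
Using the model

| Preceding 3 nt | A | C | G | T |
|---|---|---|---|---|
| AAA | 0.10 | 0.15 | 0.40 | 0.35 |
| AAC | 0.25 | 0.45 | 0.25 | 0.05 |
| AAG | 0.25 | 0.20 | 0.30 | 0.25 |
| AAT | 0.25 | 0.20 | 0.30 | 0.25 |
| ACA | 0.15 | 0.20 | 0.25 | 0.40 |
| . . . | | | | |
| TTG | 0.20 | 0.50 | 0.05 | 0.25 |
| TTT | 0.10 | 0.55 | 0.25 | 0.10 |

Candidate gene: AAAACAA…
First step (AAA followed by A): 0.10
3rd order Markov model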
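Scoring a candidate gene means walking along it and multiplying the table entries for each step; logarithms are summed instead to avoid underflow on long genes. The sketch below copies only the rows visible on the slide, so it can score only sequences whose contexts appear in those rows. The function name is an assumption, and the probability of the first three nucleotides is ignored.

```python
import math

# Rows copied from the slide's table; the contexts elided there (". . .")
# are deliberately not filled in.
MODEL = {
    "AAA": {"A": 0.10, "C": 0.15, "G": 0.40, "T": 0.35},
    "AAC": {"A": 0.25, "C": 0.45, "G": 0.25, "T": 0.05},
    "AAG": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    "AAT": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    "ACA": {"A": 0.15, "C": 0.20, "G": 0.25, "T": 0.40},
    "TTG": {"A": 0.20, "C": 0.50, "G": 0.05, "T": 0.25},
    "TTT": {"A": 0.10, "C": 0.55, "G": 0.25, "T": 0.10},
}

def log_likelihood(seq, model, order=3):
    """Sum of log P(nucleotide | preceding `order` nucleotides) along seq.

    Raises KeyError if a context is missing from the model.
    """
    total = 0.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        total += math.log(model[context][seq[i]])
    return total

# The candidate prefix from the slide: steps are
# AAA->A (0.10), AAA->C (0.15), AAC->A (0.25), ACA->A (0.15).
score = log_likelihood("AAAACAA", MODEL)
print(score, math.exp(score))
```

A gene that scores much lower under the host genome's model than typical host genes do is a candidate for foreign origin; how much lower counts as significant is not specified in the slides.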
![Page 22: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/22.jpg)
How to find foreign genes? Markov Models
Analyze sequence model
Compare test sequence to model
Produce new sequence per model
![Page 23: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/23.jpg)
How to find foreign genes? Markov Models
Analyze sequence model
Produce new sequence per model
Take a test run through Hamlet.pl
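The contents of Hamlet.pl are not shown in the slides. As a rough illustration of the "produce new sequence per model" step only, here is a Python sketch that learns which character follows each 3-character context in a training text and then emits new text by sampling from those followers. The function names and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict

def build_followers(text, order=3):
    """Map each `order`-character context to the list of characters
    that follow it in the text (repeats kept, so sampling is weighted)."""
    followers = defaultdict(list)
    for i in range(len(text) - order):
        followers[text[i:i + order]].append(text[i + order])
    return followers

def generate(text, length, order=3, seed=0):
    """Produce up to `length` characters that follow the training text's
    order-`order` statistics, starting from its first `order` characters."""
    rng = random.Random(seed)
    followers = build_followers(text, order)
    out = text[:order]
    while len(out) < length:
        choices = followers.get(out[-order:])
        if not choices:  # context never seen with a successor: stop early
            break
        out += rng.choice(choices)
    return out

training = "to be or not to be that is the question "
print(generate(training, 60))
```

The same two functions work unchanged on a DNA string, since they treat the input as plain characters.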
![Page 24: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/24.jpg)
Scenario 8: Gene Identification
How do you tell if an ORF is real?
Genetic Code

| Second base U | Second base C | Second base A | Second base G |
|---|---|---|---|
| UUU Phe | UCU Ser | UAU Tyr | UGU Cys |
| UUC Phe | UCC Ser | UAC Tyr | UGC Cys |
| UUA Leu | UCA Ser | UAA ochre (stop) | UGA opal (stop) |
| UUG Leu | UCG Ser | UAG amber (stop) | UGG Trp |
| CUU Leu | CCU Pro | CAU His | CGU Arg |
| CUC Leu | CCC Pro | CAC His | CGC Arg |
| CUA Leu | CCA Pro | CAA Gln | CGA Arg |
| CUG Leu | CCG Pro | CAG Gln | CGG Arg |
| AUU Ile | ACU Thr | AAU Asn | AGU Ser |
| AUC Ile | ACC Thr | AAC Asn | AGC Ser |
| AUA Ile | ACA Thr | AAA Lys | AGA Arg |
| AUG Met | ACG Thr | AAG Lys | AGG Arg |
| GUU Val | GCU Ala | GAU Asp | GGU Gly |
| GUC Val | GCC Ala | GAC Asp | GGC Gly |
| GUA Val | GCA Ala | GAA Glu | GGA Gly |
| GUG Val | GCG Ala | GAG Glu | GGG Gly |

The code is degenerate
Are codons equally used?
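Whether synonymous codons are used equally can be checked by counting. The sketch below reads an ORF in frame and reports, for each codon seen, its share among the codons for the same amino acid. The function name is hypothetical; the table is the standard genetic code written in DNA letters (T in place of U), and codons containing ambiguous bases such as N are not handled.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_usage(orf):
    """Return {codon: fraction of that codon among all codons in the ORF
    that encode the same amino acid}. Trailing partial codons are ignored."""
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3
    counts = Counter(orf[i:i + 3] for i in range(0, usable, 3))
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {codon: n / per_amino_acid[CODON_TABLE[codon]]
            for codon, n in counts.items()}

# Toy ORF: Met, then arginine encoded as CGG, CGG, CGT.
print(synonymous_usage("ATGCGGCGGCGT"))
```

If codons were used equally, each of the six arginine codons would approach a share of 1/6 over a large gene set; strong departures from that are the codon bias the slides refer to.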
![Page 25: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/25.jpg)
Scenario 8: Gene Identification
How do you tell if an ORF is real?
Genetic Code (human) (same codon table as on the previous slide)
The third position is biased
Most frequently used codons
![Page 26: Welcome to Introduction to BioinformaticS Intro to Scenario 8 Identification of genes of foreign origin](https://reader035.vdocuments.us/reader035/viewer/2022062500/5697bffb1a28abf838cc0c36/html5/thumbnails/26.jpg)
Scenario 8: Gene Identification
How do you tell if an ORF is real?
ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT
Genetic Code (human) (same codon table as on the previous slides)
Most frequently used codons
PSSM for third position?
If AT → G
If CG → G C A T
Third order Markov Chain
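The idea of predicting the third codon position from the first two can be explored by tabulating, for each two-nucleotide prefix, which third nucleotides occur. This is a minimal sketch run on the example ORF from the slide; the function name is an assumption, and a single 12-codon ORF is far too small to say anything about real usage.

```python
from collections import Counter, defaultdict

def third_position_counts(orf):
    """Return {first two codon positions: Counter of third-position
    nucleotides}, reading the ORF in frame from its first nucleotide."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    table = defaultdict(Counter)
    for i in range(0, usable, 3):
        codon = orf[i:i + 3]
        table[codon[:2]][codon[2]] += 1
    return table

# Example ORF from the slide (12 codons).
orf = "ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT"
table = third_position_counts(orf)
for prefix in sorted(table):
    print(prefix, dict(table[prefix]))
```

Built from a large set of trusted genes, such a table gives the expected third-position preferences; an ORF whose third positions fit them poorly is less likely to be a real (or native) gene.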