Welcome to Introduction to Bioinformatics
• Intro to Scenario 8: Identification of genes of foreign origin
Scenario 1: Comparison of genomes of pathogenic and nonpathogenic E. coli
E. coli K12 E. coli O157:H7
E. coli: What makes it kill?
Gene finder
TCTACTTATA TTCAATCCAC AGGGCTACACAAGAGTCTGT TGAATGAACA CATACATGGTTTCTGTCTGC TCTGACCTCT GGCAGCTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCGTAAAC CTCTAACATG ATGTCAGCAA TGAATAAACT TTGTTAAAGG TACAAATGAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT AAACCTGTAT GGTTACATGA ACTGCCTAAA TTATATATTT TAAGAAATTA ATTGCAATTA CCCCAGCTGT CATTAAAAAG AGGCAAATAC GACAGCACTG ACCCTCAAGA AGGCACCGGC GCTGAAATTC CGCTGAGAGC AGAGTGGTAC CCCTGCACCA GGTCTTTCCT GTGGGCACTG ATGAATGACT GAACGAACGA TTGAATGAAA
BLAST + parser
Pathogen-specific genes (~1000!)
Pathogenesis-specific genes?
E. coli: What makes it kill?
DNA Pathogen
Nonpathogen
Virus-related genes
Disease-related genes
Pathogenicity Islands (PAI)
E. coli: What makes it kill?
(Same pipeline as on the earlier slide: Gene finder on each genome, then BLAST + parser)
Pathogen-specific genes (~1000!)
Pathogenesis-specific genes?
Foreign genes
How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias
How to find foreign genes?
Current methods
• % GC
0.29274613 0.27824858 0.32350427 0.321013   0.28214577
0.24678363 0.3089947  0.32791328 0.2984127  0.32879817
0.2797619  0.3152174  0.46096095 0.31343284 0.34343433
0.32916668 0.37955555 0.26495728 0.28431374 0.32193732
0.29405162 0.26300985 0.3646139  0.29989868 0.33516484
0.36720142 0.3510848  0.31604227 0.2984234  0.23024055
0.34567901 0.34285715 0.38206628 0.33838382 0.32380953
0.28865057 0.33333334 0.2717087  0.33004925 0.31546623
0.29333332 0.3200222  0.3412162  0.32882375 0.29943502
0.35947713 0.32882884 0.34351662 0.26504064 0.32506204
0.39583334 0.29283488 0.47674417 0.296      0.30749354
0.3116883  0.2754821  0.30488145 0.36578172 0.28992248
0.6219512  0.37037036 0.29738563 0.42553192 0.36120042
0.26923078 0.3197176  0.27430555 0.32748538 0.32178217
0.28615862 0.30438185 0.272578   0.32188296 0.35273737
0.28374657 0.3477868  0.31501833 0.31860465 0.2952816
0.2905983  0.3215859  0.3195021  0.29113925 0.3432836
0.30982906 0.34751773 0.35598704 0.30678466 0.32900432
0.29574862 0.30287206 0.3068182  0.3878205  0.3469388
0.25382262 0.34519956 0.3283208  0.26345214 0.2958245...
%GC of genes of Prochlorococcus
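One way to act on a table like this is to compute each gene's GC fraction and flag the genes that sit far from the genome-wide mean. The sketch below is only an illustration: the function names and the cutoff of two standard deviations are assumptions, not part of the lecture.

```python
from statistics import mean, stdev

def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T characters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    return (seq.count("G") + seq.count("C")) / acgt

def flag_gc_outliers(gc_values, z_cutoff=2.0):
    """Return (index, value) pairs lying more than z_cutoff sample
    standard deviations from the mean. The cutoff is an assumed,
    tunable threshold, not a value given in the slides."""
    mu, sigma = mean(gc_values), stdev(gc_values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(gc_values)
            if abs(v - mu) / sigma > z_cutoff]

# First three rows of the slide's table, used here as example input.
values = [0.29274613, 0.27824858, 0.32350427, 0.321013, 0.28214577,
          0.24678363, 0.3089947, 0.32791328, 0.2984127, 0.32879817,
          0.2797619, 0.3152174, 0.46096095, 0.31343284, 0.34343433]
outliers = flag_gc_outliers(values)
print(outliers)
```

A gene flagged this way is only a candidate: an unusual GC fraction can have causes other than foreign origin, so the result is a starting point for the other methods on the list.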
How to find foreign genes? Dinucleotide frequencies
Gene from Prochlorococcus marinus MED4
ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATTCAATTAGTTAGTAAAGCAGTTGCTTCAAGACCTACGCATCCTATCCTTGCAAATTTACTTCTAACAGCTGATCAAGGTACTAATAAAATTAGTTTAACTGGATTTGATTTGAATCTAGGAATACAAACTTCATTTGATGCAACTGTAAACAAAAGTGGAGCAATTACAATTCCATCTAAACTTTTATCTGAAATAGTTAATAAACTACCAAGCGAAACTCCTGTCTCTCTTGATGTTGATGAGAGTTCTGACAATATTTTAATTAAAAGTGATAGGGGTTCTTTTAATATTAAAGGTATTCCATCAGACGATTACCCAAGCTTACCGTTTGTAGAAAGTGGTACATCTTTGAATATTGATCCAAGTTCTTTTTTAAAAGCTTTAAAATTAACTATATTCGCTAGTAGTAGTGATGATTCAAAGCAATTACTCACAGGAGTAAATTTTACATTTAATTTAAAATATTTGGAGTCAGCTGCAACAGATGGGCATAGATTGGCTGTTGTTTTGGTTGATAACAAAGAAAATTTTGATGAAAAAGAAGATTTTGCTTCAAATGAAGAAAACTTATCAGTTACTATACCAACAAGATCTTTAAGAGAAATTGAAAAGCTTGTTAGCCTTAGAAGTTCTGAAAATTCAATTAAACTTTTCTATGACAAAGGTCAAGTAGTATTTATTTCCTCTAATCAAATAATTACTACTAGAACCCTTGAAGGTTCTTATCCAAATTATTCTCAATTAATACCTGATAATTTTACTAAAATTTTTACATTTAATACAAAAAAAATAATCGAATCACTTGAAAGAATAGCAGTTTTAGCAGACCAACAAAGTAGTGTCGTTAAGATTAAACTTAATGAAAAGGATTTAGCATTAGTCAGTGCTGATGCTCAAGACATAGGGAATGCCAGCGAATTAGTTCCTGTATCTTTTTATTTTGATCAATTTGATATAGCTTTTAATGTAAGGTATTTATTAGAAGGTTTAAAAGTTATATCAAGTGAAAATGTAATTTTTAAATGTAATCTTCCAACTACTCCAGCTGTTTTAGTTCCAGAAGATAATATTAATTCTTTTACGTATTTAGTCATGCCTGTTCAAGTCCGTTCTTAA
Dinucleotide counts (row: first nucleotide, column: second nucleotide)
     A    C    G    T
A  167   47   80  116
C   61   26   10   65
G   65   30   23   59
T  117   59   64  168
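ρ*XY = f*XY / (f*X f*Y)
(f*XY: frequency of dinucleotide XY; f*X, f*Y: frequencies of nucleotides X and Y. A value near 1 means XY occurs about as often as expected by chance.)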
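As a worked illustration of the ratio, the counts in the table can be turned into odds ratios directly. This is a sketch under two assumptions: that table rows give the first nucleotide of each dinucleotide, and that nucleotide frequencies can be taken from the table's row and column totals rather than from a separate count over the gene. The variable names are made up for the example.

```python
BASES = "ACGT"

# COUNTS[x][y] = number of times dinucleotide xy occurs (from the slide's table,
# assuming rows are the first nucleotide and columns the second).
COUNTS = {
    "A": {"A": 167, "C": 47, "G": 80, "T": 116},
    "C": {"A": 61, "C": 26, "G": 10, "T": 65},
    "G": {"A": 65, "C": 30, "G": 23, "T": 59},
    "T": {"A": 117, "C": 59, "G": 64, "T": 168},
}

total = sum(sum(row.values()) for row in COUNTS.values())

# Nucleotide frequencies estimated from the table margins (an approximation).
f_first = {x: sum(COUNTS[x].values()) / total for x in BASES}
f_second = {y: sum(COUNTS[x][y] for x in BASES) / total for y in BASES}

# Odds ratio for every dinucleotide: observed frequency / expected frequency.
rho = {
    x + y: (COUNTS[x][y] / total) / (f_first[x] * f_second[y])
    for x in BASES
    for y in BASES
}

for dinucleotide in ("AA", "CG", "GC"):
    print(dinucleotide, round(rho[dinucleotide], 3))
```

Under these assumptions CG comes out well below 1 (under-represented) and AA somewhat above 1, which is the kind of signature the method compares between genes and genomes.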
How to find foreign genes? Dinucleotide frequencies
Study Question 3: Calculate ρ*AA in the following 50-nt sequence:
TGATGACAGTCGATTTTTCGGTAGGATAACTGCCATGCCTCTCAAAGTAC
ρ*XY = f*XY / (f*X f*Y)
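A small helper can be used to check a hand calculation for this question. It is a sketch, not the course's reference solution: the function names are invented, and the `symmetrize` switch reflects an assumption that the starred frequencies may be meant to pool the sequence with its reverse complement; the slides do not say which convention applies.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def rho(seq, xy, symmetrize=False):
    """Odds ratio f_XY / (f_X * f_Y) for the dinucleotide xy.

    With symmetrize=True, counts are pooled over the sequence and its
    reverse complement (an assumed reading of the starred notation).
    """
    strands = [seq.upper()]
    if symmetrize:
        strands.append(reverse_complement(strands[0]))
    mono, di = Counter(), Counter()
    for strand in strands:
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    f_x, f_y = mono[xy[0]] / n_mono, mono[xy[1]] / n_mono
    if f_x == 0 or f_y == 0:
        raise ValueError("nucleotide in %r is absent from the sequence" % xy)
    return (di[xy] / n_di) / (f_x * f_y)

question_seq = "TGATGACAGTCGATTTTTCGGTAGGATAACTGCCATGCCTCTCAAAGTAC"
single_strand = rho(question_seq, "AA")
both_strands = rho(question_seq, "AA", symmetrize=True)
print(single_strand, both_strands)
```

Overlapping dinucleotides are counted (AAA contributes two AA), which is the usual convention and the one the count table on the earlier slide appears to follow.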
How to find foreign genes? Dinucleotide frequencies
δ*(f, g) = (1/16) Σ | ρ*XY(f) - ρ*XY(g) |, summed over all 16 dinucleotides XY
Calculate δ*(human,mouse)?
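The distance can be sketched in a few lines once the 16 odds ratios are available for each sequence. The code below is illustrative only: the function names are assumptions, it uses single-strand counts, and the two short sequences are invented placeholders, since the slides do not include human or mouse data.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]

def rho_profile(seq):
    """Map each of the 16 dinucleotides XY to f_XY / (f_X * f_Y).

    Dinucleotides whose expected frequency is zero are assigned 0.0,
    an arbitrary choice made so that short examples do not fail.
    """
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    profile = {}
    for xy in DINUCLEOTIDES:
        expected = (mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono)
        profile[xy] = (di[xy] / n_di) / expected if expected > 0 else 0.0
    return profile

def delta(seq_f, seq_g):
    """Average absolute difference between the two odds-ratio profiles."""
    pf, pg = rho_profile(seq_f), rho_profile(seq_g)
    return sum(abs(pf[xy] - pg[xy]) for xy in DINUCLEOTIDES) / 16

# Placeholder sequences, not real genomic data.
seq_a = "ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATT"
seq_b = "TCTACTTATATTCAATCCACAGGGCTACACAAGAGTCTGT"
print(delta(seq_a, seq_b))
```

For real genomes the profiles would be computed over long stretches of sequence; on inputs as short as these placeholders the ratios are too noisy to interpret.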
How to find foreign genes?
Cholera toxin locus 100 kbases
How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias
These methods work well for large blocks of nucleotides
They work less well for individual genes
A more information-dense indicator is needed
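How to find foreign genes? Markov Models
Building the model
Training set: for each three-nucleotide context (AAA AAC AAG AAT ACA . . . TTG TTT), record how often each nucleotide follows it
Context AAA:
AAAA: 10%
AAAC: 15%
AAAG: 40%
AAAT: 35%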
How to find foreign genes? Markov Models
Building the model
Training set contexts: AAA AAC AAG AAT ACA . . . TTG TTT
Context AAC:
AACA: 25%
AACC: 45%
AACG: 25%
AACT: 5%
How to find foreign genes? Markov Models
Using the model
        A     C     G     T
AAA   0.10  0.15  0.40  0.35
AAC   0.25  0.45  0.25  0.05
AAG   0.25  0.20  0.30  0.25
AAT   0.25  0.20  0.30  0.25
ACA   0.15  0.20  0.25  0.40
. . .
TTG   0.20  0.50  0.05  0.25
TTT   0.10  0.55  0.25  0.10
Candidate gene: AAAACAA…
0.10 (probability that A follows AAA, the first step in the candidate)
3rd order Markov model
• Analyze sequence to build a model
• Compare test sequence to model
• Produce new sequence per model
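The first two uses, building a model from a training set and comparing a test sequence to it, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the pseudocount, and the choice to score with a summed log-probability and to treat unseen contexts as uniform are all mine, not taken from the slides.

```python
import math
from collections import defaultdict

def train_markov(training_seqs, order=3, pseudocount=1.0):
    """Estimate P(next nucleotide | previous `order` nucleotides)."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for seq in training_seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            if set(context + nxt) <= set("ACGT"):  # skip ambiguous bases
                counts[context][nxt] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + 4 * pseudocount
        model[context] = {b: (row[b] + pseudocount) / total for b in "ACGT"}
    return model

def log_likelihood(seq, model, order=3):
    """Sum of log P(next | context) along seq.

    Contexts or symbols missing from the model are scored as 0.25
    (uniform), which is an assumption made for this sketch.
    """
    seq = seq.upper()
    total = 0.0
    for i in range(len(seq) - order):
        row = model.get(seq[i:i + order])
        p = row.get(seq[i + order], 0.25) if row else 0.25
        total += math.log(p)
    return total

# Rows copied from the slide's table, enough to score the candidate AAAACAA.
SLIDE_MODEL = {
    "AAA": {"A": 0.10, "C": 0.15, "G": 0.40, "T": 0.35},
    "AAC": {"A": 0.25, "C": 0.45, "G": 0.25, "T": 0.05},
    "ACA": {"A": 0.15, "C": 0.20, "G": 0.25, "T": 0.40},
}
score = log_likelihood("AAAACAA", SLIDE_MODEL)
print(score)
```

The candidate's four steps use the probabilities 0.10, 0.15, 0.25, and 0.15 from the table. In practice a score like this is usually divided by sequence length and compared with the scores of genes known to be native, since the raw value depends on how long the candidate is.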
How to find foreign genes? Markov Models
• Analyze sequence to build a model
• Produce new sequence per model
How to find foreign genes? Markov Models
Take a test run through Hamlet.pl
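Hamlet.pl itself is not reproduced in the slides, so the following is not that script. It is a hedged Python sketch of the two steps named on the previous slide, analyzing a sequence to build a model and producing a new sequence per the model, with invented function names and a toy training string.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each `order`-character context to the list of characters
    that follow it in the training text (repeats preserve frequency)."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length, rng=None):
    """Extend `seed` one character at a time by sampling a follower of
    the current context; stop early if the context was never seen.
    The model's order is taken to be len(seed)."""
    rng = rng or random.Random()
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:
            break
        out += rng.choice(followers)
    return out

# Toy training string; any text or nucleotide sequence works the same way.
training = "ACGTACGTACGT"
toy_model = build_model(training, order=3)
sample = generate(toy_model, "ACG", 10, random.Random(0))
print(sample)
```

Sampling from a list of observed followers is equivalent to sampling from the estimated conditional probabilities, which keeps the sketch short. Output generated this way shares the short-range statistics of the training text without copying it, which is why a model of this kind captures more about a genome's "style" than single-nucleotide composition does.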
Scenario 8: Gene Identification
How do you tell if an ORF is real?
Genetic Code
UUU Phe   UCU Ser   UAU Tyr     UGU Cys
UUC Phe   UCC Ser   UAC Tyr     UGC Cys
UUA Leu   UCA Ser   UAA ochre   UGA opal
UUG Leu   UCG Ser   UAG amber   UGG Trp
CUU Leu   CCU Pro   CAU His     CGU Arg
CUC Leu   CCC Pro   CAC His     CGC Arg
CUA Leu   CCA Pro   CAA Gln     CGA Arg
CUG Leu   CCG Pro   CAG Gln     CGG Arg
AUU Ile   ACU Thr   AAU Asn     AGU Ser
AUC Ile   ACC Thr   AAC Asn     AGC Ser
AUA Ile   ACA Thr   AAA Lys     AGA Arg
AUG Met   ACG Thr   AAG Lys     AGG Arg
GUU Val   GCU Ala   GAU Asp     GGU Gly
GUC Val   GCC Ala   GAC Asp     GGC Gly
GUA Val   GCA Ala   GAA Glu     GGA Gly
GUG Val   GCG Ala   GAG Glu     GGG Gly
The code is degenerate
Are codons equally used?
Scenario 8: Gene Identification
How do you tell if an ORF is real?
Genetic Code (human)
(codon table as on the previous slide)
The third position is biased
Most frequently used codons
Scenario 8: Gene Identification
How do you tell if an ORF is real?
ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT
Genetic Code (human)
(codon table as on the first Gene Identification slide)
Most frequently used codons
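To see which codons a sequence favors, one can count codons in frame and express each count as a share of all codons for the same amino acid. The sketch below is an illustration with assumed names; the compact table string encodes the standard genetic code shown on the slides, written with T in place of U and `*` for the three stop codons.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order (first base slowest, third fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(orf):
    """Relative usage of each observed codon among its synonymous codons.

    Reads the ORF in frame from its first base; a trailing partial
    codon and codons with non-ACGT characters are ignored.
    """
    orf = orf.upper().replace("U", "T")
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {codon: n / per_amino_acid[CODON_TABLE[codon]]
            for codon, n in counts.items()}

# The ORF shown on this slide.
slide_orf = "ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT"
usage = codon_usage(slide_orf)
for codon, share in sorted(usage.items()):
    print(codon, CODON_TABLE[codon], round(share, 2))
```

A 12-codon example is far too short to show real bias; applied to all known genes of an organism, the same tally gives the "most frequently used codons" against which a candidate ORF can be compared.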
PSSM for third position?
If AT → G
If CG → G C A T
Third order Markov Chain
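The idea that the third codon position depends on the two bases before it can be tabulated directly. The sketch below, with assumed names, counts third-position nucleotides for each two-base codon prefix in a set of in-frame ORFs; it is a simplified stand-in for the Markov chain named on the slide, not a full implementation of it.

```python
from collections import Counter, defaultdict

def third_position_table(orfs):
    """For each codon prefix (positions 1 and 2), the observed
    distribution of the nucleotide at position 3."""
    counts = defaultdict(Counter)
    for orf in orfs:
        orf = orf.upper()
        for i in range(0, len(orf) - 2, 3):
            codon = orf[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip ambiguous bases
                counts[codon[:2]][codon[2]] += 1
    return {
        prefix: {b: c[b] / sum(c.values()) for b in "ACGT"}
        for prefix, c in counts.items()
    }

# ORF from the preceding slide, used as a tiny example.
table = third_position_table(["ATGCGGTGGGCCCAACCACATCGTGGGCAGTCCCTT"])
for prefix in ("AT", "CG", "CA"):
    print(prefix, table[prefix])
```

In this one ORF, AT is always followed by G, while CG is followed by G or T. Estimated from many real genes, a table like this lets a candidate ORF be scored by how typical its third positions are for the organism.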