gene prediction: preliminary results
![Page 1: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/1.jpg)
Gene Prediction:
Preliminary Results
Erin Cook
Paul Cooper
Kristen Knipe
Shaupu Qin
Vani Rajan
Shrutii Sarda
Tianjun Ye
![Page 2: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/2.jpg)
Outline
Homology based Gene Prediction
Ab initio based Gene Prediction
RNA Prediction
2
![Page 3: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/3.jpg)
Homology based Gene
Prediction
Erin Cook & Shrutii Sarda
![Page 4: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/4.jpg)
Follow-up: Overlapping Genes
Do we expect gene overlap in H. haemolyticus?
(aka is there gene overlap in H. influenzae
and/or other Haemophilus spp.?)
• Fukuda et al.: 257 overlapping gene pairs
• Palleja et al., BMC Genomics 2008:
– 338 fully sequenced prokaryotic genomes from
STRING: 17% of all genes; some (~1%?) due to
misannotation
Answer: YES
4
![Page 5: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/5.jpg)
Follow-up: Plasmids of the
Haemophilus genus
• Range from 3-30 Mdal in size
• Mainly in type b strains of H. influenzae; impart
ampicillin resistance
• Have been found to be associated with Tn2 – a
transposable element
• Tend to only have partial homology (~48-50%
similarity) with plasmids of related species
• Recent isolation of cryptic plasmids from H.
somnus strains (1-5 kb in length)
5
![Page 6: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/6.jpg)
Follow-up: % coding/non-coding regions of DNA

| Species | Tax. ID | Acc. ID | DNA length (bp) | Protein count | Gene count | Total length of genes (bp) | Non-coding DNA (%) |
|---|---|---|---|---|---|---|---|
| H. influenzae Rd KW20 | 71421 | NC_000907.1 | 1830138 | 1657 | 1789 | 1345305 | 26.5 |
| H. parasuis SH0165 | 557723 | NC_011852.1 | 2269156 | 2021 | 2299 | 1803054 | 20.6 |
| H. somnus 2336 | 228400 | NC_010519.1 | 2263857 | 1980 | 2065 | 1977672 | 12.7 |
| H. ducreyi 35000HP | 233412 | NC_002940.2 | 1698955 | 1717 | 1838 | 1446108 | 14.9 |
| H. influenzae 86-028NP | 281310 | NC_007146.2 | 1914490 | 1792 | 1899 | 1661757 | 13.3 |
| H. influenzae F3031 | 866630 | NC_014920.1 | 1985832 | 1770 | 1892 | 1673628 | 15.8 |
| H. influenzae F3047 | 935897 | NC_014922.1 | 2007018 | 1786 | 1896 | 1698588 | 15.4 |
| H. influenzae PittEE | 374930 | NC_009566.1 | 1813033 | 1613 | 1689 | 1446339 | 20.3 |
| H. influenzae PittGG | 374931 | NC_009567.1 | 1887192 | 1661 | 1735 | 1422240 | 24.7 |
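The last column of this table appears to follow from the two length columns. A minimal Python sketch, assuming non-coding % = 100 x (1 - total gene length / DNA length) and that genes do not overlap; the function name is illustrative, and the slide's figures can differ by 0.1 in the last digit, which may reflect a different rounding rule:

```python
def percent_noncoding(dna_length_bp, total_gene_length_bp):
    """Share of the genome not covered by annotated genes, in percent.

    Assumption: genes do not overlap, so their summed length equals the
    number of coding bases.
    """
    if dna_length_bp <= 0:
        raise ValueError("dna_length_bp must be positive")
    return 100.0 * (1.0 - total_gene_length_bp / dna_length_bp)


# Values copied from the table rows for H. influenzae Rd KW20 and H. ducreyi 35000HP.
rd_kw20 = percent_noncoding(1830138, 1345305)
ducreyi = percent_noncoding(1698955, 1446108)
print(round(rd_kw20, 1), round(ducreyi, 1))
```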
6
![Page 7: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/7.jpg)
BLAST overview
1. Get all k-mers from query sequence
1a. Generate all possible k-mers with score > T
2. Find all hits of k-mers in database using FSA
3. Find close hits (two-hit method)
4. Extend 2nd hit ungapped
5. Dynamic programming for gapped alignment
Branch labels in the diagram: aa, nt
Zvelebil and Baum. Understanding Bioinformatics. 2008
7
![Page 8: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/8.jpg)
1. Get all k-mers from query sequence
1a. Generate all possible k-mers with score > T
BLOSUM62, T=11 in gapped BLAST
http://www-users.math.umd.edu/~poorani/sampletalk/talk.html
Pertsemlidis and Fondon, Genome Biol. 2001.
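Step 1a can be sketched in a few lines of Python. This is an illustration, not BLAST's implementation: the alphabet is restricted to C, H and Y to keep the enumeration tiny, the six scores are BLOSUM62 entries as commonly tabulated (worth checking against an authoritative copy), the threshold T=18 is a hypothetical value chosen for this toy alphabet, and the comparison uses >= T (the slide writes "> T").

```python
from itertools import product

# Subset of BLOSUM62 for the residues C, H, Y (symmetric, stored once).
SCORES = {
    ("C", "C"): 9, ("H", "H"): 8, ("Y", "Y"): 7,
    ("H", "Y"): 2, ("C", "H"): -3, ("C", "Y"): -2,
}


def pair_score(a, b):
    """Look up a substitution score regardless of argument order."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def neighborhood(word, alphabet, threshold):
    """All words of the same length scoring >= threshold against `word`."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    # Best-scoring words first, ties broken alphabetically.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


print(neighborhood("CHH", "CHY", threshold=18))
```

With these assumptions the neighborhood of the query word CHH is CHH, CHY and CYH, the same three words used in the FSA example on the next slide.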
![Page 9: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/9.jpg)
2. Find all hits of k-mers in DB using FSA (finite-state automaton)
Zvelebil and Baum. Understanding Bioinformatics. 2008
Example FSA to recognize CHH, CHY, or CYH
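A small Python sketch of such a recognizer, assuming a trie of the words that is advanced one database character at a time. BLAST's own automaton is deterministic and more compact; the function names and the test string here are hypothetical.

```python
def build_trie(words):
    """Return (transitions, accepting) for a trie over `words`."""
    transitions = [{}]   # state index -> {character: next state}
    accepting = {}       # accepting state -> recognized word
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting[state] = word
    return transitions, accepting


def scan(text, words):
    """Report (start position, word) for every occurrence of a word in text."""
    transitions, accepting = build_trie(words)
    active = []          # partially matched states carried between characters
    hits = []
    for i, ch in enumerate(text):
        active.append(0)  # a new match may start at every position
        survivors = []
        for state in active:
            nxt = transitions[state].get(ch)
            if nxt is None:
                continue
            if nxt in accepting:
                word = accepting[nxt]
                hits.append((i - len(word) + 1, word))
            survivors.append(nxt)
        active = survivors
    return hits


print(scan("ACHHYCYHCHY", ["CHH", "CHY", "CYH"]))
```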
9
![Page 10: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/10.jpg)
3. Find close hits (two-hit method)
+: T=13
• : T=11
Altschul et al. Nuc Ac Res 1997
10
![Page 11: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/11.jpg)
4. Extend 2nd hit ungapped
Parameters: Xu and Sg
• The extension score is monitored; if it drops
below (max - Xu), extension stops
• If the extension score < Sg, the extension is
discarded
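The two rules above can be sketched in Python as follows. This is a simplified illustration that extends in one direction only and scores with +1/-1 for match/mismatch (an assumption; BLAST uses a substitution matrix for proteins). The names `extend_right`, `x_drop` (for Xu) and `s_g` (for Sg) are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right with an X-drop stopping rule.

    Returns (best score, length of the extension that achieved it).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score fell below (max - Xu): stop extending
    return best, best_len


def keep_extension(score, s_g):
    """An extension scoring below Sg is discarded."""
    return score >= s_g


best, length = extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=2)
print(best, length, keep_extension(best, s_g=5))
```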
11
![Page 12: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/12.jpg)
5. Dynamic programming for gapped alignment
Smith-Waterman (not what BLAST uses, but illustrative)
Seq. 1: GCCCTAGCG
Seq. 2: GCGCAATG
http://www.ibm.com/developerworks/java/library/j-seqalign/index.html
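For reference, a compact Smith-Waterman sketch in Python. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions for illustration and are not taken from the slide or the linked article, so the score it prints for the two slide sequences depends on that choice.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The two sequences from the slide.
print(smith_waterman("GCCCTAGCG", "GCGCAATG"))
```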
12
![Page 13: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/13.jpg)
BLAST output
• Alignment Length: total length of alignment reported (including gaps)
• % Identity: # identical nt's or aa's / alignment length
– aa alignments – positives: # subst's with '+' score in substitution matrix / alignment length
• Bit Score: calculated from quality of alignment (gaps, substitutions, etc.)
• e-Value: # of seq's with similar score expected to occur in db by chance
• Coordinates: start and stop on both query and db
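Two of these quantities are easy to reproduce by hand. The Python sketch below computes % identity from a gapped alignment and converts a bit score to an expected number of chance hits using the standard Karlin-Altschul relation E = m * n * 2^(-S'), where m and n are the (effective) query and database lengths. The function names are illustrative, and BLAST applies length corrections that this sketch ignores.

```python
def percent_identity(aligned_query, aligned_subject):
    """Identical columns divided by alignment length (gaps included)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    identical = sum(1 for q, s in zip(aligned_query, aligned_subject)
                    if q == s and q != "-")
    return 100.0 * identical / len(aligned_query)


def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


print(percent_identity("ACG-T", "ACGGT"))   # 4 identical columns out of 5
print(expected_hits(20, 1024, 1024))
```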
13
![Page 14: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/14.jpg)
QueryID
SubjID
Alignment vis.
E-val
Bit Score
%ID
14
![Page 15: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/15.jpg)
Preliminary Analysis:
M21127 454LargeContigs.fna
15
![Page 16: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/16.jpg)
Major Types of BLAST

| Variant | Query Sequence Type | Database Sequence Type |
|---|---|---|
| *blastn | Nucleotide | Nucleotide |
| *blastp | Protein | Protein |
| *blastx | Nucleotide translated to protein | Protein |
| *tblastn | Protein | Nucleotide translated to protein |
| tblastx | Nucleotide translated to protein | Nucleotide translated to protein |

* Types we will be using in our analysis
• Queries in FASTA format
• Databases built with "formatdb" from FASTA files
16
![Page 17: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/17.jpg)
Why are we using these BLAST variants?
• BLASTn – similar but not identical nt sequences
– not for finding homologous protein coding regions in other
organisms - because of the degeneracy of the genetic
code
– aa methods better for this
• BLASTp – most reliable, less conservative than BLASTn
• BLASTx - can provide strong evidence for the presence of a
homologous coding region, even between distantly related
genes
– is appropriate for use early in moderate and large scale
sequencing projects
• tBLASTn - useful for finding protein homologs in unannotated
nucleotide data
– especially suited to working with error prone data like draft
genomic sequences
17
![Page 18: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/18.jpg)
Pangenome/Panproteome
• Haemophilus somnus
– 129PT plasmid pHS129
– 2336
• Haemophilus ducreyi
– 35000HP
• Haemophilus parasuis
– SH0165
• Haemophilus influenzae
– Rd KW20
– 86-028NP
– PittEE
– PittGG
– F3031
– F3047
Combined files of gene/protein sequences from:
Panproteome: 16,003 sequences
Pangenome : 16,083 sequences
18
![Page 19: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/19.jpg)
BLAST Pipeline (part 1)
• blastx / tblastn: Contigs (nucleotide) vs. H. inf. prot and Panproteome (protein)
• blastp: ORFs (protein) vs. H. inf. prot and Panproteome (protein)
• blastn: ORFs and Contigs (nucleotide) vs. H. inf. genes (nucleotide)
→ Process, Filter, Compare
19
![Page 20: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/20.jpg)
BLAST pipeline
• Some things may have slipped through the
cracks!
– Conserved domains?
– Homologs in more distantly-related species?
– Not as confident, but can still give potentially-useful
predictions
20
![Page 21: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/21.jpg)
nr database (NCBI)
• All non-redundant sequences from:
– GenBank CDS translations: annotated collection of
conceptual translations of all publicly available protein-
coding nucleotide
– PDB: Sequences derived from 3-dimensional structure
from Brookhaven Protein Databank
– SwissProt: UniProtKB/Swiss-Prot; manually annotated,
reviewed
– PIR: Part of UniProt consortium
– PRF: Protein Research Foundation, in Japan
NCBI databases: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml
PDB: http://www.rcsb.org/pdb/home/home.do SwissProt: http://www.ebi.ac.uk/uniprot/
PIR: http://pir.georgetown.edu/ PRF: http://www.prf.or.jp/aboutdb-e.html
21
![Page 22: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/22.jpg)
Pfam database (Sanger Inst.)
• Collection of protein families (11,912 in release
24.0, Oct 2009)
• Domains – functional regions
• Conserved domains can indicate conserved
function
• Pfam-A: high quality, manually curated families
• Pfam-B: automatically-generated supplement
– Uses ADDA: Automatic Domain Decomposition
Algorithm
– Lower quality, but catch-all
pfam.sanger.ac.uk
22
![Page 23: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/23.jpg)
BLAST pipeline (part 2)
• “Lonely” ORFs (protein): ORFs with no hits, or only hits below threshold
• blastp against NR (protein) and Pfam (protein)
→ Process, Filter, Compare
→ Integrate with ab initio, RNA
23
![Page 24: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/24.jpg)
Nitty Gritty: what is involved?
1. Find ORFs (nt and prot): getorf
2. Format database: formatdb
3. Run BLAST in each direction: blastall
4. Filter on e-val, align. length, %ID: custom perl scripts
5. Find RBHs (reciprocal best hits): public perl scripts
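Step 5 can be sketched in Python as follows. This is not the perl script the group used; it assumes hits have already been filtered and reduced to (query, subject, bit score) tuples, and the ORF and gene identifiers are made up for the example.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, bits in hits:
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd, rev = best_hits(forward), best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hit tables: ORFs vs. reference proteins, and the reverse search.
forward = [("orf1", "HI_0001", 250.0), ("orf1", "HI_0002", 90.0),
           ("orf2", "HI_0002", 120.0), ("orf3", "HI_0002", 300.0)]
reverse = [("HI_0001", "orf1", 245.0), ("HI_0002", "orf3", 290.0),
           ("HI_0002", "orf2", 110.0)]
print(reciprocal_best_hits(forward, reverse))
```

Here orf2 is dropped: its best hit is HI_0002, but HI_0002's best hit is orf3.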
![Page 25: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/25.jpg)
ORF statistics (for prot; x3 for nt)
• Total: 75,590 ORFs
• Shortest: 10 residues
• Longest: 2,511 residues
• 1000+: 24
• 500-1000: 254
• 200-500: 972
• Avg size of proteins in panproteome: 304.90
25
![Page 26: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/26.jpg)
Initial Filtering
• Minimum alignment length: 24nt or 8aa
• Minimum fractional alignment
(aligned length / query length): 0.5
• Maximum e-value: 0.0001 (nt), 0.05 (aa)
Quite liberal, but will give good first-pass overview.
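The three thresholds translate directly into a predicate. A Python sketch using the values from this slide; the function and constant names are illustrative, and the inclusive comparisons (>= and <=) are an assumption, since the slide does not say how boundary values are treated:

```python
MIN_ALIGN_LEN = {"nt": 24, "aa": 8}      # minimum alignment length
MAX_EVALUE = {"nt": 1e-4, "aa": 0.05}    # maximum e-value
MIN_FRACTION = 0.5                        # aligned length / query length


def passes_initial_filter(align_len, query_len, evalue, seq_type):
    """Apply the initial filtering thresholds to one BLAST hit."""
    return (align_len >= MIN_ALIGN_LEN[seq_type]
            and align_len / query_len >= MIN_FRACTION
            and evalue <= MAX_EVALUE[seq_type])


print(passes_initial_filter(30, 50, 1e-6, "nt"))    # long enough, covers 60% of query
print(passes_initial_filter(30, 100, 1e-6, "nt"))   # covers only 30% of query
```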
26
![Page 27: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/27.jpg)
BLAST results (post-filter)
Analysis # hits # hits
Recip.
Best Hits
1373
1369
1495
1372
1336
1208
1467
5608
1496
12568
1141
4338
1489
12,286
1501
12,519
1439
3070
27
![Page 28: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/28.jpg)
Comparing/Processing BLAST results
• Many challenges!
• Lots of things to consider…
28
![Page 29: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/29.jpg)
Comparing/Processing Blast Results
• Overlapping ORFs
• Are the 1439 from hflu-refseq all present in full
panproteome blast?
– If so, take those 1439, add to “final” list, process the
other 1631
• Which ones do we trust as they are now?
• Which ones to run through nr and Pfam? What
criteria?
• Filter ORFs by codon usage? Other parameters?
• How to compare/combine blastx/tblastn with blastp?
• How to integrate with other groups?
• Other…
29
![Page 30: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/30.jpg)
Protein-coding Gene
Prediction by Ab initio
Kristen Knipe, Shaupu Qin & Tianjun Ye
![Page 31: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/31.jpg)
Ab Initio Gene Prediction Strategy
● Prediction: Use GENEMARKS,
GLIMMER, PRODIGAL to predict genes
across the whole genome.
● Filter: Filter out genes with
length > 10000 (possibly a bug in the
program)
● Merging: Merge the predicted results
● Validation: Use BLASTx to validate
the merged result.
31
![Page 32: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/32.jpg)
GeneMark.hmm and GeneMarkS
• Minimus2 output (Newbler and Mira)
• Ran GeneMark.hmm using:
• H. influenzae model
• H. influenzae 86 model
• H. ducreyi model
• Ran GeneMarkS
• H. haemolyticus model created
32
![Page 33: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/33.jpg)
Figure: Number of Genes Predicted by GeneMark.hmm and GeneMarkS
x-axis: Haemophilus Genome: M19107 (123), M19501 (22), M21127 (38), M21621 (28), M21639 (54), M21709 (37)
y-axis: Number of Genes (0 to 3000)
Series: GeneMark.hmm (H. ducreyi), GeneMark.hmm (H. influenzae 86), GeneMark.hmm (H. influenzae), GeneMarkS

| CDC ID | Species | Disease |
|---|---|---|
| M19107 | H. haemolyticus | Asymptomatic |
| M19501 | H. haemolyticus | Asymptomatic |
| M21127 | H. haemolyticus | Pathogenic |
| M21621 | H. haemolyticus | Pathogenic |
| M21639 | H. haemolyticus | Pathogenic |
| M21709 | H. influenzae | Pathogenic |
33
![Page 34: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/34.jpg)
Figure: Start Codon Usage
x-axis: Start Codon (ATG, GTG, TTG); y-axis: Percentage (0 to 1)
Series:
H. influenzae (38.2% GC)
H. influenzae 86 (38.2% GC)
M21709* (38.03% GC)
H. ducreyi (38.2% GC)
M19107* (38.7% GC)
M19501* (38.5% GC)
M21127* (38.6% GC)
M21621* (38.4% GC)
M21639* (38.6% GC)
*predicted by GeneMarkS
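Start codon usage of this kind can be tallied from predicted gene sequences with a short Python sketch. It assumes each gene is given 5' to 3' on its coding strand so that the first three bases are the start codon; the function name and input sequences are hypothetical.

```python
from collections import Counter


def start_codon_usage(gene_sequences):
    """Fraction of genes beginning with each start codon."""
    counts = Counter(seq[:3].upper() for seq in gene_sequences if len(seq) >= 3)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


usage = start_codon_usage(["ATGAAA", "atgCCC", "GTGAAA", "TTGAAA"])
print(usage)
```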
![Page 35: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/35.jpg)
Figure: Codon Usage Relative Frequencies (M19107)
x-axis: Codon and Amino Acid (Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Stop); y-axis: Percentage (0 to 1.2)
*Calculated by Acua Software
![Page 36: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/36.jpg)
Figure: Codon Usage Frequencies (M19107)
x-axis: Codon and Amino Acid (Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Stop); y-axis: Percentage (0 to 0.05)
*Calculated by Acua Software
![Page 37: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/37.jpg)
Prodigal Results on Three Assemblies of
M19107 (Newbler, Mira3, Minimus2)

| | Newbler | Mira3 | Minimus2 |
|---|---|---|---|
| Average Gene Length | 853.6782 | 804.1442 | 858.9455 |
| Total Gene Number | 1846 | 1969 | 1983 |
| GC Content | 0.6310 | 0.6335 | 0.6322 |
37
![Page 38: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/38.jpg)
Figure: Gene Length Distribution of Different Assemblies
x-axis: Gene Length (-500 to 8500); y-axis ticks: -50 to 250
Series: newbler, mira, minimus
38
![Page 39: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/39.jpg)
Three Prediction Software Results on
Minimus2 Assembly
(Prodigal, GMS, Glimmer3)

| | Prodigal | GMS | Glimmer3 |
|---|---|---|---|
| Average Gene Length | 858.9455 | 827.9541 | 894.5465 |
| Total Gene Number | 1983 | 2069 | 1945 |
39
![Page 40: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/40.jpg)
Figure: Gene Length Distribution of Minimus Assembly
x-axis: Gene Length (-200 to 9800); y-axis ticks: -50 to 250
Series: Prodigal, GMS, Glim3
40
![Page 41: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/41.jpg)
41
It's a good way to visualize gene
prediction results.
After integrating the results of the homology
search, we can easily see the difference
between the genes that match known
proteins and those that do not.
![Page 42: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/42.jpg)
An Extreme Example
Skovgaard et al. (2001)
![Page 43: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/43.jpg)
• Many short ORFs are
annotated as genes
Skovgaard et al. (2001)
43
![Page 44: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/44.jpg)
Ab Initio Gene Prediction Strategy
● Prediction: Use GENEMARKS,
GLIMMER, PRODIGAL to predict genes
across the whole genome.
● Filter: Filter out genes with
length > 10000 (possibly a bug in the
program)
● Merging: Merge the predicted results
● Validation: Use BLASTx to validate
the merged result.
44
![Page 45: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/45.jpg)
Prediction Result (After Filter)

| M19107 | Number of Genes |
|---|---|
| GENEMARKS | 2069 |
| GLIMMER | 1945 |
| PRODIGAL | 1983 |
45
![Page 46: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/46.jpg)
Merge Strategy
All Predicted Genes
• Level 1: High Confidence (HC): genes predicted in all 3 programs
• Level 2: Medium Confidence (MC): genes that appear in GeneMark and GLIMMER
• Level 3: Low Confidence (LC): genes predicted in only one program
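The three levels can be assigned mechanically once the predictions are comparable. A Python sketch under stated assumptions: each program's output has been reduced to a set of gene keys (for example contig, strand and stop coordinate, which is a hypothetical matching rule not given on the slide), and pairs of programs other than GeneMark plus GLIMMER, which the slide does not cover, are labeled "unassigned".

```python
def confidence_levels(predictions):
    """Assign HC / MC / LC to every gene key.

    predictions: dict mapping program name -> set of gene keys.
    """
    all_genes = set().union(*predictions.values())
    levels = {}
    for gene in all_genes:
        found_by = {prog for prog, genes in predictions.items() if gene in genes}
        if len(found_by) == 3:
            levels[gene] = "HC"            # predicted by all three programs
        elif found_by == {"genemarks", "glimmer"}:
            levels[gene] = "MC"            # GeneMark and GLIMMER agree
        elif len(found_by) == 1:
            levels[gene] = "LC"            # predicted by a single program
        else:
            levels[gene] = "unassigned"    # pair not covered by the slide
    return levels


# Hypothetical gene keys g1..g5.
predictions = {
    "genemarks": {"g1", "g2", "g3", "g5"},
    "glimmer": {"g1", "g2", "g4"},
    "prodigal": {"g1", "g5"},
}
print(sorted(confidence_levels(predictions).items()))
```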
46
![Page 47: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/47.jpg)
Merge Result

| M19107 | Number of Genes |
|---|---|
| High Confidence genes | 1058 |
| Medium Confidence genes | 87 |
| Low Confidence genes | 800 (glimmer) + 926 (genemarks) + 925 (prodigal) = 2651 |
47
![Page 48: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/48.jpg)
Validation Strategy
Run BLASTx on Merged Genes
• E value < e-20: Validated Genes with Different Confidence Level
• E value > e-20: False Positive / Pseudo-gene
![Page 49: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/49.jpg)
BLASTx SAMPLE (validated gene)
49
![Page 50: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/50.jpg)
BLASTx SAMPLE (validated gene)
50
![Page 51: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/51.jpg)
BLASTx SAMPLE(false positive/pseudo gene)
51
![Page 52: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/52.jpg)
Final Result: Currently Running
M19107 Confidence Level Number of Genes
Validated Genes
High Confidence
Medium Confidence
Low Confidence
False positive/pseudo gene
52
![Page 53: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/53.jpg)
Delivery
● All the BLASTx and BLASTn files
corresponding to validated genes
can be used for functional
analysis, multiple alignment, and
phylogenetic analysis.
53
![Page 54: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/54.jpg)
RNA Prediction
Paul Cooper & Vani Rajan
![Page 55: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/55.jpg)
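Stats on Haemophilus genus from Rfam

tRNA, rRNA and other families:

| Haemophilus strain | # families | # entries | tRNA | tmRNA | 6S | 5S rRNA | 23S rRNA | SSU rRNA 5 | SRP bact | S15 | RNaseP | GrpII Intron |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ducreyi 35000HP | 18 | 82 | 47 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae 86-028NP | 20 | 96 | 57 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae PittEE | 20 | 96 | 57 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae PittGG | 20 | 94 | 55 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae Rd KW20 | 20 | 96 | 56 | 1 | 1 | 7 | 6 | 6 | 1 | 2 | 1 | 0 |
| somnus 2336 | 18 | 80 | 47 | 1 | 1 | 6 | 5 | 5 | 1 | 1 | 1 | 1 |
| somnus 129PT | 19 | 80 | 47 | 1 | 1 | 6 | 5 | 5 | 1 | 1 | 1 | 0 |

sRNA families:

| Haemophilus strain | His leader | TPP riboswitch | FMN riboswitch | Sxy | Alpha Op. RBS | Glycine Ribo | PreQ1 | GcvB | Lr-Pk1? | Lysine | Moco ribo | Thr Leader |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ducreyi 35000HP | 0 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 2 | 0 |
| influenzae 86-028NP | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| influenzae PittEE | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| influenzae PittGG | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| influenzae Rd KW20 | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| somnus 2336 | 0 | 1 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 2 | 1 |
| somnus 129PT | 0 | 2 | 1 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 |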
![Page 56: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/56.jpg)
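tRNA predictions: tRNAscan-SE
• Newbler de novo data
Values are numbers of tRNAs found by tRNAscan-SE.

| H. haemolyticus strain | All_Contigs | LargeContigs |
|---|---|---|
| M19107 | 50 | 46 |
| M19501 | 48 | 45 |
| M21127 | 52 | 49 |
| M21621 | 52 | 49 |
| M21639 | 51 | 48 |
| M21639_2 | 51 | 47 |
| M21709 | 48 | 45 |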
![Page 57: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/57.jpg)
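Distribution of tRNAs (large contigs file)
tRNAs found; column groups: Asymptomatic, Pathogenic, Influenza

| tRNA | M19107 | M19501 | M21127 | M21621 | M21639 | M21639_2 | M21709 |
|---|---|---|---|---|---|---|---|
| Ala | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| Arg | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Asn | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Asp | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Cys | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Gln | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Glu | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gly | 5 | 5 | 5 | 5 | 4 | 4 | 4 |
| His | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Ile | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Leu | 5 | 5 | 5 | 5 | 4 | 5 | 5 |
| Lys | 4 | 4 | 5 | 4 | 5 | 5 | 4 |
| Met | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Phe | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Pro | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| SeC | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| Ser | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Thr | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Trp | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Tyr | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Val | 2 | 1 | 4 | 5 | 5 | 3 | 1 |
| Contigs | 194 | 36 | 46 | 39 | 225 | 119 | 39 |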
![Page 58: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/58.jpg)
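Distribution of tRNAs (all contigs file)
tRNAs found; column groups: Asymptomatic, Pathogenic, Influenza; the 86-028NP, PittEE and PittGG columns are from the NCBI annotation

| tRNA | M19107 | M19501 | M21127 | M21621 | M21639 | M21639_2 | M21709 | 86-028NP | PittEE | PittGG |
|---|---|---|---|---|---|---|---|---|---|---|
| Ala | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 4 | 4 | 3 |
| Arg | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Asn | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Asp | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Cys | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Gln | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Glu | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 4 |
| Gly | 5 | 5 | 5 | 5 | 4 | 4 | 4 | 5 | 5 | 4 |
| His | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Ile | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 2 |
| Leu | 5 | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 5 | 5 |
| Lys | 4 | 4 | 5 | 4 | 5 | 5 | 4 | 4 | 4 | 4 |
| Met | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Phe | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Pro | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| SeC(p) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Ser | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Thr | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
| Trp | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Tyr | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Val | 3 | 1 | 4 | 5 | 5 | 4 | 1 | 5 | 5 | 5 |
| Contigs | 217 | 75 | 59 | 50 | 175 | 173 | 54 | | | |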
![Page 59: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/59.jpg)
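Individual Codons
• Only Glu and Val show different usage
• All other codons show usage similar to influenza
genomes
tRNAs found, by codon; column groups: Asymptomatic, Pathogenic, Influenza

| tRNA | Codon | M19107 | M19501 | M21127 | M21621 | M21639 | M21639_2 | M21709 | 86-028NP | PittEE | PittGG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Glu | GAA | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 4 |
| Glu | GAG | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Val | GTA | 2 | 0 | 3 | 4 | 4 | 3 | 0 | 4 | 4 | 4 |
| Val | GTC | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Val | GTG | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Val | GTT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |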
![Page 60: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/60.jpg)
rRNA Difficulties
• Highly conserved functional regions
• Followed by a short hyper-variable area.
• Multiple operon copies (~55 copies have been found in C. elegans at Chr 1, with 275 total)
• A closure of C. violaceum with 57 contigs found 7 contigs ending with 5S rRNA and 3 with 16S
Tung C-S, Joseph S, Sanbonmatsu KY. All-atom homology model of the Escherichia coli 30S ribosomal subunit. Nature Structural Biology 9, 750–755 (2002).
C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
Woese C. Bacterial evolution. Microbiol Rev. 1987 June; 51(2): 221–271. PMCID: PMC373105
![Page 61: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/61.jpg)
RNAmmer
• Online submission limits: 10k sequences and 20M nt
• Prescreening, followed by HMM
• Bacterial Training: 82% Actinobacteria,
Firmicutes, Proteobacteria
• Highest accuracy in 16S, then 23S
61
![Page 62: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/62.jpg)
rRNA Results
##gff-version2
##source-version RNAmmer-1.2 (IRIX64)
##date 2011-02-19
##Type DNA
# seqname source feature start end score +/- frame attribute
# ---------------------------------------------------------------------------------------------------------
contig00179 RNAmmer-1.2 (IRIX64) rRNA 1572 1686 82.9 + . 5s_rRNA
contig00031 RNAmmer-1.2 (IRIX64) rRNA 45 159 63.0 + . 5s_rRNA
contig00211 RNAmmer-1.2 (IRIX64) rRNA 172 1699 1894.4 + . 16s_rRNA
# ---------------------------------------------------------------------------------------------------------
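Preliminary rRNA Results

| Strain | 5S | 16S | 23S | Contigs |
|---|---|---|---|---|
| M19107-hae-AS | 2 | 1 | 0(1) | 217 |
| M19501-hae-AS | 1(3) | 0(0) | 0(0) | 75 |
| M21127-hae-P | 2 | 1 | 1 | 59 |
| M21621-hae-P | 2 | 1 | 1 | 50 |
| M21639-hae-P | 2 | 1 | 1 | 175 |
| M21709-Infl-P | 3(4) | 1 | 2 | 52 |
| Rfam influenza(4) | 7 | -- | 5 | ---- |
| JCVI infl-KW20 Rd | 5 | 5 | 5 | --- |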
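Per-strain counts like these can be tallied from the RNAmmer GFF records shown on this slide. A Python sketch, assuming the real output file is tab-delimited with nine columns (the slide shows the fields separated by spaces) and that the ninth column carries the rRNA type; the function name is illustrative.

```python
from collections import Counter


def count_rrna(gff_lines):
    """Count rRNA features by type (column 9) in RNAmmer GFF2 output."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and separator lines
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "rRNA":
            continue
        counts[fields[8]] += 1
    return dict(counts)


# The three records from the slide, rebuilt with tab separators.
records = [
    "##gff-version2",
    "\t".join(["contig00179", "RNAmmer-1.2", "rRNA", "1572", "1686", "82.9", "+", ".", "5s_rRNA"]),
    "\t".join(["contig00031", "RNAmmer-1.2", "rRNA", "45", "159", "63.0", "+", ".", "5s_rRNA"]),
    "\t".join(["contig00211", "RNAmmer-1.2", "rRNA", "172", "1699", "1894.4", "+", ".", "16s_rRNA"]),
]
print(count_rrna(records))
```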
62
![Page 63: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/63.jpg)
Ribosome Subunit Assembly
63
![Page 64: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/64.jpg)
Clustering of Influ-KW20
• Lengths consistent:
114/1538/2896
• Spacing between 5S and 23S: 247
• Spacing between 23S and 16S: 3x724,
479,2017
• Clusters vary in distance from: 27k-1,800k
A C E F B D 64
![Page 65: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/65.jpg)
Pathogenic Hae-23S Alignment
65
Cladogram with Distance
http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?distance=true&tree=phylogram&jobId=clustalw2-I20110221-192123-0426-9064345-oy&tool=clustalw2&analysis=tree. EMBL-EBI
![Page 66: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/66.jpg)
Has rRNA been associated with
virulence?
• V. vulnificus (shellfish/humans): 16S type B
highly associated with virulence over type A [1]
• H. aegyptius: 15 rRNA (16S+23S) gene
restriction patterns, one is associated with
most cases of BPF [2]
• V. cholerae O139 toxins linked to the
ribotype BglI [3]
[1] Nilsson WB, Paranjpye RN, DePaola A, Strom MS. Sequence polymorphism of the 16S rRNA gene of Vibrio vulnificus is a possible indicator of strain virulence. J Clin Microbiol. 2003 Jan;41(1):442-6.
[2] van Doorn L-J et al. Accurate Prediction of Macrolide Resistance in Helicobacter pylori by a PCR Line Probe Assay for Detection of Mutations in the 23S rRNA Gene: Multicenter Validation Study. Antimicrob Agents Chemother. 2001 May; 45(5): 1500–1504.
[3] Faruque et al. Molecular analysis of rRNA and cholera toxin genes carried by the new epidemic strain of toxigenic Vibrio cholerae O139 synonym Bengal. J Clin Microbiol. 1994 Apr;32(4):1050-3.
![Page 67: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/67.jpg)
Future: rRNA
• Determine coverage of rRNA areas in assembly
• Check contig edges for rRNA partial matches
• Map rRNA to contigs to determine if distance in
between could represent missed rRNA
• Run other methods of rRNA verification
67
![Page 68: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/68.jpg)
Rfam output
##gff-version 3
# rfam_scan.pl (v1.0)
# command line: /usr/bin/rfam_scan-1.0.2.pl -blastdb /storage2/db/rfam/rfam -o 454LargeContigs.rfam /storage2/db/rfam/Rfam.cm 454LargeContigs.fna
# CM file: /storage2/db/rfam/Rfam.cm
# query FASTA file: 454LargeContigs.fna
# start time: Tue Feb 22 12:28:46 EST 2011
# end time: Tue Feb 22 13:18:54 EST 2011
contig00211 Rfam similarity 191 721 344.88 + . evalue=4.15e-42;gc-content=53;id=SSU_rRNA_5.1;model_end=486;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_5;score=344.88
contig00203 Rfam similarity 15472 15836 280.07 + . evalue=6.18e-40;gc-content=49;id=tmRNA.1;model_end=359;model_start=1;rfam-acc=RF00023;rfam-id=tmRNA;score=280.07
contig00025 Rfam similarity 2611 2987 305.02 + . evalue=2.09e-40;gc-content=53;id=RNaseP_bact_a.1;model_end=367;model_start=1;rfam-acc=RF00010;rfam-id=RNaseP_bact_a;score=305.02
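Records like these are straightforward to load for downstream counting per Rfam family. A Python sketch, assuming the actual rfam_scan output is tab-delimited GFF3 with nine columns and semicolon-separated key=value attributes, as the slide suggests; the function name and returned field names are illustrative.

```python
def parse_rfam_gff_line(line):
    """Parse one rfam_scan GFF3 record into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "contig": seqid,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "strand": strand,
        "family": attributes.get("rfam-id"),
        "accession": attributes.get("rfam-acc"),
        "evalue": float(attributes["evalue"]) if "evalue" in attributes else None,
    }


# The tmRNA record from the slide, rebuilt with tab separators.
tmrna_line = "\t".join([
    "contig00203", "Rfam", "similarity", "15472", "15836", "280.07", "+", ".",
    "evalue=6.18e-40;gc-content=49;id=tmRNA.1;model_end=359;model_start=1;"
    "rfam-acc=RF00023;rfam-id=tmRNA;score=280.07",
])
print(parse_rfam_gff_line(tmrna_line))
```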
68
![Page 69: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/69.jpg)
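Rfam preliminary results
• Validated tRNAs found by tRNAscan-SE
• Had difficulty finding long rRNAs

| Haemophilus strain | # families | # entries | tRNA | tmRNA | 6S | 5S rRNA | 23S rRNA | SSU rRNA 5 | SRP bact | S15 | RNaseP | GrpII Intron |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ducreyi 35000HP | 18 | 82 | 47 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae 86-028NP | 20 | 96 | 57 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae PittEE | 20 | 96 | 57 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae PittGG | 20 | 94 | 55 | 1 | 1 | 7 | 6 | 6 | 1 | 1 | 1 | 0 |
| influenzae Rd KW20 | 20 | 96 | 56 | 1 | 1 | 7 | 6 | 6 | 1 | 2 | 1 | 0 |
| somnus 2336 | 18 | 80 | 47 | 1 | 1 | 6 | 5 | 5 | 1 | 1 | 1 | 1 |
| somnus 129PT | 19 | 80 | 47 | 1 | 1 | 6 | 5 | 5 | 1 | 1 | 1 | 0 |
| M19107 | 19 | 71 | 50 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| M19501 | 19 | 72 | 48 | 1 | 1 | 3 | 0 | 0 | 1 | 1 | 1 | 0 |
| M21127 | 20 | 75 | 52 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| M21621 | 19 | 74 | 52 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| M21639 | 20 | 74 | 51 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| M21639_2 | 21 | 76 | 51 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| M21709 | 22 | 77 | 48 | 1 | 1 | 4 | 2 | 1 | 1 | 1 | 1 | 0 |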
![Page 70: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/70.jpg)
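Rfam preliminary results
• Can clearly see that some sRNAs in H. haemolyticus
come from H. influenzae while others are from different
Haemophilus species

sRNA families:

| Haemophilus strain | His leader | TPP riboswitch | FMN riboswitch | Sxy | Alpha Op. RBS | Glycine Ribo | PreQ1 | GcvB | Lr-Pk1? | Lysine | Moco ribo | Thr Leader | Rrt | IsrK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ducreyi 35000HP | 0 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 0 |
| influenzae 86-028NP | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| influenzae PittEE | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| influenzae PittGG | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| influenzae Rd KW20 | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| somnus 2336 | 0 | 1 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 2 | 1 | 0 | 0 |
| somnus 129PT | 0 | 2 | 1 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 0 | 0 |
| M19107 | 1 | 2 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| M19501 | 1 | 4 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| M21127 | 1 | 3 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| M21621 | 1 | 3 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| M21639 | 1 | 3 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| M21639_2 | 1 | 3 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 2 |
| M21709 | 2 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |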
![Page 71: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/71.jpg)
sRNA prediction
• sRNAPredict3, sRNAScanner, nocoRNAc
• Some problems with inputs
– Require coordinates of protein coding genes
– Descriptions of secondary structures
– Positive training samples to create PWM
(sRNAScanner)
– Biggest problem: requires MSA to find
consensus
71
![Page 72: Gene Prediction: Preliminary Results](https://reader031.vdocuments.us/reader031/viewer/2022020912/6203475f24f6b61e9c662643/html5/thumbnails/72.jpg)
Future: sRNA
• Secondary structure: Blast, ClustalW (MSA)
• Prediction: RNAz, QRNA
• Filter: nocoRNAc
72