Advancing Science with DNA Sequence
Data Curation in IMG-ER
Natalia Ivanova
MGM Workshop
September 28, 2011
Tricky question
• What do you need to do data curation in IMG?
a) iPhone
b) PhD in Computer Science
c) supernatural powers
• Correct answer: you need an IMG account
http://img.jgi.doe.gov/er
What can be curated in IMG-ER?
1. Gene models
a) Add a gene
b) Make a gene pseudogene or “obsolete” (=delete it)
2. Functional annotations:
c) Product names
d) EC numbers
e) Gene symbols
If you believe something else needs to be changed (genome name, taxonomy, etc.) – please use the IMG Questions/Comments link
What can’t be changed: automated assignments to protein families (Pfam, COGs, TIGRfam, InterPro, SEED assignments, KO assignments)
Center point for curation – Gene Cart
• Product Name is free text (but see GenBank requirements http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html)
• Prot Description is free text (goes to “note” in GenBank submission)
• EC number and PUBMED ID – see explanation
• Notes are free text (go to “note” in GenBank submission)
• Gene symbol is “gene name” – 4-letter abbreviation; goes to “gene” in GenBank submission
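A minimal sketch of the field-to-qualifier mapping described above, assuming a curation record is a plain dict. The key names (prot_description, notes, gene_symbol) and the helper to_genbank_qualifiers are hypothetical and are not part of IMG; only the three mappings stated on this slide are included.

```python
# Hypothetical field names on the left; GenBank qualifiers on the right
# follow the slide (Prot Description and Notes -> "note", Gene symbol -> "gene").
FIELD_TO_QUALIFIER = {
    "prot_description": "note",
    "notes": "note",
    "gene_symbol": "gene",
}


def to_genbank_qualifiers(record):
    """Collect qualifier values from a curation record, skipping empty fields."""
    qualifiers = {}
    for field, qualifier in FIELD_TO_QUALIFIER.items():
        value = record.get(field, "").strip()
        if value:
            qualifiers.setdefault(qualifier, []).append(value)
    return qualifiers


example = to_genbank_qualifiers({
    "prot_description": "binds ATP",
    "notes": "start adjusted manually",
    "gene_symbol": "recA",
})
print(example)
```

Both free-text fields end up under the same “note” qualifier, so a submission may carry more than one note per gene.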
How to find the genes that need curation?
Two possible scenarios:
• You have submitted a genome to IMG-ER and want to have the best annotations possible for it (e.g. for GenBank submission)
• You’re an expert and know everything about a certain protein family (families) = “community service”
Curation of genome annotations
[Workflow diagram; steps shown: Compare Gene Annotations, find genome, Genome Statistics, review Gene Pages, add to Gene Cart, refine gene set]
Find Genomes:
• Genome Browser
• Genome Search
• “Hypothetical protein”, but with some evidence
• Non-hypothetical protein, but no evidence
• w/o enzymes but with candidate KO based enzymes
• Protein families
• Homologs/orthologs
• Gene Neighborhoods
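The two review categories above (“hypothetical protein” with some evidence, non-hypothetical protein with no evidence) can be sketched as a simple check. This is only an illustration: the function needs_review and the idea of passing evidence as a list of family hits are assumptions, not how IMG computes its Compare Gene Annotations lists.

```python
def needs_review(product_name, evidence):
    """Return the review category for a gene, or None if nothing looks off.

    product_name: current product name (free text)
    evidence: list of supporting hits, e.g. family identifiers (assumed format)
    """
    is_hypothetical = "hypothetical protein" in product_name.strip().lower()
    has_evidence = len(evidence) > 0
    if is_hypothetical and has_evidence:
        return "hypothetical, but with some evidence"
    if not is_hypothetical and not has_evidence:
        return "non-hypothetical, but no evidence"
    return None


# Made-up genes for illustration only.
genes = [
    ("gene_1", "hypothetical protein", ["family_hit_A"]),
    ("gene_2", "DNA polymerase III", []),
    ("gene_3", "recombinase A", ["family_hit_B"]),
]
flagged = {gid: needs_review(name, ev) for gid, name, ev in genes}
print(flagged)
```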
Why do you want to review annotations?
• Most IMG pipelines are optimized for specificity, so they are more likely to have false negatives, but generate few false positives
• Compare Annotations
– Product name is a consensus of multiple assignments: BLASTp, TIGRfam, COG, Pfam
– Sources of false negatives - cutoffs: TIGRfam trusted cutoffs are quite stringent; COG doesn’t have trusted cutoffs; BLASTp cutoff of 50% identity
• Candidate genes with KO annotations – sources of false negatives
– Cutoffs for % identity and alignment length
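A rough illustration of why hard cutoffs produce false negatives: a hit just below the threshold is dropped even if it is a true homolog. The 50% identity value comes from the BLASTp bullet above; the Hit class, the split_by_cutoff helper, and the 0.7 aligned-fraction default are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gene_id: str
    percent_identity: float   # 0-100
    aligned_fraction: float   # 0-1, alignment length / sequence length


def split_by_cutoff(hits, min_identity=50.0, min_aligned_fraction=0.7):
    """Split hits into (accepted, rejected) using both cutoffs.

    min_aligned_fraction=0.7 is an assumed value, not an IMG setting.
    """
    accepted, rejected = [], []
    for hit in hits:
        passes = (hit.percent_identity >= min_identity
                  and hit.aligned_fraction >= min_aligned_fraction)
        (accepted if passes else rejected).append(hit)
    return accepted, rejected


hits = [
    Hit("gene_A", 72.0, 0.95),
    Hit("gene_B", 48.5, 0.98),  # just under 50% identity: possible false negative
    Hit("gene_C", 65.0, 0.40),  # short alignment
]
accepted, rejected = split_by_cutoff(hits)
print([h.gene_id for h in accepted], [h.gene_id for h in rejected])
```

Rejected hits close to a cutoff are the ones worth reviewing by hand.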
Curation of annotation in one genome (or a set of genomes)
a) Your favorite genes (experimental verification, etc.) -> use Find Genes, Gene Search or BLAST
b) “Compare Annotations” on Organism Details page
c) “Candidate genes with KO annotations” on Organism Details page
d) PhyloProfiler
A shortcut for product name/EC number assignments based on KO
Example of a missed gene
• Run PhyloProfiler with Deinococcus geothermalis as the query and Deinococcus hopiensis as the target (with no homologs in)
• Select Dgeo_0119 as a sequence to check whether a homolog of this gene was missed in Deinococcus hopiensis
Adding missed genes - contd
• Use graphical viewer to check the translation
• Adjust the start if other start codons with better RBS exist upstream
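A minimal sketch of the first half of the start-adjustment step: listing alternative in-frame start codons upstream of the current start. It assumes a forward-strand nucleotide string, 0-based positions, and the common bacterial start codons ATG, GTG and TTG; the function name upstream_starts is hypothetical. It does not score the RBS, which still has to be checked in the graphical viewer.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(seq, current_start):
    """Return 0-based positions of in-frame start codons upstream of
    current_start, nearest first. Scanning stops at the first in-frame
    stop codon, since the reading frame cannot extend past it."""
    seq = seq.upper()
    candidates = []
    pos = current_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    return candidates


# Toy sequence: TAA GTG AAA ATG CCC ATG GGG TGA, current start at position 15.
seq = "TAAGTGAAAATGCCCATGGGGTGA"
print(upstream_starts(seq, 15))
```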
Reviewing your annotations
• Organism Details page -> Genome Statistics
• MyIMG
IMG curation exercises
Go to the link in the usual place:
http://genomebiology.jgi-psf.org/Content/MGM-10.Sep2011/agenda.html
The first 2 pages are questions without answers; the rest is a cheat sheet