agbiodata coordinated innovation networks grant...2018/09/05 · arabidopsis thaliana genome...
TRANSCRIPT
![Page 1: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/1.jpg)
AgBioData Coordinated Innovation Networks Grant
Title: AgBioData: A Coordinated, Collaborative and Innovative Network of Genomic, Genetic and Breeding Databases for Enhanced Agricultural
Research Outcomes
USDA National Institute of Food and Agriculture (NIFA)
Food and Agriculture Cyberinformatics and Tools Initiative (FACT) https://nifa.usda.gov/program/fact
FACT focuses on data science to enable systems and communities to effectively utilize data, improve resource management, and integrate new technologies and approaches to further U.S. food and agriculture enterprises.
WHAT IS A COORDINATED INNOVATION NETWORK? Coordinated innovation networks projects foster communities that address critical areas by bringing together experts from different disciplines to identify solutions.
![Page 2: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/2.jpg)
Letter of Intent submitted July 25, 2018. Accepted August 15, 2018
Dorrie Main is Project Director. If funded funds will go to:
• Wash State for yearly workshop, and website support (10% FTE)
• Iowa State University for 50% FTE coordinator (J Campbell)
• Phoenix Bioinformatics for Database Sustainability Pilot Study
![Page 3: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/3.jpg)
1: Develop and implement standards for AgBioData data curation 2: Establish common practices for broad use of ontologies, specifically GO, PO, TO and PATO, and provide tools and training for researchers 3: Establish metadata standards across AgBioData members and promote compliance 4: Identify opportunities for a federated model of data exchange for AgBioData member databases. 5: Identify funding options for long term database sustainability 6: Work with funding agencies and journals to enhance data provision by researchers
Objec&ves:
![Page 4: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/4.jpg)
1. Sept 15: Steering Committee Draft to Whole group 2. Oct 19: Submitted to WSU Grants Office
So, please plan to help in the area most near to your interests between Sept 15 and October 19. THANKS!!!! We would like to include a representative from each database/resource as an official collaborator on the proposal – will be in touch soon to request letter if this is agreeable Note about the Steering Committee: We are slowly thinking about governance bylaws for AgBioData. We WILL rotate people onto the Steering committee (maybe by election), and we hope to start this process as soon as the grant is in.
Timeline:
![Page 5: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/5.jpg)
TODAYSMEETING:GenomeNomenclatureinvariousorganismsEthy/Maggie:MaizeTanya:ArabidopsisSook:Prunus,CoIonMarcela:Many-EnsemblePankaj:Many-PlanteomeTaner:Wheat
![Page 6: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/6.jpg)
Nomenclatureprotocolsformaizegenomes
EthalindaCannon&
MargaretWoodhouseMaizeGDB
![Page 7: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/7.jpg)
Thestateofmaizegenomeassemblies
MaizeGDBholds10complete,reference-qualitymaizegenomeswithannotaDon.• 3havemulDpleversions.• 4newgenomesarenearingcompleDon.• 25+newgenomesareinprogress.
Wecouldn’tleavenamingtotheWildWest.
ThemaizecommunityhasalonghistoryofseMng,and[mostly]sDckingtoagreed-uponnomenclaturerules.AmaizenomenclaturecommiReemaintains,andalongwithMaizeGDB,aRemptstoenforcetherules.
![Page 8: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/8.jpg)
ThestateofmaizenomenclatureThefirst3B73referencegenomes:
V1:genome=B73RefGen_v1,AGPv1;annotaDon=4aV2:genome=B73RefGen_v2,AGPv2;annotaDon=5bV3:genome=B73RefGen_v3,AGPv3;annotaDon=5b+
Thegenomenameswereconsistent,buttheformatdidn’tallowforgenomeassembliesforaddiDonalmaizelines.
TheannotaDonsnameswereunrelatedtogenomenames.
The“+”in“5b+”isareservedURLcharacterandissDllcausingoccasionalproblemsinMaizeGDBcode.
Weneededadifferentnomenclaturesystem.
![Page 9: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/9.jpg)
1.Namingmaizegenomeassemblies
ObjecDves:• Unique• Consistent• Human-andmachine-readable
• Short• Enforcedacrossthemaizeresearchcommunity
• Noreservedsymbols
Thistaskwassurprisinglydifficult.
![Page 10: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/10.jpg)
1.Namingmaizegenomeassemblies
AssemblyidenDfiers:(CauDonwhenencodinginformaDoninidenDfiers)
Z(species)-(culDvaroraccession)-(DRAFT/REFERENCE)-(group)-(version)Forinstance,B73version4isofficiallynamed:
Zm-B73-REFERENCE-GRAMENE-4.0Zm:Zeamays.B73:TheculDvar.REFERENCE:ThisisthereferencegenomeforB73(“reference”meansapseudomoleculeassemblyasopposedtounassembledscaffolds,whicharedenoted“DRAFT”).GRAMENE:thegroupthatsequencedversion4.4.0:ThisisthefourthversionoftheB73genome
![Page 11: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/11.jpg)
2.NamingmaizegenomeannotaLons
AnnotaLonshaveashortername.
• Uniqueacrossallgenomeassemblies.
• IdenDfyofgenomeassemblyisbuiltintothename.
• Usedtoprefixgenemodelnames
• MinimalinformaDonencodedinname
![Page 12: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/12.jpg)
2.NamingmaizegenomeannotaLonsAnnotaDonidenDfiers:
Z(species)(series)(assemblyversion)
Forinstance,ForthemaizelineW22(thefourthgenomesequencedsincethenewnamingconvenDonswereestablished),therearetwoversionsofthegenome.Therefore:• Thefirstassemblyversion:Zm0004a• Thesecondassemblyversion:Zm0004b
004–fourthmaizegenomeassembleda/b–version1and2
![Page 13: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/13.jpg)
2.NamingmaizegenomeannotaLonsThirdpartyannotaDons
WeacceptthenamingconvenDonsusedbythirdpartyannotators,forexample,GenBank.
![Page 14: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/14.jpg)
3.NamingmaizegenemodelsObjecLves:
• Shortnames.
• Assemblybuiltintoname.
• Uniqueacrossallgenomeassemblies.
• Appliesonlytothe“reference”annotaDon(othersusetheirownconvenDons,e.g.NCBI).
• NumberingshouldnotimplygenelocaDonororder*.
*thisiscontroversial!
![Page 15: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/15.jpg)
3.Namingmaizegenemodels
GenemodelnamesaregeneratedbytakingthegenomeannotaDonidenDfierandappendingaunique,numericalidenDfiertoit:
Z(species)(series)(version)(randomgeneidenDfier)
TheuniquenumericalidenDfierdisDnguishesonegenemodelfromanother.Forexample,forB73version4:
Zm00001d000001,Zm00001d020002,Zm00001d001224,etc
![Page 16: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/16.jpg)
3.Namingmaizegenemodels
Versioning:disLnguishingdifferentannotaLonversionsinthesameassembly:
• Assume(hope!)thatnewannotaDonversionsretainthenamesofidenDcalgenemodels.
• Splitormergedgenemodelsmustgetnewnames.
• ToavoidconfusionthatcanarisewhentherearedifferencesbetweenfilesdownloadedatdifferentDmes,versionsneedtobemadeexplicit.
![Page 17: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/17.jpg)
3.NamingmaizegenemodelsVersioning:disDnguishingdifferentannotaDonversionsinthesameassembly:
Z(species)(series)(assemblyversion).(annotaDonversion)
Forexample,thecurrentannotaDonforB73v4is:Zm0001d.2Zm:Zeamays0001:Firstgenomeassemblyinseriesd:4thversionoftheassembly2:secondversionoftheannotaDon
![Page 18: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/18.jpg)
3.NamingmaizegenemodelsVersioning:disDnguishingdifferentannotaDonversionsinthesameassembly:
• Assume(hope!)thatnewannotaLonversionsretainthenamesofidenLcalgenemodels.
• Splitormergedgenemodelsmustgetnewnames.
• ToavoidconfusionthatcanarisewhentherearedifferencesbetweenfilesdownloadedatdifferentLmes,versionsneedtobemadeexplicit.
Theannota)onversionisnotindicatedinthegenemodelname.
![Page 19: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/19.jpg)
4.Namingpan-genes
Apan-genomeisthecumulaDvediversitywithinallsequencedmaizeculDvarsinZeamays,includingallannotatedgeneswithineachculDvar.Pan-genesaregenesthathaveorthologsinmulDplemaizeculDvarswithinthepan-genome.
![Page 20: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/20.jpg)
4.Namingpan-genes:ideas
Forcurated*genes(genesthathavebeencharacterizedgeneDcally),pan-genesofcuratedgeneswillbegiventheapprovedgenesymbol(suchaslg1)ineverymaizegenomewherethegeneispresent.
*human-curated
![Page 21: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/21.jpg)
4.Namingpan-genes:ideasFornon-curated*genemodels,allgenemodelsacrossallmaizelinesthatareintheexpectedsyntenicregionwilleither:• begiventhegenemodelIDofthefirstgenomesequencedthathasthesyntelog;or
• begivenanewidenDfier(suchasZ0123456)thatwillbesharedasanaliasorsynonymamongallsyntelogs(thesesyntelogswillsDllretaintheiruniqueannotaDonidenDfiertoo)
*Automatedprocess,nohumancuraLon
![Page 22: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/22.jpg)
Arabidopsisthalianagenomeannota)on
Thestorysofar
![Page 23: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/23.jpg)
HistoryofA.thalianagenomeannota)on
• Originalannota)on:– 2000:Comple)onofgenomesequence,TIGR1genomerelease
• Reannota)on(10versions)– 2005:TAIR6genomerelease– 2016:Araport11genomerelease
• 20??:Nextgenomerelease
![Page 24: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/24.jpg)
SourceGermplasm=Col-0
• Past:Mul)pleCol-0stocksused,unclearhowthesewererelatedtoeachother.
• Proposal:Col-0seedstockCS70000designatedasthereferenceseedstock.
![Page 25: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/25.jpg)
NamingConven)ons
• AT[1-5,C,M)gNNNNN• Locus:At1g01020• Genemodel:At1g01020.1,At1g01020.2,…
![Page 26: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/26.jpg)
• Currentlynodis)nc)oninnamingbetweengenomeassembly(pseudo-chromosomes)andgeneannota)on(individualgenecalls)– TAIR10genomeassembly(sameasTAIR9)– TAIR10geneannota)on(differentfromTAIR9)
• Notidealapproach–needtobeabletoversion/namethemindependently
![Page 27: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/27.jpg)
Exis)ng/Upcomingissues• Lersequencepublished–usedAGIiden)fierapproachbut
inconsistentusebetweenCol-0andLeruse– Col-0At1g01040!=LerAt1g01040– Createdmappingfilewithauthor-suppliedsourcefiles
• ‘Pla)num’standard,PacBiosequences(denovoassembly)ofCol-0,Ler,upto50otherecotypescoming–needforaconsistentnomenclatureapproach
• ManyotherecotypesaresequencedwithCol-0referenceguidedassembly(1001genomesproject)andcould/shouldbeannotated
• Varia)oningeneinser)on/dele)on/rearrangement/modifica)oninotherecotypesvs.Col-0
![Page 28: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/28.jpg)
Handlinguserfeedback
• Accumulatesugges)onsbetweenreleases– Needfortracking,verifica)on,standardsforacceptanceofedits
• Incrementalupdates?Howtopropagatetheseefficientlyandeffec)vely–NCBI/DDBJ/EMBLandothers
![Page 29: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/29.jpg)
GENOME NAMING GUIDELINE IN GDR (AND COTTONGEN, CGD, CSFL)
Sook Jung Washington State University
![Page 30: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/30.jpg)
Genome Naming Guideline ◦ [Genus] [species] genome v[assembly version].a[annotation-version]
◦ Example: Prunus persica genome v2.0.a1
◦ Where: ◦ Genus = the genus of the organism ◦ Species = the species of the organism
◦ Assembly version = the version of the assembly with a major and minor number. The major version is incremented with major changes or releases of the assembly and the minor number is incremented when minor changes are made to the assembly.
◦ Annotation-version = a single numeric value that is incremented each time a new annotation is released. It restarts at 1 each time the assembly version is incremented.
![Page 31: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/31.jpg)
It works most of the time (GDR has 21 genome assemblies from 7 crops and 14 species)
◦ Prunus persica Genome v2.0.a1 ◦ Prunus persica Genome v1.0 ◦ Prunus avium Genome v1.0.a1 ◦ Pyrus communis Genome v1.0
![Page 32: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/32.jpg)
Multiple genome assemblies from the same species – add accession name
◦ Malus x domestica Genome v1.0.a1 ◦ Malus x domestica Genome v2.0.a1 ◦ Malus x domestica Genome v3.0.a1 ◦ Malus x domestica GDDH13 Whole Genome v1.1 ◦ Up to v3.0 was done using heterozygous Golden Delicious, and the newest version
was done using GDDH13, a doubled-haploid Golden Delicious tree.
![Page 33: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/33.jpg)
Multiple genome assembly with the same accession? – add institution name.
◦ CottonGEN example ◦ Gossypium hirsutum (AD1) acc 'TM-1' genome CGP-BGI v1.1 assembly & v1.0
annotation ◦ Gossypium hirsutum (AD1) acc 'TM-1' genome NAU-NBI v1.1 assembly & v1.1
annotation ◦ Gossypium hirsutum (AD1) acc 'TM-1' genome UTX-JGI v1.0 assembly v1.0
◦ Same accession and institution? Use published name. ◦ Rosa chinensis Genome v1.0 ◦ Rosa chinensis Old Blush homozygous Genome v2.0 ◦ Rosa chinensis Old Blush Illumina genome v1.0
![Page 34: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/34.jpg)
Sometimes authors skip versions when they publish.. ◦ Fragaria vesca Genome v1.0.a1
◦ Fragaria vesca Genome v1.1.a1
◦ Fragaria vesca Genome v1.1.a2
◦ Fragaria vesca Genome v2.0.a1
◦ Fragaria vesca Genome v2.0.a2
◦ Fragaria vesca Genome v4.0.a1
![Page 35: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/35.jpg)
Some times authors publish duplicated names
◦ Fragaria vesca Genome v1.0.a1
◦ Fragaria vesca Genome v1.1.a1
◦ Fragaria vesca Genome v1.1.a2
◦ Fragaria vesca Genome v2.0.a1 ◦ Fragaria vesca Genome v1.1.a2 (Darwish et al. 2014) track – we originally called it Fragaria
vesca Genome v2.0.a2
◦ Fragaria vesca Genome v2.0.a2 ◦ Fragaria v2.0.a2 (Li et al. 2017)
![Page 36: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/36.jpg)
conclusion
◦ Need to refine the naming system ◦ Need to work with journals and communities
![Page 37: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/37.jpg)
Perspectives ongenomenomenclature from56
speciesMarcelaKarey Tello-Ruiz,PhD
September5th,2018For AgBioData Consortium Nomenclature Discussion
![Page 38: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/38.jpg)
VariousGeneIDssystems@Gramene/EnsemblPlants• ReusegeneIDsassignedbythegenomeprojectsresponsibleforthe
annotationandprovidedtotheINSDC:• NCBI:GenBank• EMBL-EBI:ENA• DDBJ:DDBJ
• Gramene/PlantReactome/EnsemblPlantsworkwiththeINSDC(mainlyENA)toresolveIDs(prioritizingUniProtproteome-basedIDs,ifapplicable).
• Adoptasystemagreed/adoptedbythecommunityearlyon• Provideamechanism(IDconverter)formappingbacktoIDsinprevious
versions,whenpossible• MODsandotherresourcesworkwithGenBank/ENAtogetaclearpictureof
howIDsareconstructed
![Page 39: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/39.jpg)
FulltableinGramene’sreleasenotes(56species)http://www.gramene.org/release-notes-58
Species Assembly Gene Annotation Gene ID example
Arabidopsis lyrata Araly1.0 Araly1.0 Al_scaffold_0001_1000
Arabidopsis thaliana TAIR10 AraPort11 AT3G52430
Chlamydomonas reinhardtii
v5.5 (GCA_000002595.3)
JGI via ENA CHLRE_15g637761v5
Oryza sativa japonica IRGSP-1.0 RAP-DB Os05g0113900
Solanum lycopersicum
SL2.50 ITAG2.3 Solyc01g087250.2
Sorghum bicolor V3 (GCA_000003195.3)
JGI via ENA SORBI_3004G141800
Triticum aestivum IWGSC v1.0 IWGSC v1.0 TraesCS3D01G273600
Vitis vinifera IGGP 12x 2012-07-CRIBI VIT_01s0010g03900
Zea mays B73_RefGen_v4 MAKER-CSHL Zm00001d048577
● Inthenearfuture,wewillseetheassembliesstabilizing,buttheannotationsupdatedbasedonevidence.
● Geneannotations:protein-coding,non-codingRNAs,pseudogenes.
● Regulatoryfeatures???
![Page 40: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/40.jpg)
Challenges1. LackofcontinuityingeneIDs/names2. Cautiononincludingtoomuchinformationinaname3. Propagatinggeneinformationbetweenversionsofthereferenceassembly.
Manydifferentapproachesbasedonthequalityofthedraftassemblyandthecomplexityoftheorganism.
4. ChallengesassociatedwithinitialsubmissionandupdatestoNCBI5. Notallproteomesfullyrepresented/updated@UniProt6. PanGenomes:Movingfromasinglereferencetomanyreferences
• Sameasabovebutmorecomplex• Learnfromexperience:Arabidopsisthaliana &A.lyrata• Grape:suggestiontoincludespeciesandcultivarnames(prefix/suffixlettercodes)andnumericIDacrossspecies(somegenemodelswillbespecies-specific).
![Page 41: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/41.jpg)
Proposednamingstrategyforthegrapepangenome• Annotate species and varieties in a gene ID.
• Species: Community follows UniProt recommendation. First two letters of species name as prefix. Example: Vitci for Vitis cisera (ask UniProt to add species if not in the list: http://www.uniprot.org/docs/speclist).
• Cultivar: Annotate the cultivar in the suffix (2-4 letters code), as they are basically considered different alleles. Examples: Use the prefix Vitvi (V. vinifera) for cabernet sauvignon (cs), pinot noir (pn), and flame, then suffix for cultivar (Vitvi00g0000-cs is a cab gene). Concord is a V. labrusca(Vitla), add suffix.
• Relating orthologous genes. Compare the sequences to the latest release of a gene annotation for IDs to match. A challenging workflow that requires renaming and transformation of the IDs...
![Page 42: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/42.jpg)
Convertassembly-specific&&
cross-referencegeneIDs
![Page 43: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/43.jpg)
Suggestionstowardsstandardizinggenenomenclature• OncethedataisreleasedfromINSDC,everyonemustuse
thesameID• Prioritizerepresentationof(up-to-date)proteomesin
UniProt• GeneIDsindependentofassemblyversion,genomic
location,function&orthology• Adoptstandardsfromgoodqualitynomenclaturesystems
(e.g.,human,yeast,Arabidopsis)
![Page 44: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/44.jpg)
Suggestionstowardsstandardizinggenenomenclature• StandardtoolstoconvertIDsacrossresources,
assemblies,etc.• Communitycollaborativeplatform - Builda'datawiki'
andletcommunitiesprovidetheirownnames(DanBolser)
![Page 45: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/45.jpg)
GeneNames:Ensemblgenenameprojectionpipeline
• Upto90%ofgenenamescouldbeprojectedfromcloselyrelatedspecies(e.g. betweentherice).
• Forwhichpairsofspeciesdoesitmakesensetoprojectgenesbetween?• E.g.overwhattaxonomicrangeandwithwhatstringency?
![Page 46: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/46.jpg)
Gene NomenclaturePlanteome Strategy
Pankaj JaiswalOregon State University
September 5, 2018AgBioData Group Meeting
NSF #1340112
![Page 47: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/47.jpg)
Genomes
• Reference genomes: • Accession/germplasm/ecotype-specific
• Non-reference genomes • For genetic diversity (Mainly for SNPs but if fully sequenced can be used for
finding new and novel genes)• Accession/germplasm/ecotype-specific
• Pan-genomes• Species-level• Genus-level
![Page 48: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/48.jpg)
Gene set changes observed• Identify new gene (includes coding and noncoding)• Identifying null allele (gene missing in an ecotype/accession)• Existing genes
• Start and stop coordinates may change• New/altered UTRs• Different transcript isoforms
• Novel splicing• New transcribed region• New peptide
• Mergers • Splits (may need adding a minimum of one new gene) • Obsolete/delete
![Page 49: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/49.jpg)
Characteristics
Molecular Function(Gene Ontology)
Biological Process(Gene Ontology)
Cellular Component(Gene Ontology)
Expressed at Growth Stage(Plant Ontology
Expressed in Plant Structure(Plant Ontology)
Phenotype(Trait Ontology)
Taxon
Gene PathwayGermplasm
Stock/accession
Map/Genome
Locus
Marker QTL
Alelle
Alelleic gene form
Transcript (mRNA)& Alternative forms
Peptide
Polymorphism
Synteny
EnvironmentClimate
Data and Biocuration
![Page 50: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/50.jpg)
Some points to consider• Need consistency for the community and semantic querying, NLP and data
updates• Keep the Super gene set at the Species/genus level considering we will see
a lot of Pan-genome projects and germplasm-specific gene sets• Populate the super gene set by adding uniques to the pool• For each version of the annotation (not assembly) map to the super pool
and borrow the IDs or create a new one for new genes• Never create version dependent IDs or insert version# in gene ID• Make an arrangement with INSDC/GenBank to deposit the new annotation
(proteomes etc)—THEY WILL TAKE CARE OF THE MAPPING AND VERSION NUMBER as long as IDs are consistent.
• MANY genomes lack/do not deposit annotations
![Page 51: AgBioData Coordinated Innovation Networks Grant...2018/09/05 · Arabidopsis thaliana genome annotaon The story so far History of A. thaliana genome annotaon • Original annotaon:](https://reader035.vdocuments.us/reader035/viewer/2022063020/5fe2db499d40b15c7c182914/html5/thumbnails/51.jpg)
Some points to consider• If the users perform structural annotation to revise the existing model,
continue using the same gene ID at the locus• Release it with the new annotation version periodically• Second/third party annotations: They need to map to the super set IDs and share
their data to become part of the official release. EVERYONE Needs to use the same common set of updated gene sets.
• If a new gene is added, create a new ID• If the genome assembly is in the pseudomolecule form use the number series and
zero padded 100/1000th space to maintain series• If the genome is in scaffolds use the next available number• Release it with the new annotation version periodically (quarterly/bi/annual)