![Page 1: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/1.jpg)
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced
Genomes
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
January 16, 2014
GMOD 2014
![Page 2: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/2.jpg)
Why am I giving this keynote?
![Page 3: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/3.jpg)
http://www.flickr.com/photos/portland_mike/6140660504/
Harnessing the crowd…
![Page 4: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/4.jpg)
… to organize information
http://www.flickr.com/photos/45697441@N00/6629580443
![Page 5: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/5.jpg)
My simplified history of MODs
![Page 6: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/6.jpg)
My simplified history of MODs
![Page 7: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/7.jpg)
GMOD is widely used
199 (!) organizations listed as GMOD users
![Page 8: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/8.jpg)
Does the current model scale?
![Page 9: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/9.jpg)
Does the current model scale?
![Page 10: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/10.jpg)
Does the current model scale?
[Figure: number of sequenced genomes (log scale, 1 to 1,000,000) by year, 1997 to 2025, with separate series for Bacteria, Eukaryotes, and Archaea]
![Page 11: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/11.jpg)
Does the current model scale?
![Page 12: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/12.jpg)
The Long Tail of genomic data is being lost
Identified 517 operons and 103 small regulatory RNAs...
![Page 13: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/13.jpg)
The Long Tail of genomic data is being lost
Identified 517 operons and 103 small regulatory RNAs...
![Page 14: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/14.jpg)
At least you can download structured data…
![Page 15: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/15.jpg)
Centralized Model Organism Database concept
CMOD
![Page 16: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/16.jpg)
http://www.flickr.com/photos/aigle_dore/5626312363/
GMOD as a Service (GaaS)
![Page 17: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/17.jpg)
http://www.flickr.com/photos/shannonmary/187131727/
![Page 18: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/18.jpg)
Few genes are well annotated…
Data: NCBI, February 2013
[Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing count; marked values: 41% and 65%]
Genes labeled on the plot: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF
![Page 19: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/19.jpg)
… because the literature is sparsely curated?
[Figure: number of PubMed-indexed articles by year, 1979 to 2009; y-axis from 0 to 1,000,000]
![Page 20: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/20.jpg)
… because the literature is sparsely curated?
[Figure: average capacity of a human scientist (number of articles read by a typical scientist); axis ticks at 0, 10, 20]
![Page 21: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/21.jpg)
311,696 articles (1.5% of PubMed) have been cited by GO annotations
![Page 22: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/22.jpg)
Sooner or later, the research community will
need to be involved in the annotation effort to scale
up to the rate of data generation.
![Page 23: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/23.jpg)
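The Long Tail is a prolific source of content
[Figure: content produced versus contributors (sorted), divided into a Short Head and a Long Tail]
News: Newspapers (Short Head), Blogs (Long Tail)
Video: TV/Hollywood (Short Head), YouTube (Long Tail)
Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
Food reviews: Food critics (Short Head), Yelp (Long Tail)
Talent judging: Olympics (Short Head), American Idol (Long Tail)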
![Page 24: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/24.jpg)
Wikipedia is reasonably accurate
![Page 25: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/25.jpg)
Wikipedia has breadth and depth
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
[Figure: articles and words (millions), Wikipedia versus Britannica Online]
![Page 26: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/26.jpg)
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
![Page 27: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/27.jpg)
Filtering, extracting, and summarizing PubMed
Documents
Concepts Review article
![Page 28: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/28.jpg)
Filtering, extracting, and summarizing PubMed
Documents
Concepts
![Page 29: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/29.jpg)
Wiki success depends on a positive feedback loop
Gene wiki page utility
Number of users
Number of contributors
1001
2002
![Page 30: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/30.jpg)
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Utility
Users
Contributors
![Page 31: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/31.jpg)
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Utility
Users
Contributors
![Page 32: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/32.jpg)
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
Utility
Users
Contributors
[Figure: Editors (editor count) and Edits (edit count)]
![Page 33: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/33.jpg)
A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
![Page 34: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/34.jpg)
Making the Gene Wiki more computable
Structured annotations
Free text
![Page 35: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/35.jpg)
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel GO annotations
2147 novel DO annotations
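The pipeline on this slide (wikilink, exact match to a GO term, Gene Wiki page mapped to an Entrez Gene ID, candidate assertion) can be sketched roughly as below. This is a minimal illustration and not the Gene Wiki project's actual code: the function name, the page title, and the tiny lookup tables are hypothetical. Only the identifiers Entrez Gene 334 and GO:0006897 appear on the slide.

```python
def candidate_assertions(page_title, wikilinks, go_labels, page_to_gene):
    """Propose (gene_id, go_id) pairs from the wikilinks found on one gene page."""
    gene_id = page_to_gene.get(page_title)
    if gene_id is None:
        return []  # page is not mapped to a gene, so nothing can be asserted
    found = []
    for link in wikilinks:
        # "Exact match" here means a case-insensitive match on the term label.
        go_id = go_labels.get(link.strip().lower())
        if go_id is not None and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found


# Hypothetical lookup tables, kept tiny for illustration.
go_labels = {"endocytosis": "GO:0006897"}
page_to_gene = {"Example gene page": 334}  # page title is made up

wikilinks = ["Endocytosis", "Cell membrane", "endocytosis"]
result = candidate_assertions("Example gene page", wikilinks, go_labels, page_to_gene)
print(result)
```

Each pair is only a candidate: a link from a gene page to a GO term's article suggests an annotation, but it still needs review before it counts as one.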
![Page 36: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/36.jpg)
Gene Wiki content improves enrichment analysis
GO term
Gene list
Concept recognition
PubMed abstracts
Enrichment analysis
GO:0007411
axon guidance
(GO:0007411)
264 genes
Linked genes through PubMed
P = 1.55 E-20
811 articles
Yes No
Yes 13 2
No 251 12033
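The 2x2 table above can be turned into an enrichment p-value with a one-sided hypergeometric (Fisher's exact) test. The sketch below assumes that is the kind of test behind the slide's number; the transcript does not say which test was used, and the row and column labels did not survive extraction. The upper-tail probability is the same whichever way the table is oriented. The helper name is hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Counts copied from the slide's 2x2 table.
a, b, c, d = 13, 2, 251, 12033
N = a + b + c + d  # all genes in the table

# 13 of 15 genes overlap a set of 264 (13 + 251, matching "264 genes").
p = hypergeom_tail(k=a, n=a + b, K=a + c, N=N)
print(f"{p:.3g}")
```

With these counts the result is on the order of 1.5e-20, in line with the P = 1.55 E-20 quoted on the slide.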
![Page 37: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/37.jpg)
Gene Wiki content improves enrichment analysis
GO term
Gene list
Concept recognition
PubMed abstracts
Gene Wiki
+
Enrichment analysis
GO:0006936 GO:0006936
muscle contraction
(GO:0006936)
87 genes
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22 E-09
251 articles
87 articles
![Page 38: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/38.jpg)
Gene Wiki content improves enrichment analysis
[Figure: scatter plot of enrichment p-values, PubMed only versus PubMed + Gene Wiki (GW); regions marked "More significant: PubMed + GW" and "More significant: PubMed only"; labeled point: muscle contraction]
![Page 39: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/39.jpg)
The Long Tail of scientists is a valuable source of
information on gene function
![Page 40: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/40.jpg)
http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
Can we skip text mining?
![Page 41: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/41.jpg)
Wikidata
Provide a database of the world’s knowledge that
anyone can edit
- Denny Vrandečić
![Page 42: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/42.jpg)
Wikidata understands scale
![Page 43: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/43.jpg)
Wikidata understands scale
14 million Wikidata items…
…13 million total genes in Entrez Gene
![Page 44: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/44.jpg)
Wikidata understands scale
27 million Wikidata statements…
…150k total GO annotations
![Page 45: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/45.jpg)
Wikidata for biology
is a
regulates
Interacts with
Protein
Glycoprotein
Neural development
VLDL receptor
Amyloid precursor protein
Property:P31
Property:P128
Property:P129
Q8054
Q187126
Q1345738
Q1979313
Q423510
Q414043
Reelin
http://www.wikidata.org/wiki/Q414043
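The diagram on this slide describes Reelin as a set of item, property, item statements. A rough way to picture that model in code is a list of triples plus a label lookup. The pairing of IDs with labels below follows the order in which they appear on the slide, and the choice of which items hang off which property is an assumption made for illustration; neither is checked against Wikidata itself.

```python
from collections import defaultdict

# Labels paired with IDs by their order on the slide (an assumption).
LABELS = {
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "Q414043": "Reelin",
}

# (subject, property, object) triples; the grouping is illustrative only.
STATEMENTS = [
    ("Q414043", "P31", "Q8054"),
    ("Q414043", "P31", "Q187126"),
    ("Q414043", "P128", "Q1345738"),
    ("Q414043", "P129", "Q1979313"),
    ("Q414043", "P129", "Q423510"),
]


def describe(item_id, statements, labels):
    """Group an item's statements by property label for human-readable output."""
    grouped = defaultdict(list)
    for subject, prop, obj in statements:
        if subject == item_id:
            grouped[labels[prop]].append(labels[obj])
    return dict(grouped)


summary = describe("Q414043", STATEMENTS, LABELS)
for prop_label, values in summary.items():
    print(f"Reelin {prop_label}: {', '.join(values)}")
```

Because every subject, property, and object is an identifier rather than free text, the same statements can be queried or joined without any text mining.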
![Page 46: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/46.jpg)
Wikidata for biology
Property:P31
Property:P128
Property:P129
Q8054
Q187126
Q1345738
Q1979313
Q423510
Q414043
http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
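The URL above calls the `wbgetentities` action of the Wikidata API. A small sketch of building that request and reading a label from the JSON reply might look like this. The helper names are hypothetical, `format=json` is added here so the reply is machine-readable, and the sample payload is hand-written and trimmed to the one field the sketch reads, not a captured response.

```python
import json
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"  # base URL as shown on the slide


def entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",  # not on the slide; added to request JSON output
    }
    return API + "?" + urlencode(params)


def label_of(response, entity_id, language="en"):
    """Pull the label for one language out of a wbgetentities reply."""
    return response["entities"][entity_id]["labels"][language]["value"]


url = entity_url("Q414043")
print(url)

# Hand-written, heavily trimmed stand-in for a real reply (illustration only).
sample = json.loads(
    '{"entities": {"Q414043": {"labels": {"en": {"language": "en", "value": "Reelin"}}}}}'
)
print(label_of(sample, "Q414043"))
```

To fetch live data, the URL could be passed to `urllib.request.urlopen` and the body decoded with `json.load`; that step is left out so the sketch runs without network access.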
![Page 47: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/47.jpg)
Increasing biological data in Wikidata
http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
![Page 48: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/48.jpg)
Loading genomic data into Wikidata
Entrez Gene
Ensembl
UniProt
UCSC
PDB
RefSeq
![Page 49: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/49.jpg)
Wikidata gene model
Added ~1000 human genes so far…
![Page 50: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/50.jpg)
Wikidata as CMOD?
CMOD
![Page 51: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/51.jpg)
Wikidata as CMOD?
CMOD
Powered by:
CMOD
![Page 52: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/52.jpg)
The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).
![Page 53: A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, Ph.D. @andrewsu asu@scripps.edu January](https://reader035.vdocuments.us/reader035/viewer/2022062322/56649f055503460f94c195d2/html5/thumbnails/53.jpg)
Gene Wiki Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim; Many Wikipedia editors; WP:MCB Project
Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Tobias Meissner, Max Nanis, Chunlei Wu
Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco
Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su
Recruiting for student, postdoc, outreach, and/or staff positions!