isb2012: the gene wiki: crowdsourcing human gene annotation
The Gene Wiki: Crowdsourcing human gene annotation
Andrew Su, Ph.D.
Department of Molecular and Experimental Medicine
The Scripps Research Institute
Biocuration 2012
April 2, 2012
The Long Tail is a prolific source of content
Figure: content produced vs. contributors (sorted), with a Short Head and a Long Tail
Category          Short Head         Long Tail
News              Newspapers         Blogs
Video             TV/Hollywood       YouTube
Product reviews   Consumer reports   Amazon reviews
Food reviews      Food critics       Yelp
Talent judging    Olympics           American Idol
Gene annotation   Manual curation    Gene Wiki
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
Figure: articles and words (millions), Wikipedia vs. Britannica Online
Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
Filtering, extracting, and summarizing PubMed
Documents
Concepts
Wiki success depends on positive feedback
Gene Wiki page utility
Number of users
Number of contributors
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: ~4.3 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
Good, NAR, 2011
Cumulative edits
Productive edits
Vandalism
~10,000 words added / month
4.3 million views / month
1000 edits / month
Total 1.42 million words ≈ 230 full-length articles
A review article for every gene is powerful
Hyperlinks to related concepts
References to the literature
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
Making the Gene Wiki more computable
Structured annotations
Free text
Filling the gaps in gene annotation
Wikilink
GO exact synonym
Gene Wiki mapping
NCBI Entrez Gene: 3362
GO:0004993
Candidate assertion
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
GO:0006897
Candidate assertion
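The two slides above combine three ingredients (a wikilink in a gene article, an exact match of the link target to a GO term name or synonym, and the Gene Wiki mapping from article to Entrez Gene ID) into a candidate assertion. A minimal sketch of that idea follows. The lookup tables, the article title, and the function name are hypothetical placeholders, and the label "endocytosis" for GO:0006897 is supplied here rather than taken from the slide.

```python
from typing import List, NamedTuple


class CandidateAssertion(NamedTuple):
    entrez_gene_id: int
    go_id: str
    evidence: str


# Hypothetical Gene Wiki mapping: article title -> NCBI Entrez Gene ID.
ARTICLE_TO_ENTREZ = {"Example gene article": 334}

# Hypothetical GO lexicon: lowercased exact names and exact synonyms -> GO ID.
GO_LABELS = {"endocytosis": "GO:0006897"}


def candidate_assertions(article_title: str, wikilinks: List[str]) -> List[CandidateAssertion]:
    """Pair the article's gene with every wikilink target that exactly matches a GO label."""
    gene_id = ARTICLE_TO_ENTREZ.get(article_title)
    if gene_id is None:
        return []  # article is not mapped to a gene
    found = []
    for target in wikilinks:
        go_id = GO_LABELS.get(target.strip().lower())
        if go_id is not None:
            found.append(CandidateAssertion(gene_id, go_id, f"wikilink to '{target}'"))
    return found


result = candidate_assertions("Example gene article", ["Endocytosis", "Cell membrane"])
```

Only the link that matches a GO label yields an assertion; the other link is ignored.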
Disease associations mined from the Gene Wiki
2147 candidate annotations
Gene Wiki Articles (10,271)
Filter out seeded text
NCBO Annotator
Compare to DO database
Matched Disease Ontology terms (2983)
70% have no match
2% match child
23% exact match
5% match parent
Good, BMC Genomics 2011, 12:603
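The "Compare to DO database" step sorts each mined term into one of four categories. The sketch below shows one way such a comparison could work on a toy ontology. The term IDs are hypothetical, and the reading of "match parent" (the mined term is more general than a known annotation) and "match child" (the mined term is more specific) is an assumption, since the slide does not define them.

```python
from typing import List, Set

# Toy ontology, child term -> parent term. All IDs are hypothetical.
PARENT_OF = {"T:2": "T:1", "T:3": "T:2"}


def ancestors(term: str) -> List[str]:
    """Walk up the toy hierarchy and collect every ancestor of a term."""
    found = []
    while term in PARENT_OF:
        term = PARENT_OF[term]
        found.append(term)
    return found


def classify(mined_term: str, known_terms: Set[str]) -> str:
    """Compare one mined term with the terms already annotated for the same gene."""
    if mined_term in known_terms:
        return "exact match"
    if any(mined_term in ancestors(known) for known in known_terms):
        return "match parent"  # assumed meaning: mined term is more general
    if any(known in ancestors(mined_term) for known in known_terms):
        return "match child"  # assumed meaning: mined term is more specific
    return "no match"


categories = [
    classify("T:2", {"T:2"}),
    classify("T:1", {"T:3"}),
    classify("T:3", {"T:1"}),
    classify("T:9", {"T:1"}),
]
```

Under this reading, the "no match" and "match child" categories together (about 72% of 2983) are close to the 2147 candidate annotations on the slide, although the slide does not state that this is how the candidates were counted.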
Disease associations mined from the Gene Wiki
Expert curation
Correct: 86%
Maybe: 4%
Incorrect: 10%
Overall specificity: 90-93%
Good, BMC Genomics 2011, 12:603
GO associations mined from the Gene Wiki
6319 candidate annotations
Gene Wiki Articles (10,271)
Filter out seeded text
NCBO Annotator
Compare to GO database
Matched Gene Ontology terms (11,022)
55% have no match
2% match child
17% exact match
26% match parent
Good, BMC Genomics 2011, 12:603
GO associations mined from the Gene Wiki
Expert curation
Correct / Maybe / Incorrect
Incorrect: 60%; remaining slices: 26% and 14%
Overall specificity: 48-64%
Good, BMC Genomics 2011, 12:603
Common sources of error in GO associations
OR2F1: “Olfactory receptors … are responsible for the recognition and G protein-mediated transduction of odorant signals.”
1) Incorrect concept recognition
Transduction (GO:0009293)
The transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.
Signal transduction (GO:0007165)
The cellular process in which a signal is conveyed to trigger a change in the activity or state of a cell. Signal transduction begins with reception of a signal, e.g. a ligand binding to a receptor or receptor activation by a stimulus such as light, and ends with regulation of a downstream cellular process…
Good, BMC Genomics 2011, 12:603
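The OR2F1 example can be reproduced with a toy dictionary matcher. This is only an illustration of how label-based concept recognition can pick the wrong term. It is not the NCBO Annotator, and the function name and the two-entry lexicon are assumptions built from the two GO terms on the slide.

```python
import re
from typing import List

# Two GO terms from the slide, keyed by lowercased label.
LEXICON = {
    "transduction": "GO:0009293",
    "signal transduction": "GO:0007165",
}


def recognize(text: str) -> List[str]:
    """Return GO IDs whose label occurs verbatim in the text, longest labels first."""
    lowered = text.lower()
    covered = []  # character spans already claimed by a longer label
    hits = []
    for label in sorted(LEXICON, key=len, reverse=True):
        for match in re.finditer(r"\b" + re.escape(label) + r"\b", lowered):
            inside = any(s <= match.start() and match.end() <= e for s, e in covered)
            if not inside:
                covered.append((match.start(), match.end()))
                hits.append(LEXICON[label])
    return hits


sentence = ("Olfactory receptors … are responsible for the recognition and "
            "G protein-mediated transduction of odorant signals.")
slide_hits = recognize(sentence)
control_hits = recognize("Signal transduction begins with reception of a signal.")
```

Because the sentence says "transduction of odorant signals" rather than "signal transduction", even a longest-match strategy returns only the phage-related term GO:0009293. The control sentence, which contains the full label, resolves to GO:0007165.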
Common sources of error in GO associations
MEF2C: “Several post translational modifications have been identified including phosphorylation on serine-59 …”
2) Incorrect sentence context
Dephosphorylation, Excretion, Gene expression, Glycosylation, Localization, Methylation, Proteolysis, Secretion, Transport, Transcription, Translation
MEF2C
Myelination
Phosphorylation
Neurogenesis
Good, BMC Genomics 2011, 12:603
Novel GO annotations – so what?
11,022 annotations mined from Gene Wiki
4703 (43%) match known annotations
~100,000 annotations from GO consortium
6319 “novel” annotations @ 48-64% specificity
Gene Wiki content improves enrichment analysis
GO term
Gene list
Concept recognition
PubMed abstracts
Enrichment analysis
axon guidance (GO:0007411)
264 genes
Linked genes through PubMed
P = 1.55 E-20
811 articles
        Yes     No
Yes     13      2
No      251     12033
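The 2×2 table for axon guidance can be turned into an enrichment p-value with a one-sided hypergeometric (Fisher's exact) test. The slide does not name the test that was used, so the choice of test here is an assumption, as is the reading of the table (13 genes both annotated with the GO term and linked through PubMed, out of 264 annotated genes and 15 linked genes). The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when drawing n items without replacement from N, K of which are successes."""
    total = comb(N, n)
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total  # exact integer arithmetic until the final division


# Cell counts as read from the slide.
both, linked_only, annotated_only, neither = 13, 2, 251, 12033

N = both + linked_only + annotated_only + neither  # all genes considered
K = both + annotated_only                          # genes annotated with the GO term
n = both + linked_only                             # genes linked through PubMed

p_value = hypergeom_upper_tail(both, K, n, N)
```

With these inputs the p-value is on the order of 1e-20, in line with the value shown on the slide.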
Gene Wiki content improves enrichment analysis
GO term
Gene list
Concept recognition
PubMed abstracts + Gene Wiki
Enrichment analysis
muscle contraction (GO:0006936)
87 genes
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22 E-09
251 articles
87 articles
Gene Wiki content improves enrichment analysis
Figure: p-value (PubMed only) vs. p-value (PubMed + GW)
Regions: more significant with PubMed + GW; more significant with PubMed only
Labeled point: Muscle contraction
Challenges and future directions
• How to complement and integrate with traditional biocuration workflows?
• How to disseminate and utilize crowdsourced annotations?
The Long Tail of scientists is a valuable source of
information on gene function
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Erik Clarke
Ben Good (*)
Salvatore Loguercio
Ian Macleod
Chunlei Wu
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su
See poster # 30 for more on the Gene Wiki and crowdsourcing in biology!
Making the Gene Wiki more reliable
The company name is derived from old Greek, and means
"destroyer of birds".
Novartis is a multinational pharmaceutical company
based in Basel, Switzerland that manufactures drugs such
as clozapine (Clozaril), diclofenac (Voltaren), …
http://www.wikitrust.net/
High-trust author: 36211 total edits
Low-trust author: 36 total edits