![Page 1: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/1.jpg)
Microtask crowdsourcing for annotating diseases in PubMed abstracts
Andrew Su, Ph.D. (@andrewsu)
http://sulab.org
October 20, 2014
ASHG
Slides: slideshare.net/andrewsu
![Page 2: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/2.jpg)
Potential conflicts of interest
• Novartis
• Assay Depot
• Avera Health
![Page 3: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/3.jpg)
Genome-scale profiling
Condition A vs. Condition B
• RNA-seq
• Exome seq
• Whole genome seq
• Proteomics
• Genotyping
• Copy-number analysis
• ChIP-seq
• Methylation
• Functional genomics
→ Candidate genes/proteins
![Page 4: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/4.jpg)
Candidate genes/proteins
• Related diseases
• Related drugs
• Related pathways
![Page 5: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/5.jpg)
Databases are fragmented and incomplete
Disease links for Apolipoprotein E
• KEGG (4)
• OMIM (6)
• PharmGKB (10)
• HuGE Navigator (517)
[Figure: Venn diagram of the overlap in disease links among the four databases]
![Page 6: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/6.jpg)
![Page 7: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/7.jpg)
[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013 (y-axis: 0 to 1,200,000)]
![Page 8: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/8.jpg)
![Page 9: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/9.jpg)
Harnessing the crowd…
Image: http://www.flickr.com/photos/portland_mike/6140660504/
![Page 10: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/10.jpg)
… to organize information
Image: http://www.flickr.com/photos/45697441@N00/6629580443
![Page 11: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/11.jpg)
Information extraction for a Network of BioThings
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
Concepts: Genes/proteins, Diseases, Drugs, Pathways
![Page 12: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/12.jpg)
The NCBI Disease corpus
• 793 PubMed abstracts
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
![Page 13: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/13.jpg)
Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
![Page 14: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/14.jpg)
Experimental design
Task: Identify the disease mentions in the PubMed abstracts from the NCBI disease corpus
– 5 non-scientists annotate each abstract
– The details:
• Recruit workers using Amazon Mechanical Turk
• Pay $0.066 per Human Intelligence Task (HIT)
• HIT = annotate one abstract from PubMed
![Page 15: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/15.jpg)
Instructions to workers
• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms - physical results of having a disease
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
![Page 16: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/16.jpg)
Aggregation function based on simple voting
The same example passage is shown once per vote threshold, K=1 (1 or more votes), K=2, K=3, and K=4, with the text that received at least K worker votes highlighted:
“This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
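The voting scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: it assumes each worker's highlights arrive as `(start, end)` character offsets and that votes are counted per character, neither of which the slide specifies. The function name `aggregate_by_votes` and the sample text and offsets are hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Return merged (start, end) spans covering characters marked by >= k workers.

    worker_spans: one list of (start, end) character spans per worker, end exclusive.
    Assumption: votes are tallied per character, which the slide does not confirm.
    """
    votes = Counter()
    for spans in worker_spans:
        # A worker votes at most once per character, even if their spans overlap.
        marked = set()
        for start, end in spans:
            marked.update(range(start, end))
        votes.update(marked)

    kept = sorted(pos for pos, n in votes.items() if n >= k)

    # Merge runs of consecutive character positions back into spans.
    merged = []
    for pos in kept:
        if merged and pos == merged[-1][1]:
            merged[-1][1] = pos + 1
        else:
            merged.append([pos, pos + 1])
    return [tuple(span) for span in merged]


# Hypothetical example: three workers highlight slightly different spans.
text = "efficacious in leukemia cells and in acute myeloid leukemia samples"
workers = [
    [(15, 23), (37, 59)],  # "leukemia", "acute myeloid leukemia"
    [(15, 23), (43, 59)],  # "leukemia", "myeloid leukemia"
    [(15, 29), (51, 59)],  # "leukemia cells", "leukemia"
]
for k in (1, 2, 3):
    print(k, [text[s:e] for s, e in aggregate_by_votes(workers, k)])
```

Raising K shrinks the aggregated highlights toward the text that most workers agree on, which is the precision/recall trade-off the following slides explore.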
![Page 17: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/17.jpg)
Comparison to gold standard
F score = 0.81
[Figure: precision and recall of the crowd annotations against the gold standard]
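For reference, precision, recall, and F score against a gold standard can be computed as below. This is a sketch under stated assumptions: a predicted span counts as correct only if it exactly matches a gold span, and F is the balanced harmonic mean (F1). The slides do not say which matching rule or F variant was used, and the function name and sample spans are hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Score predicted spans against gold spans using exact-match comparison.

    Assumption: exact span matching and balanced F (F1); the talk may have
    used a different matching rule.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical spans: 2 of 4 crowd spans match 2 of 3 gold spans.
gold = {(15, 23), (37, 59), (70, 78)}
crowd = {(15, 23), (37, 59), (90, 95), (100, 104)}
p, r, f = precision_recall_f(crowd, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```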
![Page 18: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/18.jpg)
Comparison to gold standard
[Figure: plot with y-axis 0 to 0.9 and x-axis 0 to 18; points labeled by vote threshold k for each number of workers N]
N = 3, 6, 9, 12, 15, 18
Max F = 0.69, 0.79, 0.82, 0.85, 0.85, 0.85
![Page 19: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/19.jpg)
![Page 20: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/20.jpg)
![Page 21: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/21.jpg)
![Page 22: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/22.jpg)
Comparison to gold standard
F = 0.76 – score of single Ph.D. annotator
F = 0.87 – agreement between multiple Ph.D. annotators
![Page 23: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/23.jpg)
Crowd-based biocuration
• 7 days
• 17 workers
• $192.90
Professional biocuration
• Many months
• 12 experts
• $150,000+
In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.
![Page 24: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/24.jpg)
Information extraction for a Network of BioThings
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
Concepts: Genes/proteins, Diseases, Drugs, Pathways
![Page 25: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/25.jpg)
Vision-based Citizen Science
• Galaxy Zoo (galaxy classification; 110M+ classifications, 300k+ volunteers)
• Foldit (protein folding; 350k+ players)
• Eterna (RNA folding; 80k players)
• Eyewire (3D neuron structure determination; 130k volunteers)
• Phylo (multiple sequence alignment; 30k+ players, 285k alignments)
• …
![Page 27: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)](https://reader034.vdocuments.us/reader034/viewer/2022042816/55942f201a28ab3d3d8b462f/html5/thumbnails/27.jpg)
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)
The Su Lab
Chunlei Wu
Ben Good
Salvatore Loguercio
Max Nanis
Louis Gioia
Ramya Gamini
Greg Stupp
Ginger Tsueng
Erick Scott
Vyshakh Babji
Karthik Gangavarapu
Adam Mark
Key Alumni
Katie Fisch
Tobias Meissner
Key Collaborators
Andra Waagmeester
Lynn Schriml
Peter Robinson
Contact
http://sulab.org
@andrewsu
+Andrew Su
We are recruiting programmers, postdocs, and awesome people of all kinds!
bit.ly/SuLabJobs
We are hosting a hackathon Nov 7-9 for the Network of BioThings
bit.ly/hackNoB