open biomedical knowledge using crowdsourcing and citizen science

Post on 22-Jan-2018


Open biomedical knowledge using crowdsourcing and citizen science

Andrew Su, Ph.D.

@andrewsu

asu@scripps.edu

http://sulab.org

November 5, 2015

UCSD

Slides: slideshare.net/andrewsu


Candidate genes

FLNB

CTNNB1

EPHA3

SMAD3

XPO1

RPS27

FLCN

ATR

FLT3

BRD2

ERG

RAF1

EGFR

ERBB4

RARA

JAK3

LRP1

WT1

PML

SMARCA4

Candidate variants

chr1:g.156084782C>G

chr6:g.31911991G>T

chr19:g.3767338C>T

chr19:g.3783925C>T

chr7:g.552021G>A

chr3:g.123005609G>T

Biology is an INFORMATION science

Pietro Bellini https://flic.kr/p/k5jmja

Prioritization of human genetic variants

1000s of genetic variants

< 10 candidate genes

Filters

- Variant type
- Allele frequencies
- Previous clinical observation
- Predicted functional effects
- Gene function
- …
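The filter cascade above can be sketched as a chain of predicates applied in order, with the surviving count recorded after each step. This is only an illustration and not the pipeline used in the talk: the `Variant` fields, the 0.01 allele-frequency cutoff, and the sample records (`v1` to `v4`, `GENE_A` to `GENE_D`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Variant:
    """One variant record; every field here is a hypothetical stand-in."""
    variant_id: str
    gene: str
    variant_type: str         # e.g. "missense", "synonymous"
    allele_freq: float        # population allele frequency, 0.0 to 1.0
    predicted_damaging: bool  # output of some functional-effect predictor


def is_protein_altering(v: Variant) -> bool:
    return v.variant_type in {"missense", "nonsense", "frameshift"}


def is_rare(v: Variant) -> bool:
    return v.allele_freq < 0.01  # cutoff chosen only for illustration


def is_predicted_damaging(v: Variant) -> bool:
    return v.predicted_damaging


def apply_filters(
    variants: Iterable[Variant],
    filters: List[Callable[[Variant], bool]],
) -> Tuple[List[Variant], List[int]]:
    """Apply each filter in order; return survivors and the count after each step."""
    remaining = list(variants)
    counts = [len(remaining)]
    for keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append(len(remaining))
    return remaining, counts


# Made-up records, not data from the talk.
SAMPLE = [
    Variant("v1", "GENE_A", "synonymous", 0.0005, False),
    Variant("v2", "GENE_B", "missense", 0.2000, False),
    Variant("v3", "GENE_C", "missense", 0.0010, False),
    Variant("v4", "GENE_D", "nonsense", 0.0001, True),
]

survivors, counts = apply_filters(
    SAMPLE, [is_protein_altering, is_rare, is_predicted_damaging]
)
candidate_genes = sorted({v.gene for v in survivors})
print(counts, candidate_genes)
```

Recording the count after every step mirrors the funnel from thousands of variants down to a handful of candidate genes.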

Data integration as a cottage industry

dbNSFP

Data integration as hardened community software

dbNSFP

MyVariant.info

MyGene.info for integrating gene annotations

Gene

MyGene.info

MyGene.info for integrating gene annotations

http://mygene.info/metadata

Current version history

Current stats

MyGene.info for integrating gene annotations

[Figure: histogram of request time (ms) against frequency for the gene annotation service (/v2/gene)]

MyGene.info for integrating gene annotations

2 ~ 3M requests per month


MyGene.info for integrating gene annotations

2015 – 2018

Bioinformatician-friendly JSON output, REST API

http://MyGene.info/v2/gene/7157

http://MyVariant.info/v1/variant/chr7:g.55241707G>T
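A minimal sketch of calling these endpoints from Python, using only the standard library. The two base URLs are the ones shown on the slide (2015 API versions, which may since have changed). The helper names `gene_url`, `variant_url` and `fetch_json` are hypothetical, and the sketch assumes the service accepts a percent-encoded `>` in the variant ID.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URLs copied from the slide; newer API versions may exist.
MYGENE_BASE = "http://MyGene.info/v2/gene/"
MYVARIANT_BASE = "http://MyVariant.info/v1/variant/"


def gene_url(gene_id: int) -> str:
    """Build the gene annotation URL for a numeric gene ID."""
    return f"{MYGENE_BASE}{gene_id}"


def variant_url(hgvs_id: str) -> str:
    """Build the variant annotation URL, percent-encoding the '>' in the ID."""
    return MYVARIANT_BASE + quote(hgvs_id, safe=":.")


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and decode the JSON body (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(gene_url(7157))
    print(variant_url("chr7:g.55241707G>T"))
    # Uncomment to query the live service:
    # record = fetch_json(gene_url(7157))
    # print(sorted(record.keys()))
```

Keeping URL construction separate from the network call makes the helpers easy to check offline.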

Variant and gene prioritization

[Figure: counts at each step of the prioritization: 2441, 2308, 1917, 18, 9, 5]

https://github.com/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb

Open biomedical knowledge

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Open biomedical knowledge

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

The Gene Wiki project

- Protein structure
- Symbols and identifiers
- Tissue expression pattern
- Gene Ontology annotations
- Links to structured databases
- Gene summary
- Protein interactions
- Linked references

Huss, PLoS Biol, 2008


Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Centralizing key data storage

Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf


Loading biological data into Wikidata

Entrez Gene

Ensembl

UniProt

UCSC

PDB

RefSeq

Wikidata for biology

Reelin (Q414043)

http://www.wikidata.org/wiki/Q414043

- is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)
- regulates (Property:P128): Neural development (Q1345738)
- interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)

Wikidata for biology

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
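A minimal sketch of building this `wbgetentities` request and reading item-valued claims out of the response. The `format=json` parameter is added here and is not on the slide. The `SAMPLE_RESPONSE` fragment is hand-written to show the general shape of the API's JSON, and is not real API output. Pairing P31 with Q8054 and Q187126 follows the order in which they appear on the slide and is an assumption. The helper names are hypothetical.

```python
from typing import List
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"


def entity_url(qid: str, language: str = "en") -> str:
    """Build a wbgetentities URL for one item; format=json is an addition."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def claim_targets(response: dict, qid: str, prop: str) -> List[str]:
    """Return the item IDs that `qid` points to through property `prop`."""
    claims = response["entities"][qid].get("claims", {}).get(prop, [])
    targets = []
    for claim in claims:
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


# Hand-written, heavily trimmed fragment; not real API output.
SAMPLE_RESPONSE = {
    "entities": {
        "Q414043": {
            "claims": {
                "P31": [
                    {"mainsnak": {"datavalue": {"value": {"id": "Q8054"}}}},
                    {"mainsnak": {"datavalue": {"value": {"id": "Q187126"}}}},
                ]
            }
        }
    }
}

print(entity_url("Q414043"))
print(claim_targets(SAMPLE_RESPONSE, "Q414043", "P31"))
```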

~150k genes and proteins

~2k FDA-approved drugs

~7k human diseases

Centralizing key data storage

287 language editions of Wikipedia

Bioinformatics community, Toxicology community, Epidemiology community, …

Open biomedical knowledge

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

Open biomedical knowledge

Free text to structured data

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

The biomedical literature is massive…

[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; vertical axis from 0 to 1,200,000]


… but it is very hard to query and compute

Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib

AND

Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma
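To make the difficulty concrete, the two lists can be combined into one boolean keyword query. The sketch below assumes the terms within each list are joined with OR, since the slide shows only the AND between the lists. The helper names are hypothetical. The output is a plain keyword string, so it says nothing about how a drug and a disease are related in any matching article.

```python
from itertools import product
from typing import Iterable

DRUGS = [
    "Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
    "Sorafenib", "Lapatinib", "Dasatinib",
]
DISEASES = [
    "Acute myeloid leukemia", "Acute lymphoblastic leukemia",
    "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
    "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma",
]


def or_group(terms: Iterable[str]) -> str:
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def and_query(*groups: Iterable[str]) -> str:
    """Join several OR-groups with AND."""
    return " AND ".join(or_group(g) for g in groups)


query = and_query(DRUGS, DISEASES)
pairs = list(product(DRUGS, DISEASES))  # every drug-disease combination
print(query)
print(len(pairs), "drug-disease pairs hidden behind one keyword query")
```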

The Network of BioThings

1. Identify biomedical concepts in text

… We report a case of familial systemic mastocytosis with the rare KIT K509I germ line mutation. In vitro treatment with imatinib, dasatinib and PKC412 reduced cell viability of primary mast cells harboring KIT K509I mutation. Both patients with familial systemic mastocytosis had remarkable hematological and skin improvement after three months of imatinib treatment.

Leuk Res. 2014 Oct;38(10):1245-51. doi: 10.1016/j.leukres.

GENES

DISEASES

DRUGS

VARIANTS

The Network of BioThings

1. Identify biomedical concepts in text

2. Identify relationships between concepts

Concepts: imatinib, dasatinib, PKC412, familial systemic mastocytosis, KIT, K509I

Relationship labels: Mutation of, Mutation causes, causes, treats, inhibits
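One way to hold such a network in code is as a list of subject-predicate-object statements, each carrying its source so that every edge stays traceable. This is a sketch only. The `Statement` class and the `outgoing` helper are hypothetical, and which relationship connects which pair of concepts is an illustrative reading, since the slide's graph layout did not survive extraction.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str  # where the statement was found, for traceability


SOURCE = "Leuk Res. 2014 Oct;38(10):1245-51"

# Edge assignments are an illustrative reading, not taken from the slide.
STATEMENTS = [
    Statement("K509I", "mutation of", "KIT", SOURCE),
    Statement("K509I", "causes", "familial systemic mastocytosis", SOURCE),
    Statement("imatinib", "treats", "familial systemic mastocytosis", SOURCE),
    Statement("imatinib", "inhibits", "KIT", SOURCE),
]


def outgoing(statements: List[Statement], subject: str) -> Dict[str, List[str]]:
    """Group the objects reachable from `subject` by predicate."""
    edges = defaultdict(list)
    for s in statements:
        if s.subject == subject:
            edges[s.predicate].append(s.obj)
    return dict(edges)


print(outgoing(STATEMENTS, "imatinib"))
```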

Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

Question: Can Citizen Scientists collectively perform concept recognition in biomedical texts?

Simple annotation interface

Click to see instructions

Highlight disease mentions

15 workers annotate each abstract
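With several workers highlighting the same abstract, the individual highlights have to be merged into one answer. A common approach is a vote threshold: keep a span only if enough workers marked it. The sketch below is an assumption, not the aggregation rule used in the study, which this transcript does not state. Spans are `(start, end)` character offsets that must match exactly, and `min_votes` is a free parameter.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets in the abstract


def aggregate(worker_spans: Iterable[List[Span]], min_votes: int) -> Set[Span]:
    """Keep the spans highlighted by at least `min_votes` workers."""
    votes: Counter = Counter()
    for spans in worker_spans:
        votes.update(set(spans))  # each worker counts once per span
    return {span for span, n in votes.items() if n >= min_votes}


# Three hypothetical workers annotating one abstract.
workers = [
    [(0, 5), (10, 20)],
    [(0, 5)],
    [(0, 5), (10, 20), (30, 35)],
]
print(sorted(aggregate(workers, min_votes=2)))
```

Raising `min_votes` trades recall for precision, so the threshold would need tuning against a gold standard.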

Experts versus crowd for concept identification

593 PubMed abstracts

6,900 mentions of “disease concepts”

F = 0.87

F = 0.78

$$$

Experts versus crowd for concept identification

593 PubMed abstracts

6,900 mentions of “disease concepts”

F = 0.87

F = 0.87

$$$

• 9 days
• 145 workers
• Total: $630.96
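The F values on these slides are F-measures, which combine precision and recall into one number. The sketch below computes the balanced form (F1) over sets of annotated mentions. It assumes exact matching between predicted and gold mentions. The slides do not say which matching rule or F variant was used, and the integer sets are placeholders for mention identifiers.

```python
from typing import Hashable, Set


def f_score(predicted: Set[Hashable], gold: Set[Hashable]) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Placeholder mention identifiers, not data from the study.
crowd = {1, 2, 3, 4}
experts = {2, 3, 4, 5, 6}
print(round(f_score(crowd, experts), 3))
```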

Does Mechanical Turk scale?

1,000,000 articles per year

10 annotators / article

4 tasks / doc

$0.066 / task

$ 2,640,000 / year
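The yearly cost is the product of the four figures above. A short check, using `Decimal` so the currency arithmetic is exact (the variable names are illustrative):

```python
from decimal import Decimal

articles_per_year = 1_000_000
annotators_per_article = 10
tasks_per_doc = 4
cost_per_task = Decimal("0.066")  # dollars

tasks_per_year = articles_per_year * annotators_per_article * tasks_per_doc
cost_per_year = tasks_per_year * cost_per_task

print(f"{tasks_per_year:,} tasks per year")
print(f"${cost_per_year:,.0f} per year")
```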


http://mark2cure.org

Paid crowdsourcing ($$$)

• F = 0.87
• 9 days
• 145 workers
• Total: $630.96

Citizen Science (“Help science, please”)

• F = 0.84
• 28 days
• 212 workers
• Total cost: $0

Does Citizen Science scale?

(Number of annotation events per year) / (Number of annotation events per year per volunteer) = volunteers needed

(1,000,000 articles * 10 AE / article) / ((10,275 AE * 365 days) / (212 annotators * 28 days)) = 15,828 volunteers needed

AE = Annotation events
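The estimate can be reproduced step by step. The sketch assumes, as the slide's formula does, that each volunteer keeps annotating all year at the rate observed in the 28-day run. The result is rounded up, since a fraction of a volunteer cannot be recruited. The variable names are illustrative.

```python
import math

articles_per_year = 1_000_000
ae_per_article = 10         # AE = annotation events

observed_ae = 10_275        # annotation events in the observed run
observed_annotators = 212
observed_days = 28

ae_needed_per_year = articles_per_year * ae_per_article
ae_per_volunteer_per_year = observed_ae * 365 / (observed_annotators * observed_days)
volunteers_needed = math.ceil(ae_needed_per_year / ae_per_volunteer_per_year)

print(f"{ae_per_volunteer_per_year:.1f} AE per volunteer per year")
print(f"{volunteers_needed:,} volunteers needed")
```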

Does Citizen Science scale?

15,828 volunteers needed

175,000 volunteers

300,000 volunteers

37,000 volunteers

1,000,000 volunteers

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

therapeutic target

subject, predicate, object

GENE

DISEASE

Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

Nina Hale https://flic.kr/p/zoVih

Rare disease case study #1

Photo: Retta Beery


Bainbridge et al., STM, 2011


Photo: Retta Beery

Rare disease case study #2

… but no obvious treatments


Bainbridge et al., STM, 2011

SPR

What differentiates SPR and NGLY1?

SPR


Sarah Olmstead

https://flic.kr/p/364dZW

NGLY1

NGLY1 (11 PubMed articles)

- Congenital disorders of glycosylation (822)
- PNGase (686)
- ERAD (1330)
- glycosylation (48,862)
- alacrima (164)
- Genetic interactors (3016)
- symptoms (109,928)

24 million articles in PubMed

Mapping the biomedical network around NGLY1

NGLY1

A preliminary view of the NGLY1-focused biological network

Why do I Mark2Cure?

I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me.

My 4 year old daughter Phoebe is living with and battling rare disease.

I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.

Take part in something that helps humanity.

I Mark2Cure in memory of my son Mike who had type 1 diabetes.

Studied biology in college and I really miss it!

In memory of my daughter who had Cystic Fibrosis

Give back

Open biomedical knowledge

Free text to structured data

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

Contact

http://sulab.org

asu@scripps.edu

@andrewsu

Gene Wiki / Wikidata

Ben Good

Sebastian Burgstaller

Tim Putman

Julia Turner

Ginger Tsueng

Andra Waagmeester

Elvira Mitraka, UMB

Lynn Schriml, UMB

Justin Leong, UBC

Paul Pavlidis, UBC

Join the team!

http://bit.ly/JoinSuLab

Slides: slideshare.net/andrewsu

Funding and Support

BioGPS: GM83924

Gene Wiki: GM089820

MyGene / MyVariant: HG008473

BD2K COE: GM114833

Icon credits (Noun Project, Wikimedia Commons): Zach VanDeHey, hunotika, Viktorvoigt, Alberto Rojas, Lloyd Humphreys

Other Group members

Jake Bruggemann

Ramya Gamini

Karthik Gangavarapu

Louis Gioia

Toby Li

Greg Stupp

MyGene / MyVariant

Chunlei Wu

Cyrus Afrasiabi

Kevin Xin

Adam Mark

Mark2Cure

Max Nanis

Ginger Tsueng

Jennifer Fouquier

Ben Good

Chunlei Wu

All Mark2Curators!
