thesis defense, heather piwowar, sharing biomedical research data

Post on 01-Nov-2014

Category:

Health & Medicine

DESCRIPTION

Presentation by Heather Piwowar as PhD dissertation defense on March 24, 2010 at the Dept of Biomedical Informatics, U of Pittsburgh. "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." I passed :)

TRANSCRIPT

Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data

Heather Piwowar
Doctoral Defense
March 24, 2010

Department of Biomedical Informatics
University of Pittsburgh

Wendy Chapman, PhD
Brian Butler, PhD
Ellen Detlefsen, DLS
Madhavi Ganapathiraju, PhD
Gunther Eysenbach, MD, MPH

http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm

http://www.flickr.com/photos/jsmjr/62443357/

http://www.flickr.com/photos/camilleharrington/3587294608/

http://www.flickr.com/photos/rkuhnau/3318245976/

http://www.flickr.com/photos/rkuhnau/3317418699/

http://www.flickr.com/photos/zemlinki/261617721/

http://www.flickr.com/photos/tracenmatt/3020786491/

http://www.flickr.com/photos/conformpdx/1796399674/

http://www.flickr.com/photos/the-o/2078239333/

lots of data sharing!

http://www.genome.jp/en/db_growth.html

but how much isn’t shared?

what isn’t shared?

who isn’t sharing it?

why not?

what can we do about it?

how much does it matter?

Prior studies: surveys and/or manual audits

http://www.flickr.com/photos/jima/606588905/

Blumenthal et al. Acad Med. 2006.
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics. 2001.

Limitations of related work

• small sample sizes
• relatively few variables
• self-reporting bias
• not much focus on measuring demonstrated behavior
• not much focus on rewards
• not much focus on policy
• not much focus on biomedical data other than DNA sequences

I believe analysis of the impact, prevalence, and patterns with which researchers share and withhold biomedical data can uncover rewards, best practices, and opportunities for increased adoption of data sharing.

http://www.flickr.com/photos/archeon/2941655917/

Goal of this dissertation:

Collect useful evidence on patterns of data sharing behaviour through methods that can be applied broadly, repeatably, and cost-effectively.

Aim 1:  Does sharing have benefit for those who share?

Aim 2:  Can sharing and withholding be systematically measured? 

Aim 3:  How often is data shared?  What predicts sharing?  How can we model sharing behavior?

Scope:

• raw research data
• upon study publication
• making data publicly available on the Internet
• one datatype

microarray data

http://en.wikipedia.org/wiki/DNA_microarray

http://en.wikipedia.org/wiki/Image:Heatmap.png

http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG

Aim 1

Aim 1:  Does sharing have benefit for those who share?

http://www.flickr.com/photos/sunrise/35819369/

Benefit of value:  Citations.

dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)

citations: ISI Web of Science Citation index, citations from 2004-2005

data sharing locations: Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine

statistics: Multivariate linear regression
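The regression step named on this slide can be sketched as follows. This is a minimal illustration, not the dissertation's analysis script: the covariate (journal impact factor), the coefficients, and the six toy studies are all made up for the example, and the log transform of the citation counts is an assumption suggested by the logarithmic scale used in the results plot.

```python
import math

def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    n_features = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n_features)]
           + [sum(r[i] * y for r, y in zip(rows, targets))]
           for i in range(n_features)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_features):
        pivot = max(range(col, n_features), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n_features):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n_features] / aug[i][i] for i in range(n_features)]

# Synthetic toy data generated from made-up coefficients (NOT study data):
# intercept, effect of sharing, effect of a hypothetical impact-factor covariate.
TRUE_COEFS = [1.0, 0.5, 0.1]
toy_studies = [(0, 2.0), (1, 3.0), (0, 5.0), (1, 8.0), (0, 10.0), (1, 4.0)]

rows = [[1.0, float(shared), impact] for shared, impact in toy_studies]
citations = [math.exp(sum(c * x for c, x in zip(TRUE_COEFS, row))) for row in rows]

# Regress log(citations) on sharing status plus the covariate.
coefs = fit_ols(rows, [math.log(c) for c in citations])

# On the log scale, exp(coefficient) is a multiplicative change in citations.
sharing_ratio = math.exp(coefs[1])
```

Because the outcome is on a log scale, the sharing coefficient is read as a ratio of citation counts between sharing and non-sharing studies with the covariate held fixed.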

Note the logarithmic scale

Conclusion:  data sharing is associated with an increase in citation rate

Next:

What factors predict sharing?

http://www.flickr.com/photos/ryanr/142455033/

Can I use the same methods of Aim 1 to choose studies and determine data sharing status?

No, those methods don’t scale to identify or classify enough datapoints

Aim 2

Need automated methods to:

Aim 2a: Identify studies that create datasets

Aim 2b: Determine which of these have in fact been shared

Aim 2a: Identify studies that create gene expression microarray data

http://www.flickr.com/photos/lofaesofa/248546821/

Easy, via MeSH indexing terms?

gene expression profiling and/or

microarray analysis

Unfortunately, these have neither high recall nor precision.

Look for wetlab methods in full text:

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745

Query environment:

Full-text portals query 85% of articles available through U of Pittsburgh library digital subscriptions.

Development set?

Open access articles.

Features? Unigrams and bigrams from full text

Training classifications? Automatic filter for whether publication had an associated dataset deposited in a database

Feature selection and combination:

Derived query:

("gene expression" AND microarray AND cell AND rna)

AND (rneasy OR trizol OR "real-time pcr")

NOT ("tissue microarray*" OR "cpg island*")
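To make the boolean logic of the derived query concrete, here is a small sketch that applies it to a string of article text. It is only an approximation of how a full-text portal would evaluate the query: the whole-word, case-insensitive matching and the handling of the trailing `*` wildcard are assumptions, and the function names are hypothetical.

```python
import re

def has_term(text, term):
    """True if the term occurs in text as whole words, ignoring case.

    A trailing * is treated as "any word ending" (an assumed wildcard rule).
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def matches_derived_query(full_text):
    """Apply the three clauses of the derived query to one article's text."""
    # ("gene expression" AND microarray AND cell AND rna)
    required = all(has_term(full_text, t)
                   for t in ("gene expression", "microarray", "cell", "rna"))
    # AND (rneasy OR trizol OR "real-time pcr")
    wetlab = any(has_term(full_text, t)
                 for t in ("rneasy", "trizol", "real-time pcr"))
    # NOT ("tissue microarray*" OR "cpg island*")
    excluded = any(has_term(full_text, t)
                   for t in ("tissue microarray*", "cpg island*"))
    return required and wetlab and not excluded

# Made-up sentences used only to exercise the matcher.
creates_data = ("Total RNA was isolated with TRIzol from each cell line, "
                "and gene expression was profiled on a microarray.")
tissue_array = creates_data + " Tissue microarrays were stained for validation."
no_wetlab = "We reanalysed gene expression microarray data from cell RNA."
```

A real portal may stem words or search metadata fields differently, so results from this sketch would not necessarily match the portal's hit list.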

Evaluation:

Ochsner et al. Nature Methods (2008) vol. 5 (12) pp. 991

• 400 studies across 20 journals

Precision: 90% (86% to 93%) Recall: 56% (52% to 61%)
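For readers unfamiliar with these measures, the sketch below shows how precision, recall, and an approximate 95% interval can be computed from confusion counts. The slides do not say which interval method was used, so the Wilson score interval here is an assumption, and the counts are hypothetical toy values rather than the study's counts.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / total
                               + z * z / (4.0 * total * total)) / denom
    return center - half_width, center + half_width

def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Hypothetical confusion counts, NOT the counts from the evaluation set.
tp, fp, fn = 18, 2, 14
precision, recall = precision_recall(tp, fp, fn)
precision_ci = wilson_interval(tp, tp + fp)
recall_ci = wilson_interval(tp, tp + fn)
```

With counts this small the intervals are much wider than those reported on the slide, which were based on 400 studies.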

Conclusion:  We derived a query with high precision and adequate recall to identify studies that created microarray data

Aim 2b

Aim 2b: Identify studies that share their expression microarray data

http://www.flickr.com/photos/dcassaa/422261773/

Querying GEO and ArrayExpress for PubMed IDs identified 77% of datasets that were publicly available somewhere on the internet.
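The classification rule implied by this slide can be sketched as a simple lookup. The sketch assumes the repository queries have already been run and reduced to sets of PubMed IDs that each database links to a dataset; the function name and the placeholder IDs are hypothetical, and no real repository API is called.

```python
def classify_sharing(pubmed_ids, geo_linked_ids, arrayexpress_linked_ids):
    """Map each PubMed ID to True if either repository links a dataset to it."""
    linked = set(geo_linked_ids) | set(arrayexpress_linked_ids)
    return {pmid: pmid in linked for pmid in pubmed_ids}

def proportion_shared(classification):
    """Fraction of articles for which a linked dataset was found."""
    return sum(classification.values()) / len(classification)

# Placeholder identifiers, not real PubMed IDs.
articles = ["pmid-a", "pmid-b", "pmid-c", "pmid-d"]
in_geo = {"pmid-a"}
in_arrayexpress = {"pmid-a", "pmid-c"}

status = classify_sharing(articles, in_geo, in_arrayexpress)
rate = proportion_shared(status)
```

Datasets shared only on lab or publisher websites would be counted as not shared by this rule, which is why the slide reports that the approach finds most, but not all, publicly available datasets.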

Conclusion:  we have a method to find most gene expression microarray datasets shared on the internet, without much bias.

Aim 3

Aim 3 – How often is data shared? What predicts sharing? How can we model sharing behavior?

Aim 2a + 

Aim 2b + 

lots of stats

http://www.flickr.com/photos/cogdog/123072/

Is research data shared after publication?

Funder
• funded by NIH?
• size of grant
• sharing plan req’d?
• funded by non-NIH?

Journal
• impact factor
• strength of policy
• open access?
• number of microarray studies published

Investigator
• years since first paper
• # pubs
• # citations
• previously shared?
• previously reused?
• gender

Institution
• sector
• size
• impact rank
• country

Study
• humans?
• mice?
• plants?
• cancer?
• clinical trial?
• number of authors
• year

journal rank

“An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …”

http://www.nature.com/authors/editorial_policies/availability.html

http://www.nature.com/nature/journal/v453/n7197/index.html

journal data sharing policy

institution rank

Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17

study type

Author publication history:

Citation counts:

Author-ity web service

Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11.

Author name disambiguation:

author “experience”

author gender

funding level

PubMed grant lists + NIH grant details

funder mandates

Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year

Proxy for NIH data sharing policy applicability:

If in any year since 2004,

• funded by an NIH grant number with a “1” or “2” type code

• received more than $750 000 in total funding from the grant
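The proxy rule above can be written as a short predicate. This is a sketch under stated assumptions: the `GrantYear` record and its field names are hypothetical, "since 2004" is read as 2004 inclusive, and the $750 000 threshold is applied to the total funding received from the grant in a single year.

```python
from dataclasses import dataclass

@dataclass
class GrantYear:
    """One year of funding from one NIH grant (hypothetical record layout)."""
    fiscal_year: int
    type_code: str        # leading type code of the NIH grant number, e.g. "1" or "2"
    total_funding: float  # total dollars received from the grant in that year

def policy_proxy_applies(grant_years):
    """True if any grant-year meets all three conditions of the proxy."""
    return any(
        g.fiscal_year >= 2004               # assumed: "since 2004" includes 2004
        and g.type_code in {"1", "2"}
        and g.total_funding > 750_000
        for g in grant_years
    )

# Made-up funding histories for illustration only.
large_new_grant = [GrantYear(2005, "1", 900_000.0)]
small_grant = [GrantYear(2006, "1", 300_000.0)]
too_early = [GrantYear(2002, "2", 1_200_000.0)]
other_type_code = [GrantYear(2007, "5", 1_000_000.0)]
```

The proxy uses total funding because direct costs, the quantity the actual policy refers to, were presumably not available in the grant data used.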

and so on...

124 variables

Univariate proportions

Factor analysis

Logistic regression

Second-order factor analysis

More logistic regression

stats
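As a reminder of how a logistic regression yields the odds ratios reported later in the deck, here is a minimal sketch with a single binary predictor, fitted by Newton's method. The actual analysis was multivariate and used derived factors; the single predictor, the toy counts, and the function name are assumptions made for illustration.

```python
import math

def fit_logistic_one_predictor(xs, ys, iterations=25):
    """Fit logit(P(y=1)) = b0 + b1 * x by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data, NOT study data: x = 1 if a hypothetical attribute is present,
# y = 1 if a shared dataset was found.
# With the attribute: 30 shared, 20 not. Without: 15 shared, 35 not.
xs = [1] * 50 + [0] * 50
ys = [1] * 30 + [0] * 20 + [1] * 15 + [0] * 35

intercept, slope = fit_logistic_one_predictor(xs, ys)
odds_ratio = math.exp(slope)
```

With one binary predictor the fitted odds ratio equals the cross-product ratio of the 2x2 table, here (30 × 35) / (20 × 15); with many predictors each odds ratio is adjusted for the others.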

http://www.flickr.com/photos/blatzandchocolate/4281306244/

11,603 datapoints

we found shared datasets for 25%

results

Across time

[Figure: Proportion of articles with shared datasets, by year. Y-axis: proportion of articles with datasets found in GEO or ArrayExpress (0.05 to 0.35). X-axis: year article published (2000 to 2009).]

univariate analysis

[Figure: Journals. Y-axis: proportion of datasets shared (0.0 to 1.0). X-axis labels, in the order shown: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun.]

[Figure: Institutions. Y-axis: proportion of datasets shared (0.0 to 1.0). X-axis labels, in the order shown: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku.]

[Figure: Institution rank. Y-axis: proportion of datasets shared (0.0 to 1.0). X-axis: institution rank, with ticks from 1 to 1901.]

multivariate analysis

factor analysis

logistic regression

[Figure: Multivariate nonlinear regressions with interactions. X-axis: odds ratio on a logarithmic scale (0.25, 0.50, 1.00, 2.00, 4.00, 8.00); plot legend "0.95". One row per first-order factor:]

• Has journal policy
• Count of R01 & other NIH grants
• Authors prev GEOAE sharing & OA & microarray creation
• NO K funding or P funding
• Institution high citations & collaboration
• Journal impact
• Journal policy consequences & long halflife
• NOT animals or mice
• Institution is government & NOT higher ed
• Last author num prev pubs & first year pub
• Large NIH grant
• Humans & cancer
• NO geo reuse + YES high institution output
• First author num prev pubs & first year pub

second-order factor analysis

[Figure: second-order factor analysis, with the following first-order factors as row and column labels:]

• Journal impact
• Last author num prev pubs & first year pub
• NO geo reuse + YES high institution output
• Has journal policy
• Large NIH grant
• Count of R01 & other NIH grants
• Humans & cancer
• First author num prev pubs & first year pub
• NOT animals or mice
• Institution high citations & collaboration
• Authors prev GEOAE sharing & OA & microarray creation
• Journal policy consequences & long halflife
• NO K funding or P funding
• NOT institution NCI or intramural
• Institution is government & NOT higher ed

logistic regression using second-order factors

[Figure: Multivariate nonlinear regression with interactions. X-axis: odds ratio on a logarithmic scale (0.25, 0.50, 1.00, 2.00, 4.00); plot legend "0.95". One row per second-order factor:]

• OA journal & previous GEO-AE sharing
• Amount of NIH funding
• Journal impact factor and policy
• Higher Ed in USA
• Cancer & humans

size of effect: split at the medians of the factors

Overall: 25%

Open access/previous sharing: 31%
• cancer/human: 24%
• Not cancer/human: 37%

Less OA/prev sharing: 19%
• cancer/human: 13%
• Not cancer/human: 25%

cancer/human: 18%

Not cancer/human: 32%
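The "split at the medians" comparison can be sketched as follows: articles are divided into those scoring above and those at or below the median of a factor, and the proportion with shared data is computed in each half. The factor scores and sharing flags below are toy values, the function name is hypothetical, and placing ties in the lower group is an assumption.

```python
from statistics import median

def sharing_rate_by_median_split(factor_scores, shared_flags):
    """Return (rate above the median, rate at or below the median)."""
    cutoff = median(factor_scores)
    above = [s for f, s in zip(factor_scores, shared_flags) if f > cutoff]
    at_or_below = [s for f, s in zip(factor_scores, shared_flags) if f <= cutoff]
    return sum(above) / len(above), sum(at_or_below) / len(at_or_below)

# Toy factor scores and sharing outcomes (1 = shared dataset found), NOT study data.
scores = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 2.0, 2.2]
shared = [0, 0, 1, 0, 1, 0, 1, 1]

high_rate, low_rate = sharing_rate_by_median_split(scores, shared)
```

Applying the same split to two factors at once gives the four-cell breakdown shown on the slide.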

Conclusions:

• data sharing rates are increasing, but overall levels are low

Preliminary evidence:
• levels are particularly low in cancer
• levels are highest for those who are publishing OA, have shared before

• data and filters were imperfect
• many assumptions
• didn’t capture all types of sharing
• don’t know how generalizable across datatypes
• should be considered hypothesis-generating

http://www.flickr.com/photos/vlastula/300102949/

Goal of this dissertation:

Collect useful evidence on patterns of data sharing behaviour through methods that can be applied broadly, repeatably, and cost-effectively.

contribution

• Aim 1 publication cited 45 times in Google Scholar, including by several editorials and books
• Aim 2 methods reused in a neuroethics study at UBC
• Aim 3 revealed evidence suggesting areas with high and low data sharing adoption for future study
• data collection was mostly automated using mostly free and open resources
• dataset, collection code, analysis scripts to be made openly available upon publication of thesis

http://www.flickr.com/photos/skrb/2427171774/

what’s next?

More data analysis

Including:
• Citation analysis of the 11,603 articles
• Analysis with a focus on policy variables
• Causality through structural equation modeling

doi/10.1371/journal.pone.0008469.g002

Begin to investigate reuse

http://www.flickr.com/photos/boitabulle/3668162701/

who reuses data?

when?

why aren’t they?

which datasets are most likely to be reused?

what can we do about it?

how many datasets could be reused but aren’t?

why?

who doesn’t?

what should we do about it?

Postdoctoral Research Associate in the Sharing, Preservation, and Stewardship of Scientific Data

Potential areas of focus include:
• overcoming social and technological barriers to data deposition among scientists
• the roles and interactions of individual scientists, journals/publishers, institutions, and the variety of disciplinary repositories
• ...

Post‐doc of my dreams

http://www.flickr.com/photos/gatewaystreets/3838452287/

Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it.

Dryad is a repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields.

The National Evolutionary Synthesis Center, NSF-funded:

• Duke University
• UNC at Chapel Hill
• North Carolina State University

Data sharing is hard.

I share my code and data at http://www.researchremix.org

It is hard.
Some is better than none.
Be the change you want to see.

http://www.flickr.com/photos/myklroventine/892446624/

Thanks to

the Dept of Biomedical Informatics at the U of Pittsburgh,

the NLM for funding through training grant 5 T15 LM007059,

those who openly publish their data, source code, papers, photos,

Dr. Wendy Chapman for her support and feedback,

My family.

http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/
