thesis defense, heather piwowar, sharing biomedical research data
DESCRIPTION
Presentation by Heather Piwowar as PhD dissertation defense on March 24, 2010 at the Dept of Biomedical Informatics, U of Pittsburgh. "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." I passed :)
TRANSCRIPT
Foundational studies for measuring the impact,
prevalence, and patterns of publicly sharing biomedical
research data
Heather Piwowar
Doctoral Defense
March 24, 2010
Department of Biomedical Informatics
University of Pittsburgh
Wendy Chapman, PhD
Brian Butler, PhD
Ellen Detlefsen, DLS
Madhavi Ganapathiraju, PhD
Gunther Eysenbach, MD, MPH
http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
http://www.flickr.com/photos/jsmjr/62443357/
http://www.flickr.com/photos/camilleharrington/3587294608/
http://www.flickr.com/photos/rkuhnau/3318245976/
http://www.flickr.com/photos/rkuhnau/3317418699/
http://www.flickr.com/photos/zemlinki/261617721/
http://www.flickr.com/photos/tracenmatt/3020786491/
http://www.flickr.com/photos/conformpdx/1796399674/
http://www.flickr.com/photos/the-o/2078239333/
lots of data sharing!
http://www.genome.jp/en/db_growth.html
but how much isn’t shared?
what isn’t shared?
who isn’t sharing it?
why not?
what can we do about it?
how much does it matter?
Prior studies: surveys and/or manual audits
http://www.flickr.com/photos/jima/606588905/
Blumenthal et al. Acad Med. 2006.
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics. 2001.
• small sample sizes
• relatively few variables
• self-reporting bias
• not much focus on measuring demonstrated behavior
• not much focus on rewards
• not much focus on policy
• not much focus on biomedical data other than DNA sequences
Limitations of related work
I believe analysis of the impact, prevalence, and patterns with which researchers share and withhold biomedical data can uncover rewards, best practices, and opportunities for increased adoption of data sharing.
http://www.flickr.com/photos/archeon/2941655917/
Goal of this dissertation:
Collect useful evidence on patterns of data sharing behaviour through methods that can be applied broadly, repeatably, and cost-effectively.
Aim 1: Does sharing have benefit for those who share?
Aim 2: Can sharing and withholding be systematically measured?
Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
Scope:
• raw research data
• upon study publication
• making data publicly available on the Internet
• one datatype
microarray data
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
Aim 1
Aim 1: Does sharing have benefit for those who share?
http://www.flickr.com/photos/sunrise/35819369/
Benefit of value: Citations.
dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)
citations: ISI Web of Science Citation index, citations from 2004-2005
data sharing locations: Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine
statistics: Multivariate linear regression
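If citation counts are log-transformed before fitting (an assumption here, suggested by the logarithmic scale of the citation plot), a regression coefficient reads as a multiplicative effect on citations. The sketch below is only an illustration of that idea: it fits a one-predictor least-squares line rather than the multivariate model used in Aim 1, the function name is made up for this example, and the citation counts are toy values, not data from the study.

```python
from math import exp, log


def ols_slope_intercept(x, y):
    """Ordinary least squares for a single predictor: y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy values for illustration only, NOT data from the study.
shared = [1, 1, 1, 0, 0, 0]            # 1 = data shared, 0 = not shared
citations = [40, 90, 60, 20, 45, 30]   # hypothetical citation counts

slope, intercept = ols_slope_intercept(shared, [log(c) for c in citations])

# With a binary predictor, exp(slope) is the ratio of geometric mean
# citation counts between the sharing and non-sharing groups.
citation_ratio = exp(slope)
```

A multivariate model would add covariates such as journal impact factor and publication date, so that the sharing coefficient is adjusted for them.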
Note the logarithmic scale
Conclusion: data sharing is associated with an increase in citation rate
Next:
What factors predict sharing?
http://www.flickr.com/photos/ryanr/142455033/
Can I use the same methods of Aim 1 to choose studies and determine data sharing status?
No, those methods don’t scale to identify or classify enough datapoints
Aim 2
Need automated methods to:
Aim 2a: Identify studies that create datasets
Aim 2b: Determine which of these have in fact been shared
Aim 2a: Identify studies that create gene expression microarray data
http://www.flickr.com/photos/lofaesofa/248546821/
Easy, via MeSH indexing terms?
gene expression profiling and/or
microarray analysis
Unfortunately, these have neither high recall nor precision.
Look for wetlab methods in full text:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
Query environment:
Full-text portals query 85% of articles available through U of Pittsburgh library digital subscriptions.
Development set?
Open access articles.
Features? Unigrams and bigrams from full text
Training classifications? Automatic filter for whether publication had an associated dataset deposited in a database
Feature selection and combination:
Derived query:
("gene expression" AND microarray AND cell AND rna)
AND (rneasy OR trizol OR "real-time pcr")
NOT ("tissue microarray*" OR "cpg island*")
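To make the boolean structure of the derived query concrete, here is a minimal sketch that applies the same AND / OR / NOT logic to a string of full text. It is an approximation: real full-text portals tokenize and stem in their own ways, and the helper names `contains` and `matches_derived_query` are invented for this example.

```python
import re


def contains(text: str, term: str) -> bool:
    """Case-insensitive whole-word match; a trailing '*' allows any word ending."""
    stem = term.rstrip("*")
    suffix = r"\w*" if term.endswith("*") else ""
    pattern = r"\b" + re.escape(stem) + suffix + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_derived_query(full_text: str) -> bool:
    """Apply the derived query: all required terms, at least one wet-lab term, no excluded phrase."""
    required = ["gene expression", "microarray", "cell", "rna"]
    any_of = ["rneasy", "trizol", "real-time pcr"]
    excluded = ["tissue microarray*", "cpg island*"]
    return (
        all(contains(full_text, term) for term in required)
        and any(contains(full_text, term) for term in any_of)
        and not any(contains(full_text, term) for term in excluded)
    )


# Hypothetical methods-section sentence, written for this example.
example = (
    "Total RNA was extracted with TRIzol from each cell line, "
    "and gene expression was profiled on a microarray."
)
is_candidate = matches_derived_query(example)
```

The wet-lab terms (RNeasy, TRIzol, real-time PCR) are what separate studies that generated expression data from those that only discuss it.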
Evaluation:
Ochsner et al. Nature Methods (2008) vol. 5 (12) pp. 991
• 400 studies across 20 journals
Precision: 90% (86% to 93%)
Recall: 56% (52% to 61%)
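For readers who want to reproduce this kind of evaluation, the sketch below computes precision and recall from confusion counts and attaches a 95% interval to a proportion. The slide does not say which interval method was used, so the Wilson score interval here is an assumption, and the counts in the example are hypothetical rather than the evaluation's actual numbers.

```python
from math import sqrt


def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts for illustration only, NOT taken from the evaluation.
tp, fp, fn = 90, 10, 70
precision, recall = precision_recall(tp, fp, fn)
precision_low, precision_high = wilson_interval(tp, tp + fp)
recall_low, recall_high = wilson_interval(tp, tp + fn)
```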
Conclusion: We derived a query with high precision and adequate recall to identify studies that created microarray data
Aim 2b
Aim 2b: Identify studies that share their expression microarray data
http://www.flickr.com/photos/dcassaa/422261773/
Querying GEO and ArrayExpress for PubMed IDs identified 77% of datasets that were publicly available somewhere on the internet.
Conclusion: we have a method to find most gene expression microarray datasets shared on the internet, without much bias.
Aim 3
Aim 3 – How often is data shared? What predicts sharing? How can we model sharing behavior?
Aim 2a +
Aim 2b +
lots of stats
http://www.flickr.com/photos/cogdog/123072/
Is research data shared after publication?
Funder: funded by NIH?, size of grant, sharing plan req’d?, funded by non-NIH?
Journal: impact factor, strength of policy, open access?, number of microarray studies published
Investigator: years since first paper, # pubs, # citations, previously shared?, previously reused?, gender
Institution: sector, size, impact rank, country
Study: humans?, mice?, plants?, cancer?, clinical trial?, number of authors, year
journal rank
“An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …”
http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
journal data sharing policy
institution rank
Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17
study type
Author publication history:
Citation counts:
Author-ity web service
Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11.
Author name disambiguation:
author “experience”
author gender
funding level
PubMed grant lists + NIH grant details
funder mandates
Requires a data sharing plan for studies funded after October 2003
that receive more than $500 000 in direct funding per year
Proxy for NIH data sharing policy applicability:
If in any year since 2004,
• funded by an NIH grant number with a “1” or “2” type code
• received more than $750 000 in total funding from the grant
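The proxy rule above can be written as a small predicate. This is a sketch under stated assumptions: the `GrantYear` record and its field names are hypothetical, "since 2004" is read as 2004 or later, and the two bullets are read as conditions that must both hold for the same grant in the same year.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GrantYear:
    """One NIH grant in one year (hypothetical record layout for this example)."""
    year: int
    type_code: str        # leading application type code of the grant number
    total_funding: float  # total funding from the grant in that year, in USD


def policy_proxy_applies(grant_years: Iterable[GrantYear]) -> bool:
    """True if any grant-year satisfies the proxy for NIH data sharing policy applicability."""
    return any(
        g.year >= 2004                      # assumption: "since 2004" includes 2004
        and g.type_code in {"1", "2"}
        and g.total_funding > 750_000
        for g in grant_years
    )


# Hypothetical funding history for one article.
history = [
    GrantYear(year=2003, type_code="1", total_funding=900_000.0),
    GrantYear(year=2005, type_code="2", total_funding=800_000.0),
]
applies = policy_proxy_applies(history)
```

The proxy uses total funding with a higher threshold because the policy itself is stated in terms of direct funding, which is a smaller figure than the total for the same grant.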
and so on...
124 variables
Univariate proportions
Factor analysis
Logistic regression
Second-order factor analysis
More logistic regression
stats
http://www.flickr.com/photos/blatzandchocolate/4281306244/
11,603 datapoints
we found shared datasets for 25%
results
[Figure: Proportion of articles with shared datasets, by year. x-axis: year article published, 2000 to 2009; y-axis: proportion of articles with datasets found in GEO or ArrayExpress, ticks from 0.05 to 0.35]
Across time
univariate analysis
[Figure: Journals. y-axis: proportion of datasets shared, 0.0 to 1.0; x-axis: journal. Journals in plotted order: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun]
[Figure: Institutions. y-axis: proportion of datasets shared, 0.0 to 1.0; x-axis: institution. Institutions in plotted order: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku]
[Figure: Institution rank. x-axis: institution rank, ticks from 1 to 1901; y-axis: proportion of datasets shared, 0.0 to 1.0]
multivariate analysis
factor analysis
logistic regression
[Figure: Multivariate nonlinear regressions with interactions. x-axis: Odds Ratio, ticks at 0.25, 0.50, 1.00, 2.00, 4.00, 8.00; legend: 0.95. Factors plotted:]
Has journal policy
Count of R01 & other NIH grants
Authors prev GEOAE sharing & OA & microarray creation
NO K funding or P funding
Institution high citations & collaboration
Journal impact
Journal policy consequences & long halflife
NOT animals or mice
Institution is government & NOT higher ed
Last author num prev pubs & first year pub
Large NIH grant
Humans & cancer
NO geo reuse + YES high institution output
First author num prev pubs & first year pub
second-order factor analysis
[Figure: second-order factor analysis of the first-order factors. Axis labels:]
Journal impact
Last author num prev pubs & first year pub
NO geo reuse + YES high institution output
Has journal policy
Large NIH grant
Count of R01 & other NIH grants
Humans & cancer
First author num prev pubs & first year pub
NOT animals or mice
Institution high citations & collaboration
Authors prev GEOAE sharing & OA & microarray creation
Journal policy consequences & long halflife
NO K funding or P funding
NOT institution NCI or intramural
Institution is government & NOT higher ed
logistic regression using second-order factors
[Figure: Multivariate nonlinear regression with interactions. x-axis: Odds Ratio, ticks at 0.25, 0.50, 1.00, 2.00, 4.00; legend: 0.95. Factors plotted:]
OA journal & previous GEO-AE sharing
Amount of NIH funding
Journal impact factor and policy
Higher Ed in USA
Cancer & humans
size of effect: split at the medians of the factors
Overall: 25%
Open access/previous sharing: 31%
Less OA/prev sharing: 19%
Cancer/human: 18%
Not cancer/human: 32%
Open access/previous sharing and cancer/human: 24%
Open access/previous sharing and not cancer/human: 37%
Less OA/prev sharing and cancer/human: 13%
Less OA/prev sharing and not cancer/human: 25%
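To relate these proportions to the odds-ratio scale used in the regression plots, the sketch below converts the median-split sharing rates (31% vs 19%, and 18% vs 32%) into unadjusted odds ratios. The function names are made up for this example, and the results are crude two-group comparisons, so they are not expected to match the adjusted odds ratios from the regression.

```python
def odds(p: float) -> float:
    """Convert a proportion strictly between 0 and 1 into odds."""
    if not 0 < p < 1:
        raise ValueError("proportion must be strictly between 0 and 1")
    return p / (1 - p)


def odds_ratio(p_group: float, p_reference: float) -> float:
    """Unadjusted odds ratio of sharing for one group relative to a reference group."""
    return odds(p_group) / odds(p_reference)


# Sharing rates from the median splits on the slide.
oa_vs_less_oa = odds_ratio(0.31, 0.19)       # more OA/previous sharing vs less
cancer_vs_not_cancer = odds_ratio(0.18, 0.32)  # cancer/human vs not cancer/human
```

These come out at about 1.9 and about 0.47, the same direction as the corresponding factors in the regression plots.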
Conclusions:
• data sharing rates are increasing, but overall levels are low
Preliminary evidence:
• levels are particularly low in cancer
• levels are highest for those who are publishing OA, have shared before

• data and filters were imperfect
• many assumptions
• didn’t capture all types of sharing
• don’t know how generalizable across datatypes
• should be considered hypothesis-generating
http://www.flickr.com/photos/vlastula/300102949/
Goal of this dissertation:
Collect useful evidence on patterns of data sharing behaviour through methods that can be applied broadly, repeatably, and cost-effectively.
contribution
• Aim 1 publication cited 45 times in Google Scholar, including by several editorials and books
• Aim 2 methods reused in a neuroethics study at UBC
• Aim 3 revealed evidence suggesting areas with high and low data sharing adoption for future study
• data collection was mostly automated using mostly free and open resources
• dataset, collection code, analysis scripts to be made openly available upon publication of thesis
http://www.flickr.com/photos/skrb/2427171774/
what’s next?
More data analysis
Including:
• Citation analysis of the 11,603 articles
• Analysis with a focus on policy variables
• Causality through structural equation modeling
doi/10.1371/journal.pone.0008469.g002
Begin to investigate reuse
http://www.flickr.com/photos/boitabulle/3668162701/
who reuses data?
when?
why aren’t they?
which datasets are most likely to be reused?
what can we do about it?
how many datasets could be reused but aren’t?
why?
who doesn’t?
what should we do about it?
Postdoctoral Research Associate in the Sharing, Preservation, and Stewardship of Scientific Data
Potential areas of focus include:
• overcoming social and technological barriers to data deposition among scientists
• the roles and interactions of individual scientists, journals/publishers, institutions, and the variety of disciplinary repositories
• ...
Post‐doc of my dreams
http://www.flickr.com/photos/gatewaystreets/3838452287/
Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it.
Dryad is a repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields.
The National Evolutionary Synthesis Center, NSF-funded:
• Duke University
• UNC at Chapel Hill
• North Carolina State University
Data sharing is hard.
I share my code and data at http://www.researchremix.org
It is hard.
Some is better than none.
Be the change you want to see.
http://www.flickr.com/photos/myklroventine/892446624/
Thanks to
the Dept of Biomedical Informatics at the U of Pittsburgh,
the NLM for funding through training grant 5 T15 LM007059,
those who openly publish their data, source code, papers, photos,
Dr. Wendy Chapman for her support and feedback,
My family.
http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/