Linking literature to data in the life sciences
OpenAIREplus workshop, Copenhagen, 11 June 2012
Overview
• What literature? What data?
• How we make literature-data connections
• Case study
• Challenges and future directions
What literature? What data?
Data Landscape and Definitions
• Big Data: Deposition
• Primary Research articles
• Big Data: Curated
• Annotation
• Unstructured Data
• Funder mandates
• Journal requirements
• Metadata Standards
• *reuse
PMC336623 Extended to several other biological data types
[Figure: charts of database size by year, 2001-2010. Panels: European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures).]
• Big data
• Thematic data
• Public data
• Archived data
• Two petabytes of data
• Scales to 7 PB of raw disk
• Majority is DNA
Two core literature databases
• 26 million abstracts (PubMed, Patents, Agricola)
• Website and web services
• 2.2 million full-text articles (217K articles with supplementary data)
• Website
• Citation networks
• Database links
• Whatizit text mining
• Supplemented by CiteXplore
• Additional text mining
• Over 1.1 million new records per year
• Over 150K new articles per year
UK PubMed Central Overview
• Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006
• Led by the European Bioinformatics Institute since 2011, with the British Library and the University of Manchester
• Supported by 16 UK and 2 European funders, led by the Wellcome Trust. Research spend: ~2 billion GBP
• A life-science web-based repository
• Manuscript submission service (self archiving by grant holders)
• Database of grant information, with details of about 18,000 PIs
• Grant reporting and funder analysis tool
• 250K requests, 40K IPs, 7K direct interactive searches per day
How many articles?
Overall: 20% OA (~ 450K OA articles out of 2.2 million total)
How we make literature-data connections
Links
• by the author - on submission, as metadata (primary databases)
• by database curators - information and links from the literature
• expensive, slow, but high quality
Text mining
• by algorithms that use terminologies (can be subject to lag)
• post publication – can find new associations
• variable quality, but high throughput
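The terminology-driven approach described above can be sketched as a simple dictionary lookup. This is a minimal illustration, not the Whatizit pipeline: the three-entry terminology, the `annotate` function and the sample sentence are all hypothetical, and a production system would use far larger vocabularies plus disambiguation.

```python
import re

# Hypothetical mini-terminology mapping a term to its semantic type.
# Real terminologies hold hundreds of thousands of entries.
TERMINOLOGY = {
    "dna repair": "GO Term",
    "escherichia coli": "Organism",
    "colon cancer": "Disease",
}


def annotate(text, terminology=TERMINOLOGY):
    """Return (start, end, matched_text, semantic_type) for every term hit."""
    hits = []
    for term, semantic_type in terminology.items():
        # Whole-word, case-insensitive match of the literal term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), match.group(0), semantic_type))
    # Sort by position so annotations follow the order of the text.
    return sorted(hits)


# Invented sample sentence, used only to exercise the function.
sample = "Defects in DNA repair are linked to colon cancer."
annotations = annotate(sample)
for start, end, matched, semantic_type in annotations:
    print(f"{start}-{end}\t{matched}\t{semantic_type}")
```

Because the lookup depends entirely on the terminology, a term coined after the vocabulary was last updated is missed until the vocabulary catches up, which is the lag noted above.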
Links from Literature to Databases
• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments …
800 K
370 K
110 K
Text Mining in UKPMC (2.2 million articles)
Semantic Type    Unique Terms    Articles     Annotations
Gene/Protein     225,905         1,288,809    15,021,502
GO Terms         32,486          1,806,539    15,016,957
Organism         178,847         1,689,251    12,322,782
Disease          170,592         1,743,212    16,201,198
Accession No.    232,950         65,640       331,329
Chemical         76,350          1,669,500    22,438,980
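One way to read the text-mining figures is to derive annotation density and article coverage per semantic type. The sketch below only does arithmetic on the numbers in the table; the names `STATS`, `TOTAL_ARTICLES` and `summarise` are illustrative, and the total of 2.2 million articles is the rounded figure quoted in the slide title.

```python
# Figures copied from the table: (unique terms, articles, annotations).
STATS = {
    "Gene/Protein": (225_905, 1_288_809, 15_021_502),
    "GO Terms": (32_486, 1_806_539, 15_016_957),
    "Organism": (178_847, 1_689_251, 12_322_782),
    "Disease": (170_592, 1_743_212, 16_201_198),
    "Accession No.": (232_950, 65_640, 331_329),
    "Chemical": (76_350, 1_669_500, 22_438_980),
}

# Rounded corpus size from the slide title (an approximation).
TOTAL_ARTICLES = 2_200_000


def summarise(stats, total_articles=TOTAL_ARTICLES):
    """Return {type: (annotations per annotated article, share of corpus)}."""
    summary = {}
    for semantic_type, (_terms, articles, annotations) in stats.items():
        density = annotations / articles
        coverage = articles / total_articles
        summary[semantic_type] = (density, coverage)
    return summary


summary = summarise(STATS)
for semantic_type, (density, coverage) in summary.items():
    print(f"{semantic_type:<14} {density:6.2f} per article  {coverage:6.1%} of corpus")
```

Read this way, accession numbers stand out: they occur in a small share of articles compared with the other semantic types.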
Case study
3.9 billion years ago
E. coli meets humans
Human colon cancer DNA repair
Protein structure in PDBe
Link to the literature from the PDBe record
Algorithms that find similar structures
Text mine full text for 1ewq
Towards understanding DNA repair mechanisms
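The step of text mining full text for the accession 1ewq can be sketched as a whole-word, case-insensitive search that returns each mention with some surrounding context. This is an assumption-laden illustration rather than the UKPMC implementation: `find_accession`, the `window` parameter and the sample text are invented for the example.

```python
import re


def find_accession(text, accession, window=30):
    """Return context snippets around each whole-word mention of an accession."""
    # Word boundaries stop matches inside longer tokens;
    # IGNORECASE covers both "1ewq" and "1EWQ".
    pattern = re.compile(r"\b" + re.escape(accession) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end])
    return snippets


# Invented full-text fragment, used only to exercise the function.
full_text = (
    "Coordinates were taken from PDB entry 1EWQ. "
    "The unrelated token x1ewqy is ignored."
)
hits = find_accession(full_text, "1ewq")
for snippet in hits:
    print(snippet)
```

Searching for a known accession keeps precision high. A generic pattern for four-character structure identifiers would also match ordinary tokens such as years, so it would need extra context checks.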
Challenges and future directions
Data-driven science
Data re-use: biology is post publication
Linking: citing papers and data (provenance and integration)
Metrics and attribution
Hard decisions about the value of keeping complete data sets
Data landscape - possibilities
• Big Data: Deposition
• Primary Research articles
• Big Data: Curated
• Annotation
• Unstructured Data
• reuse?
• Structured links
• analysis
Analysis supplied by Mimas, University of Manchester
[Chart labels: TIF, XSL, DOC, MOV, HTML, GIF, JPG]
Solutions that make sense to scientists
http://ukpmc.ac.uk