Content Mining at Wellcome Trust
![Page 1: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/1.jpg)
CONTENT-MINING IN SCIENCE
TheContentMine Progress since “Hargreaves” legislation
Opportunities for UK, and Europe
Peter Murray-Rust, 2015-04-14
Workshop sponsored by Wellcome Trust
![Page 2: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/2.jpg)
OUR TEAM
Jenny Molloy
@jenny_molloy
Ross Mounce
@rmounce
Richard Smith-Unna
@blahah404
Stephanie Smith-Unna
@treblesteph
Mark MacGillivray
@cottagelabs
Peter Murray-Rust
@petermurrayrust
Charles Oppenheim
@CharlesOppenh
Graham Steel
@McDawg
![Page 3: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/3.jpg)
OUR MISSION
“make 100,000,000 facts from the STEM literature open, accessible and reusable”
![Page 4: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/4.jpg)
WHY?
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.
Adage in public health: “The road to inaction is paved with research papers.”
Bernice Dahn is the chief medical officer of Liberia’s Ministry of Health, where Vera Mussah is the director of county health services. Cameron Nutt is the Ebola response adviser to Partners in Health.
![Page 5: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/5.jpg)
THE RIGHT TO READ IS THE RIGHT TO MINE
The Hargreaves report (UK), legislated in 2014, allows limitations and exceptions for non-commercial content mining for research.
The Hague decal
![Page 6: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/6.jpg)
THE SCALE OF THE TASK
• ~ 27,000 peer reviewed journals*
• > 5,000 publishers
• ~ 3,000 new papers per day
• “costing” 15 Billion USD to publish
• Representing 500 Billion USD of research
*Ulrich’s database: http://ulrichsweb.serialssolutions.com/login
![Page 7: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/7.jpg)
OUR WORKSHOPS
• Shuttleworth Foundation
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute (x2)
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London
• LIBER
• Cochrane UK
• British Library
• Wellcome Trust
• WHO
OUR COLLABORATORS
• Shuttleworth Foundation
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• BBSRC
• Cochrane UK
• Open Access Button
• SPARC
• Creative Commons
• CORE
• Europe PubMed Central
• Cambridge University Library
![Page 8: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/8.jpg)
STRUCTURED INFORMATION
• chemical names and structures
• species
• metabolism
• phylogenetic trees
• …
![Page 9: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/9.jpg)
INTERACTIVE DEMO of content mining
http://chemicaltagger.ch.cam.ac.uk/
![Page 10: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/10.jpg)
ContentMine at Cochrane UK, 2015-03-16
![Page 11: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/11.jpg)
CLINICAL TRIALS
How do we find (mentions of) clinical trials?
Is a document a (clinical) trial?
What is the subject of the trial?
What is the methodology used? How many/long?
Does the design and practice conform to CONSORT?
What are the outcomes?
Can we extract specific re-usable information?
Who is involved? (researchers, sponsors, patients?)
Has a proposed trial been completed and reported?
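One simple way to approach the first question, finding mentions of clinical trials, is to look for registry identifiers in the text. The sketch below is an illustration only, not the ContentMine method: it assumes that ClinicalTrials.gov identifiers take the form "NCT" plus eight digits and ISRCTN identifiers "ISRCTN" plus eight digits, and the function name `find_trial_ids` and the sample sentence are hypothetical.

```python
import re

# Assumed identifier shapes: "NCT" + 8 digits (ClinicalTrials.gov)
# and "ISRCTN" + 8 digits (ISRCTN registry).
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")


def find_trial_ids(text):
    """Return distinct trial identifiers in order of first appearance."""
    seen = []
    for match in TRIAL_ID.finditer(text):
        identifier = match.group(1)
        if identifier not in seen:
            seen.append(identifier)
    return seen


# Hypothetical sample sentence, invented for this illustration.
sample = ("The study was registered as NCT01234567 and also "
          "as ISRCTN12345678; see NCT01234567 for the protocol.")
print(find_trial_ids(sample))
```

A registry identifier only shows that a trial is mentioned; deciding whether the document itself reports a trial, or whether it conforms to CONSORT, would need more than a pattern match.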
![Page 12: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/12.jpg)
COMMUNITY PROJECTS
• Clinical Trials (with Cochrane UK)
• Phyloinformatic Literature Unlocking Tools (PLUTo/BBSRC)
• EBI – MetaboLights
• Plant Sciences and farming (Cambridge, TGAC, OpenFarm)
• Crystallography Open Database (COD)
• OpenOil / OpenCorporates
![Page 13: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/13.jpg)
METABOLIGHTS
• European Bioinformatics Institute
• database for metabolomics experiments and derived information
• cross-species, cross-technique, structures, biological roles, locations, concentrations
• http://www.ebi.ac.uk/metabolights/
![Page 14: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/14.jpg)
CONTENTMINE WORKSHOPS AND HACKDAYS
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application in a day
Start simple: bagOfWords, Stemming, Regex, templates
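As a rough illustration of the "start simple" techniques, the sketch below counts a bag of words over crudely stemmed tokens and uses a regex to pick out abbreviated binomial names. Everything in it is an assumption made for the example: the suffix list is a toy stand-in for a real stemmer, the `ABBREVIATED_BINOMIAL` pattern is deliberately naive, and the sample sentence is invented. It is not the workshop software.

```python
import re
from collections import Counter


def crude_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Count stemmed lower-case alphabetic tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


# Naive pattern for abbreviated binomials such as "E. coli" (hypothetical).
ABBREVIATED_BINOMIAL = re.compile(r"\b[A-Z]\. [a-z]{3,}\b")

# Invented sample text for the illustration.
sample = ("Mice were infected with E. coli; infecting mice alters "
          "metabolism. S. aureus was tested.")

counts = bag_of_words(sample)
species = ABBREVIATED_BINOMIAL.findall(sample)
print(counts.most_common(3))
print(species)
```

Stemming lets "infected" and "infecting" count as the same term, which is usually what a first-pass index wants; a template would then slot the extracted terms into a fixed output format.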
![Page 15: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/15.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF
CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
![Page 16: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/16.jpg)
What is “Content”?
Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this
ContentMine will soon be able to do it in 1 second
![Page 17: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/17.jpg)
Note Jaggy and broken pixels
NEW Bacteria must have a phylogenetic tree
Labels: Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate
![Page 18: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/18.jpg)
contentmine.org Infrastructure
• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)
… Open semantic science …
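The crawl, scrape, normalize, mine and catalogue stages can be pictured as a chain of small functions. The sketch below is a toy model under stated assumptions, not the actual quickscrape, Norma, AMI or CANARY code: all function names are hypothetical, the "web" is an in-memory dictionary with invented example.org pages, normalization just strips tags, and the miner is a single regex for quantities such as "5 mg".

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def scrape(url, web):
    """Stand-in for scraping: look the page up in an in-memory dict."""
    return web[url]


def normalize(raw_html):
    """Reduce raw HTML to whitespace-normalized plain text."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def mine(text):
    """Toy miner: extract quantities such as '5 mg' or '2.5 ml'."""
    return sorted(set(re.findall(r"\b\d+(?:\.\d+)? (?:mg|ml|kg)\b", text)))


def catalogue(index, url, facts):
    """Record which URLs each fact was found in."""
    for fact in facts:
        index.setdefault(fact, []).append(url)


def run_pipeline(urls, web):
    """Run every stage over a fixed list of URLs (the 'crawl' result)."""
    index = {}
    for url in urls:
        text = normalize(scrape(url, web))
        catalogue(index, url, mine(text))
    return index


# Invented pages standing in for scraped articles.
web = {
    "https://example.org/a": "<html><body><h1>Dosing</h1>"
                             "<p>Mice received 5 mg daily,   then 2.5 ml.</p>"
                             "</body></html>",
    "https://example.org/b": "<p>Controls received 5 mg.</p>",
}
index = run_pipeline(list(web), web)
print(index)
```

Keeping each stage as a separate function mirrors the idea that a miner can be swapped for another method or tool without touching the scraping or cataloguing steps.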
![Page 19: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/19.jpg)
[Diagram: CANARY pipeline. Stages: Crawl / Feed, quickscrape, Norma, AMI, Index & Transform, CAT-alogue index. Sources: scientific literature and repositories (URL, DOI). Formats: XML, DOC, CSV, sHTML, bad HTML, OCR, diagrams. Scrapers: bespoke, per-journal, XPath. Taggers: per-journal, metadata. Plugins: regex, sequences, species, chemistry, phylogenetics, farming. Output: Open NORMA-lized Scientific Literature + Facts.]
![Page 20: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/20.jpg)
POSSIBLE USES
• Indexing/searching the literature; G***** for science
• Current awareness; alerts and practices
• Extraction and re-use of facts; re-computation
• Multidisciplinary integration; co-occurrence
• Compliance with funder/institution policies
• Managing your Research Data!
• Finding similar and complementary colleagues
• Reproducibility, checking data and avoiding fraud
![Page 21: Content Mining at Wellcome Trust](https://reader035.vdocuments.us/reader035/viewer/2022062503/589df46b1a28ab1e718b4973/html5/thumbnails/21.jpg)
How to leverage Content Mining for benefit of UK/EU
• Create UK showcase of successes in mining
• Graduate training by 3rd year UK graduate students.
• Develop EuropePMC as world resource for bio-mining
• Training/support for UK/EU libraries about Hargreaves.
• Central collection of born-digital UK theses
• Collect pre-copyright author manuscripts
• Integrate CM into Research Data Management tools
• Promote mining in all aspects of healthcare information
• Open collection of extracted scientific facts for the world