content mining of science in europe
![Page 1: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/1.jpg)
Content Mining of Science in Europe
Peter Murray-Rust, ContentMine.org, University of Cambridge & Open Forum Europe
OFA, Brussels, BE 2015-10-22
What is mining?
Why is it useful?
How YOU can do it without using publishers’ APIs
Copyright and restrictive practices are still a major problem
![Page 2: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/2.jpg)
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
![Page 3: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/3.jpg)
My European Heroes
Young People (ContentMine)
NEELIE KROES
![Page 4: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/4.jpg)
Use Cases of ContentMining
• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from 4500 papers
![Page 5: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/5.jpg)
Polly has 20 seconds to read this paper…
…and 10,000 more
![Page 6: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/6.jpg)
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
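The workload Polly describes can be checked with a little arithmetic. The sketch below uses only the figures quoted on these slides (20 seconds per paper, 10,000 abstracts, 6 researchers, 2-3 days each); the variable names are illustrative assumptions, not part of any ContentMine tool.

```python
# Figures quoted on the slides; variable names are illustrative.
ABSTRACTS = 10_000
RESEARCHERS = 6
SECONDS_PER_PAPER = 20         # "Polly has 20 seconds to read this paper"
DAYS_PER_RESEARCHER = (2, 3)   # "about 2-3 days of work"

# Share of the pile that lands on each researcher.
per_researcher = ABSTRACTS / RESEARCHERS

# Pure reading time at 20 seconds per abstract, in hours.
reading_hours = ABSTRACTS * SECONDS_PER_PAPER / 3600

# Total effort in person-days, lower and upper bound.
person_days_min = RESEARCHERS * DAYS_PER_RESEARCHER[0]
person_days_max = RESEARCHERS * DAYS_PER_RESEARCHER[1]

print(f"abstracts per researcher: {per_researcher:.0f}")
print(f"reading time at 20 s each: {reading_hours:.1f} hours")
print(f"person-days: {person_days_min} to {person_days_max}")
```

The 12 days in the quote corresponds to the lower bound (6 researchers at 2 days each), and an even split gives roughly 1,667 abstracts per researcher, in line with the "~1,600" quoted.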
![Page 7: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/7.jpg)
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in last 6 years??
Search the whole scientific literature
For “2009-0100068-41”
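Searching for a registry identifier amounts to an exact-phrase search. The sketch below is a minimal illustration, assuming the public Europe PMC REST search endpoint and its `query`, `format` and `pageSize` parameters (check the current Europe PMC documentation before relying on them). The local corpus and the function names are hypothetical, and the block only builds the URL rather than calling the network.

```python
from urllib.parse import urlencode

# Assumed endpoint: Europe PMC's public REST search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(identifier: str, page_size: int = 25) -> str:
    """Build a phrase-search URL for an exact identifier string."""
    params = {
        "query": f'"{identifier}"',  # double quotes request an exact phrase
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def find_mentions(papers: dict[str, str], identifier: str) -> list[str]:
    """Return ids of locally held papers whose text contains the identifier."""
    return sorted(pid for pid, text in papers.items() if identifier in text)


# Hypothetical local corpus, for illustration only.
papers = {
    "paper-A": "Registered as 2009-0100068-41 in a government registry.",
    "paper-B": "No registry number is reported in this article.",
}

trial_id = "2009-0100068-41"  # the identifier shown on the slide
url = build_search_url(trial_id)
hits = find_mentions(papers, trial_id)

print(url)
print(hits)
```

Matching on the literal string keeps the mapping from trial to paper unambiguous; fuzzier matching would need per-registry identifier patterns.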
![Page 8: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/8.jpg)
ContentMine-ing strategy
• Discover. Crawl the COMPLETE relevant literature. => bibliography
• Scrape (download). ALL papers
• Index papers => Facts
• Search/analyze papers => complex science
• Extract, Annotate, Aggregate (“Transformative”)
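The first four steps of this strategy can be sketched as a toy pipeline. This is an illustration only, not ContentMine's implementation: the in-memory `LITERATURE` dictionary, the function names, and the use of a plain word index in place of real fact extraction are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the literature: paper id -> full text.
LITERATURE = {
    "p1": "Obesity prevalence in adults was measured in Cambridge.",
    "p2": "A phylogenetic tree of Mycobacterium tuberculosis strains.",
    "p3": "Obesity and diet: a clinical trial report.",
}


def discover(query: str) -> list[str]:
    """Discover: return a bibliography (paper ids) matching the query."""
    q = query.lower()
    return sorted(pid for pid, text in LITERATURE.items() if q in text.lower())


def scrape(bibliography: list[str]) -> dict[str, str]:
    """Scrape: fetch every paper in the bibliography (here, a dict lookup)."""
    return {pid: LITERATURE[pid] for pid in bibliography}


def index(papers: dict[str, str]) -> dict[str, set[str]]:
    """Index: map each lower-cased word to the papers that contain it."""
    inverted: dict[str, set[str]] = defaultdict(set)
    for pid, text in papers.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            inverted[word].add(pid)
    return inverted


def search(inverted: dict[str, set[str]], *terms: str) -> list[str]:
    """Search: return papers that contain all of the given terms."""
    sets = [inverted.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


bibliography = discover("obesity")
papers = scrape(bibliography)
inverted = index(papers)
result = search(inverted, "obesity", "trial")
print(bibliography, result)
```

Keeping each stage as a separate function mirrors the slide's separation of discovery, download, indexing and search, so that any one stage can be swapped for a real tool.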
![Page 9: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/9.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
![Page 10: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/10.jpg)
[Figure: ContentMine pipeline diagram, “CONTENTMINE Complete OPEN Platform for Mining Scientific Literature”. Labels recovered from the figure:]
• catalogue / query: getpapers, DailyCrawl, ToC services
• Sources: EuPMC, arXiv, CORE, HAL, (univ repos)
• Publisher Sites: PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …
• crawl: quickscrape (URLs, DOIs)
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• norma: Normalizer, Structurer, Semantic Tagger
• Content types: Text, Data, Figures
• Semantic ScholarlyHTML: abstract, methods, references, captioned figures (Fig. 1), HTML tables
• ami: search, Lookup; plugins: Chem, Phylo, Trials, Crystal, Plants
• COMMUNITY: plugins, scrapers, queries, taggers
• Visualization and Analysis
• Facts
• 30,000 pages/day
![Page 11: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/11.jpg)
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
![Page 12: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/12.jpg)
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
![Page 13: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/13.jpg)
Facts in context
daily IUCN endangered species news
en.wikipedia.org CC BY-SA
![Page 14: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/14.jpg)
ContentMine Fact of The Day
• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles
![Page 15: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/15.jpg)
https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
![Page 16: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/16.jpg)
“Root” 4500 papers each with 1 tree
![Page 17: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/17.jpg)
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
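The output above is a tree in Newick notation, which is why it is computable. As a minimal sketch, the code below reads the taxon names and branch lengths from one complete subtree copied from the slide (the full string contains an elision, so it is not used here). The regular expression and function names are assumptions for illustration, not part of Norma or ami.

```python
import re

# A complete subtree copied from the slide's Newick string, closed with ";".
NEWICK = (
    "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
    "Synergistes_jonesii:301):131;"
)

# A named leaf looks like "Genus_species:length"; internal nodes have no name.
LEAF = re.compile(r"([A-Za-z_]+):(\d+)")


def leaves(newick: str) -> dict[str, int]:
    """Return {taxon name: branch length} for every named leaf."""
    return {name: int(length) for name, length in LEAF.findall(newick)}


def is_balanced(newick: str) -> bool:
    """Check that the parentheses in the string nest correctly."""
    depth = 0
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


taxa = leaves(NEWICK)
for name, length in taxa.items():
    print(f"{name}: {length}")
print("balanced:", is_balanced(NEWICK))
```

A balance check like this is a cheap way to flag trees whose image analysis or OCR went wrong before they are merged into a supertree.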
![Page 18: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/18.jpg)
Supertree for 924 species
Tree
![Page 19: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/19.jpg)
Supertree created from 4300 papers
![Page 20: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/20.jpg)
Copyright and Mining
• PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright.
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data analytics”
– legitimizes copying (?to disk), but not publishing
*teaching, textbooks, etc. may be “commercial”
![Page 21: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/21.jpg)
Publishing and ICT
Trust these as much as you trust these
Elsevier Microsoft
Mendeley (Elsevier) Facebook
Digital Science/Macmillan Apple
Wiley etc.
Etc.
![Page 22: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/22.jpg)
STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC Rightfind)
• Technical obstruction (Wiley Captcha, Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)
[1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.
![Page 23: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/23.jpg)
WILEY … “new security feature… to prevent systematic download of content”
“[limit of] 100 papers per day”
“essential security feature … to protect both parties (sic)”
CAPTCHA
User has to type words
![Page 24: Content Mining of Science in Europe](https://reader033.vdocuments.us/reader033/viewer/2022042907/58814da61a28abb0508b53c7/html5/thumbnails/24.jpg)
ContentMine working with Libraries
• Cambridge: Library, Plant Sciences, Epidemiology, Chemistry
• Cochrane Collaboration on Systematic Reviews of Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training