Why ContentMining is useful
TRANSCRIPT
![Page 2: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/2.jpg)
Background
• ContentMine aims to make large areas of scientific fact OPEN (100 million facts/year)
• We’re working with the Wellcome Trust, Europe PubMed Central, etc.
• A politically “hot” area (Hargreaves legislation, EU activity)
• 2015 Wellcome Trust workshop on TDM and Neuroscience; “rough consensus” on what was needed.
• Day workshop at Cochrane, UK (Amy Price, Anna Noel Storr, Ben Goldacre)
• 2-day workshop at Edinburgh on Systematic Reviews of Animal Test publications
• In the last few months we’ve prototyped a unique Open starting point, continuously released.
• Now actively building communities (plants, clinical, animals, psychology, crystallography, HEPhys)
![Page 3: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/3.jpg)
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.
Adage in public health: “The road to inaction is paved with research papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
![Page 4: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/4.jpg)
[Diagram: CONTENTMINING pipeline. Labels recovered from the figure:]
• Sources: EuPMC, arXiv, CORE, HAL, university repositories, ToC services; publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• Inputs: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• Tools: getpapers (query), quickscrape (crawl), DailyCrawl; scrapers, queries, taggers
• norma: Normalizer, Structurer, Semantic Tagger (text, data, figures)
• Tagged sections: abstract, methods, references, captioned figures (Fig. 1), HTML tables
• ami: search, lookup; COMMUNITY plugins (Chem, Phylo, Trials, Crystal, Plants)
• Outputs: Semantic ScholarlyHTML, Facts, catalogue, visualization and analysis
• 30,000 pages/day
![Page 5: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/5.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
![Page 6: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/6.jpg)
What is “Content”?
![Page 7: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/7.jpg)
How “data” are published in the 21st C
![Page 8: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/8.jpg)
ContentMine Workshops and Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application in a morning
Start simple: bagOfWords, Stemming, Regex, templates
![Page 9: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/9.jpg)
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
![Page 10: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/10.jpg)
Workshops (1-hour -> full day or more)
2014 May -> Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund, AT
• OKFest, DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro, BR
• SciDataCon, Delhi, IN
• Univ of Chicago, US
• OpenCon 2014, Washington DC, US
• JISC, London
Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• Europe PubMed Central
![Page 11: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/11.jpg)
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
![Page 12: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/12.jpg)
Linked Open Data – the world’s knowledge
very little physical science and THESES?? http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
[LOD cloud diagram. Region labels: DBPedia, BIO, Comp, Lib, PDB, Ontologies, GOV, GOV.uk, Music, Art, Literature, Social, Knowledgebases, RDF triples]
![Page 13: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/13.jpg)
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter?
Indexed by CAT
![Page 14: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/14.jpg)
[Diagram: Daily Crawl. Labels recovered from the figure:]
• Sources: PLoS ONE, BMC1, BMC2, Closed1, Closed2, Hybrid
• Daily Crawl: 2000-5000 articles
• Steps: Crawl … Scrape … Normalize … Mine
• Outputs: CATalog, enhanced annotated articles, FACTS, Linked Open Data, Semantic Scientific Objects
![Page 15: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/15.jpg)
[Diagram: CANARY pipeline. Labels recovered from the figure:]
• Stage labels: Crawl, Feed, quickscrape, Norma, AMI, Index & Transform
• Inputs: scientific literature, repositories; URL, DOI; TXT, XML, DOC, CSV, BadHTML, OCR, diagrams
• Scrapers: XPath, per-journal
• Taggers: per-journal
• Plugins: Regex, Sequences, Species, Bespoke
• Other labels: Metadata, Chemistry, Phylogenetics, Farming, sHTML
• Outputs: Open NORMA-lized Scientific Literature + Facts; CAT-alogue index
![Page 16: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/16.jpg)
AMI-plugins
• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)
• Psychology * (Chris Hartgerink)
• HighEnergyPhysics (Durham)
* subcommunities
![Page 17: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/17.jpg)
Questions we can tackle
• How do we find (mentions of) clinical/animal trials?
• Is a document a trial?
• What is the subject of the trial?
• What is the methodology used?
• Does the design and practice conform to CONSORT/ARRIVE?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who is involved? (researchers, sponsors, patients?)
• Has a proposed trial been completed and reported?
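One way to start on the first question is to look for trial registry identifiers in the text. The sketch below is an illustration only, not the ContentMine clinical-trials plugin: it assumes that a ClinicalTrials.gov identifier (NCT plus 8 digits) or an ISRCTN identifier (ISRCTN plus 8 digits) signals a trial mention. The function name `find_trial_ids` and the sample identifiers are made up for the example.

```python
import re

# Assumed identifier shapes: "NCT" or "ISRCTN" followed by 8 digits.
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")

def find_trial_ids(text):
    """Return registry identifiers in order of first appearance, without duplicates."""
    seen = []
    for match in TRIAL_ID.finditer(text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

# Hypothetical sample text; the identifiers are invented for illustration.
sample = ("The study was registered as NCT01234567. "
          "A related trial (ISRCTN12345678) is cited; see also NCT01234567.")
ids = find_trial_ids(sample)
print(ids)
```

A document with no identifiers returns an empty list, so a real classifier would need further evidence (for example the bag-of-words and regex plugins) before deciding whether a document is a trial.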
![Page 18: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/18.jpg)
Text-based plugins
• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Tf–idf: term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions).
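The first two ideas in this list can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the AMI implementation; the tokenizer (lower-cased runs of letters), the unsmoothed idf formula and the three sample sentences are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / total tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each document counts once per term
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({term: (count / total) * math.log(n_docs / doc_freq[term])
                       for term, count in bag.items()})
    return scores

# Made-up sentences standing in for full-text articles.
docs = [
    "The trial randomised patients to treatment or placebo.",
    "Mice in the treatment group received the compound.",
    "The phylogenetic tree groups the species by sequence.",
]
bags = [bag_of_words(d) for d in docs]
scores = tf_idf(docs)
```

A word that occurs in every document (here "the") scores zero, while a word confined to one document (here "placebo") scores highest, which is why tf–idf is a common first filter for terms that characterize a paper.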
![Page 19: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/19.jpg)
“Bag of Words”
Three fulltext articles from trialsjournal.com
![Page 20: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/20.jpg)
Regular Expressions for Systematic Reviews of Animal Tests
Preceding Text
Following Text
Extracted term
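The three columns on this slide (preceding text, extracted term, following text) can be reproduced with a short regex sketch. The pattern below is a hypothetical example of terms a reviewer of animal tests might look for; it is not the regex set used in the workshop, and the function name and context width are assumptions.

```python
import re

# Hypothetical search terms, for illustration only.
PATTERN = re.compile(r"\b(?:random(?:is|iz)ed|blinded|mice|rats)\b", re.IGNORECASE)

def extract_with_context(text, pattern=PATTERN, width=30):
    """Return (preceding text, extracted term, following text) for each match."""
    rows = []
    for m in pattern.finditer(text):
        pre = text[max(0, m.start() - width):m.start()]
        post = text[m.end():m.end() + width]
        rows.append((pre, m.group(0), post))
    return rows

# Made-up sentence standing in for a methods section.
text = "Male mice were randomized to two groups and assessed by a blinded observer."
rows = extract_with_context(text, width=10)
for pre, term, post in rows:
    print(repr(pre), repr(term), repr(post))
```

Keeping the surrounding text lets a human reviewer check each hit quickly without opening the paper.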
![Page 21: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/21.jpg)
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
![Page 22: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/22.jpg)
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have mined 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.
![Page 23: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/23.jpg)
![Page 24: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/24.jpg)
[Plot, text recovered by OCR: y-axis “Ln Bacterial load per fly” (ticks 6.0 to 11.5); x-axis “Days post-infection” (ticks 0 to 5)]
Bitmap Image and Tesseract OCR
![Page 25: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/25.jpg)
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
![Page 26: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/26.jpg)
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning
Topology
Serialization
Newick
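The last step, serialization to Newick, can be sketched as follows, assuming the tree topology has already been recovered from the image. The nested-tuple representation, the function name `to_newick` and the leaf labels are illustrative assumptions, not AMI's internal format, and branch lengths are left out.

```python
def to_newick(node):
    """Serialize a tree to a Newick fragment.

    A leaf is a string label; an internal node is a tuple or list of children.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"

# Hypothetical topology with five leaves.
tree = (("A", "B"), ("C", ("D", "E")))
newick = to_newick(tree) + ";"  # a Newick tree ends with a semicolon
print(newick)  # ((A,B),(C,(D,E)));
```

The resulting string can be loaded by standard phylogenetics tools, which is what makes a tree extracted from a bitmap re-usable.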
![Page 27: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/27.jpg)
Phytochemistry extraction
O. dayi
“volatile composition of”
A. sieberi
A. judaica
Displayed by CAT (CottageLabs)
![Page 28: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/28.jpg)
What we can do
• Recognize and promote autonomous sub-communities
• Engage Early Career Researchers, including undergraduates, and let THEM BUILD the systems.
• COMMUNALLY build tools for data checking
• Insist on semantic data input, even if it costs submissions
![Page 29: Why ContentMining is useful](https://reader035.vdocuments.us/reader035/viewer/2022070509/58a54e5b1a28abef2c8b4be7/html5/thumbnails/29.jpg)
contentmine.org team