![Page 1: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/1.jpg)
Content-Mining for Machines and Humans
Peter Murray-Rust
contentmine.org
Wellcome Trust, London, 2015-03-06
• Extract 100 million facts (CC0) from the scientific literature per year
• Grow communities and give everyone the tools and know-how to mine science
![Page 2: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/2.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
![Page 3: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/3.jpg)
Machine-Human symbioses
• Wikipedia
• OpenStreetMap
We aim to make it trivial for a human working with a machine to mine the scientific literature, by building communities.
![Page 4: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/4.jpg)
Workshops and Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application in a morning
Start simple: bagOfWords, Stemming, Regex, templates
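The "start simple" techniques above can be sketched in a few lines of Python. This is only an illustration, not the AMI implementation: `tokenize`, `crude_stem`, `bag_of_words` and `DOI_PATTERN` are hypothetical names, and the suffix-stripping stemmer is a toy stand-in for a real stemming algorithm.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a few common English suffixes (toy stemmer, assumption only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Count stemmed tokens, ignoring word order."""
    return Counter(crude_stem(token) for token in tokenize(text))


# Regex "template" for one kind of fact: a DOI-like identifier.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

sample = (
    "Mining papers: we mined 10.1371/journal.pone.0111303 "
    "and are mining more papers."
)
bag = bag_of_words(sample)
dois = DOI_PATTERN.findall(sample)
print(bag.most_common(2))
print(dois)
```

Bag-of-words and stemming give a quick profile of what a document is about; the regex picks out individual facts with a known shape.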
![Page 5: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/5.jpg)
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
![Page 6: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/6.jpg)
• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize pages to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in a searchable index
• Automate the daily process (CANARY)
… Open semantic science …
contentmine.org Infrastructure
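The stages above compose naturally into a chain. The following is a minimal sketch of that idea only; every function name, the in-memory `PAGES` store and the example URLs are assumptions made for illustration and are not the contentmine.org code.

```python
import re


def crawl(seed_urls):
    """CRAWL: turn a list of URLs into document records."""
    return [{"url": url} for url in seed_urls]


def scrape(doc, fetch):
    """SCRAPE: attach the raw page content, using a caller-supplied fetcher."""
    return {**doc, "raw": fetch(doc["url"])}


def normalize(doc):
    """NORMA-lize (toy version): drop tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", doc["raw"])
    return {**doc, "text": " ".join(text.split())}


def mine(doc, pattern):
    """MINE: return (fact, source URL) pairs for every regex match."""
    return [(m.group(0), doc["url"]) for m in pattern.finditer(doc["text"])]


def catalogue(facts):
    """CAT-alogue: index each fact by the documents it was found in."""
    index = {}
    for fact, url in facts:
        index.setdefault(fact, []).append(url)
    return index


def run_pipeline(seed_urls, fetch, pattern):
    facts = []
    for doc in crawl(seed_urls):
        facts.extend(mine(normalize(scrape(doc, fetch)), pattern))
    return catalogue(facts)


# Hypothetical offline "web" so the sketch runs without network access.
PAGES = {
    "http://example.org/a": "<p>Clostridium   butyricum grows here.</p>",
    "http://example.org/b": "<p>Both Clostridium butyricum and Clostridium\n difficile were cultured.</p>",
}
index = run_pipeline(sorted(PAGES), PAGES.__getitem__, re.compile(r"Clostridium [a-z]+"))
print(index)
```

Passing the fetcher in as a parameter keeps the stages testable offline; a daily automated run would swap in a real HTTP fetcher.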
![Page 7: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/7.jpg)
[Diagram: the CANARY pipeline. Stages: Crawl / Feed, quickscrape, Norma (Transform), AMI, Index.
Sources: scientific literature and repositories, found by URL or DOI.
Formats: XML, DOC, CSV, bad HTML, OCR, diagrams, sHTML.
Scrapers: XPath, per-journal. Taggers: per-journal.
Plugins: regex, sequences, species, metadata, chemistry, phylogenetics, farming, bespoke.
Result: Open NORMA-lized Scientific Literature + Facts, in the CAT-alogue index.]
![Page 8: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/8.jpg)
https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg
CRAWLing the Literature
NO Central Table of Contents
Massive technical, political, legal opposition
Little interest from Academia
Tedious
Few general tools
![Page 9: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/9.jpg)
The Right to Read is The Right To Mine
PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/
![Page 10: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/10.jpg)
SCRAPE
https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain
HTML
XML quickscrape*
*Scrapers created by Richard Smith-Unna + Community
HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …
Non-standard per-publisher site
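Because every publisher site is laid out differently, a scraper is usually a small per-site table of selectors rather than one general program. The sketch below shows that idea with Python's standard library; it is not the quickscrape format or code, and the `SCRAPERS` table, the `publisher-a.example` and `publisher-b.example` hosts, and the sample page are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One entry per publisher: a URL pattern plus (XPath, attribute) per field.
# An attribute of None means "take the element text".
SCRAPERS = [
    {
        "url_pattern": r"^https?://publisher-a\.example/",
        "fields": {
            "title": (".//meta[@name='citation_title']", "content"),
            "doi": (".//meta[@name='citation_doi']", "content"),
        },
    },
    {
        "url_pattern": r"^https?://publisher-b\.example/",
        "fields": {"title": (".//h1[@class='article-title']", None)},
    },
]


def scrape(url, xhtml):
    """Apply the first scraper whose URL pattern matches the page URL."""
    root = ET.fromstring(xhtml)  # assumes the page is well-formed XHTML
    for scraper in SCRAPERS:
        if re.search(scraper["url_pattern"], url):
            result = {}
            for field, (xpath, attribute) in scraper["fields"].items():
                node = root.find(xpath)
                if node is None:
                    continue
                if attribute:
                    result[field] = node.get(attribute)
                else:
                    result[field] = (node.text or "").strip()
            return result
    return {}


page = (
    "<html><head>"
    '<meta name="citation_title" content="A study of something"/>'
    '<meta name="citation_doi" content="10.9999/example.1"/>'
    "</head><body/></html>"
)
record = scrape("https://publisher-a.example/article/1", page)
print(record)
```

Adding support for a new publisher then means adding one table entry, which is the kind of task a community can share.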
![Page 11: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/11.jpg)
https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain
NORMA-lization of Scientific Literature
PDFs, broken HTML, PNGs for math, etc.
NORMA
Unicode
Diacritics
Well-formed
Sectioned
Tagged
SVG diagrams
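Two of the normalization targets above, consistent Unicode for diacritics and well-formed markup, can be checked with Python's standard library. This is a small sketch under the assumption that NFC is the desired Unicode form; `normalize_text` and `is_well_formed` are hypothetical helper names, not part of NORMA.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_text(text):
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


decomposed = "Mu\u0308ller"  # 'u' followed by a combining diaeresis
composed = normalize_text(decomposed)
print(len(decomposed), len(composed))

print(is_well_formed("<p>line one<br>line two</p>"))   # unclosed <br>
print(is_well_formed("<p>line one<br/>line two</p>"))  # closed <br/>
```

After normalization, two spellings of the same name compare equal, and downstream tools can rely on an XML parser instead of tolerant HTML handling.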
![Page 12: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/12.jpg)
AMI-plugins
• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)
* subcommunities
![Page 13: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/13.jpg)
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
![Page 14: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/14.jpg)
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have processed 500,000 patents. There are more than 3,000,000 reactions per year. Added value: more than 1 billion EUR.
![Page 15: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/15.jpg)
[Figure: a plot in a VECTOR PDF, with its extractable parts labelled: UNITS, TICKS, QUANTITY, SCALE, TITLES, and DATA!! (2000+ points).]
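Once the ticks and scale of a plot have been read from a vector PDF, each plotted point can be converted from page coordinates to data values by linear interpolation between two ticks. The sketch below assumes linear axes; `axis_mapper` and all the coordinates are made up for illustration.

```python
def axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Build a function mapping a page coordinate to a data value,
    given two ticks (page position, labelled value) on a linear axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical ticks: x axis runs 0..200 between page x = 100 and 500;
# y axis runs 0..1.5 between page y = 400 and 100 (page y grows downwards).
to_x = axis_mapper(100.0, 0.0, 500.0, 200.0)
to_y = axis_mapper(400.0, 0.0, 100.0, 1.5)

page_points = [(300.0, 250.0), (500.0, 100.0)]
data_points = [(to_x(px), to_y(py)) for px, py in page_points]
print(data_points)
```

A logarithmic axis would need the same mapping applied to the logarithm of the tick values.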
![Page 16: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/16.jpg)
[Figure: automatic extraction of a spectrum from a dumb PDF. Labels: CSV, semantic spectrum, Gaussian smoothing filter, 2nd derivative.]
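Gaussian smoothing followed by a second derivative is a common way to locate peaks in an extracted spectrum: a peak shows up as a strongly negative second derivative. The following pure-Python sketch illustrates the technique on synthetic data; the function names, the kernel parameters and the test spectrum are assumptions, not the processing used in the slides.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Convolve with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values, step=1.0):
    """Central finite difference; result[k] belongs to input index k + 1."""
    return [(values[i - 1] - 2.0 * values[i] + values[i + 1]) / (step * step)
            for i in range(1, len(values) - 1)]


# Synthetic spectrum: a single peak centred on index 20.
spectrum = [math.exp(-((i - 20) ** 2) / 18.0) for i in range(41)]
d2 = second_derivative(smooth(spectrum))
peak_index = d2.index(min(d2)) + 1  # shift back to spectrum indexing
print(peak_index)
```

Smoothing first matters on real data, because differentiating twice amplifies noise.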
![Page 17: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/17.jpg)
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
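The last step, serializing a recovered tree topology to Newick, is a short recursion. The sketch below assumes the tree is already available as nested dictionaries with hypothetical keys `name`, `length` and `children`; the image thinning and topology recovery are not shown.

```python
def to_newick(node):
    """Serialize a nested-dict tree node to a Newick fragment (no trailing ';')."""
    children = node.get("children", [])
    text = ""
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    text += node.get("name", "")
    length = node.get("length")
    if length is not None:
        text += ":" + format(length, "g")
    return text


# Hypothetical tree: two leaves and one internal node with two more leaves.
tree = {
    "children": [
        {"name": "A", "length": 0.1},
        {"name": "B", "length": 0.2},
        {
            "length": 0.05,
            "children": [
                {"name": "C", "length": 0.3},
                {"name": "D", "length": 0.4},
            ],
        },
    ]
}
newick = to_newick(tree) + ";"
print(newick)  # (A:0.1,B:0.2,(C:0.3,D:0.4):0.05);
```

Names containing spaces, commas or parentheses would need quoting, which this sketch does not handle.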
![Page 18: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/18.jpg)
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter?
Indexed by CAT
![Page 19: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/19.jpg)
Phytochemistry extraction
O. dayi
“volatile composition of”
A. sieberi
A. judaica
Displayed by CAT (CottageLabs)
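Facts of this kind can be found with a phrase-anchored regular expression: look for the trigger phrase "volatile composition of" and capture the species name that follows. This is a hedged sketch, not the AMI phytochemistry plugin; the pattern and the sample sentence are invented for illustration.

```python
import re

# Trigger phrase, then a genus (full name or abbreviated "A."), then a species epithet.
SPECIES_AFTER_PHRASE = re.compile(
    r"volatile composition of\s+([A-Z][a-z]+|[A-Z]\.)\s*([a-z]+)"
)


def extract_species(text):
    """Return 'Genus epithet' strings that follow the trigger phrase."""
    return [f"{genus} {epithet}"
            for genus, epithet in SPECIES_AFTER_PHRASE.findall(text)]


sample = (
    "We report the volatile composition of O. dayi and compare it "
    "with the volatile composition of A. judaica."
)
species = extract_species(sample)
print(species)
```

Expanding abbreviated genus names would need a lookup against earlier mentions in the same paper or against a species dictionary.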
![Page 20: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/20.jpg)
Workshops (1-hour -> full day or more)
2014, May to November
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London
Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
![Page 21: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/21.jpg)
contentmine.org proposed Services
• Workshops
• Repository indexing
• Funder Compliance
• Publication enhancement
• Extraction of scientific data
![Page 22: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/22.jpg)
contentmine.org team
![Page 23: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/23.jpg)
![Page 24: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/24.jpg)
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy
Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type Culture Collection
![Page 25: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/25.jpg)
https://en.wikipedia.org/wiki/Track_gauge#mediaviewer/File:IndianGauges.JPG CC-BY
![Page 26: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/26.jpg)
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific literature
Science Plugins
Science Volunteers
Collaboration with Open Access Button
![Page 27: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/27.jpg)
Daily stream of 300,000 facts
https://commons.wikimedia.org/wiki/File:Rapid_stream.jpg Public Domain
![Page 28: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/28.jpg)
https://en.wikipedia.org/wiki/The_Cat_and_the_Canary_%281927_film%29#mediaviewer/File:Thecatandthecanary-windowcard-1927.jpg Public domain
CAT and CANARY
![Page 29: Content Mining for Machines and Humans](https://reader034.vdocuments.us/reader034/viewer/2022051708/58814d7b1a28abb0508b5321/html5/thumbnails/29.jpg)
AMI Demo
http://www.mdpi.com/2218-1989/2/1/39/pdf
https://bitbucket.org/AndyHowlett/ami2-poc
ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor
May take time to start if not connected to web
Output: ./target/output/reactionsexample/
SVG: ./page1annotated.svg
CML: image.g.1.4.svg.reaction0.cml
AvogadroViewer: