Can machines “understand” the scientific literature?
Peter Murray-Rust, Reader Emeritus, Dept of Chemistry, Univ Cambridge
and Founder TheContentMine
Trinity College Science Society, Cambridge UK, 2017-02-21
contentmine.org is supported by a grant to PMR as a
(2x digital music industry!)
ContentMine is an OpenLocked Non-Profit company
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
AMI! Tell me what YOU know about moxonidine?
Wikipedia
Wikidata for moxonidine
Entity extraction
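A minimal sketch of dictionary-based entity extraction, assuming a small hand-made term list. The dictionary and the function name `extract_entities` are hypothetical illustrations, not AMI's own code; a real dictionary might be built from Wikidata labels.

```python
import re

# Hypothetical mini-dictionary mapping a term to an entity type.
TERM_DICTIONARY = {
    "moxonidine": "drug",
    "clonidine": "drug",
    "hypertension": "disease",
}

def extract_entities(text, dictionary=TERM_DICTIONARY):
    """Return (term, type, start_offset) for every dictionary term found in text."""
    hits = []
    for term, kind in dictionary.items():
        # Whole-word, case-insensitive match of the dictionary term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((term, kind, match.start()))
    # Report hits in reading order.
    return sorted(hits, key=lambda hit: hit[2])

sentence = "Moxonidine is used to treat hypertension."
entities = extract_entities(sentence)
```

Each hit keeps its character offset, so an extracted fact can be traced back to the sentence it came from.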
OPSIN says this name is wrong! OSIRIS will interpret this structure, including the annotation
Reaction Schemes
Tables
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)
6 ContentMine Fellows for 6 months
Neo Christopher Chung
Warsaw, Computational Biology
Wants: find out geographic and temporal differences in the use of genomic software tools
Paola Masuzzo
Ghent, Computational Omics and Systems Biology
Wants: mine literature around cell migration and invasion to 1) create a collection of minimum requirements, 2) check for nomenclature consistency and 3) construct a knowledge map
Alexandra Bannach-Brown
Edinburgh, Neuroscience
Problem: there is a huge body of work in animal studies of depression; systematic review is the main approach for gaining insight.
Wants: identify papers for a systematic review of depressive behaviour in animals: what drugs, what methods, what outcomes and signs/phenotypes. Use outcomes for document clustering.
and expedite scientific advances."
Corpus: 70,000 papers
Alexandre Hannud Abdo
“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.”
From Brazil, Life Sciences; works on a project about the evolution of oncology
Wants: extract facts from cancer research conference papers and global health papers
OPEN NOTEBOOK RESEARCH
Alexandre Hannud Abdo
“I am extremely happy to join this first cohort of ContentMine Fellows. I participated in a ContentMine workshop in 2014 and have been following the progress of the project ever since, looking for an opportunity to collaborate which now materializes.”
Problem: get text and metadata out of old conference proceedings and measure the evolution of ideas and practice using entity analysis, especially trends.
Wants: extract facts from cancer research conference papers and global health papers. Extract topics (innovations, developments) and compare the two types of publications. Find out which facts from conferences are later published in articles.
Has some issues with software
Guanyang Zhang
Biology, Arizona
“My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records.”
“Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals.”
“Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access.”
Weevil-plant association network graph made with Google Fusion Tables. Each blue circle is a weevil tribe and each yellow circle a plant genus. The size of a circle represents the number of associations.
Lars Willighagen
15 years old, NL
Wants: extract data about conifers (relations to chemicals, height etc.)
Outcome: database with webpage containing conifer properties
Table Facts Visualiser DEMO, Card DEMO, Word Cloud
“I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.”
Multisegment diagram
Whitespace “corridors”
Superpixel
Bounding box
Semantic labels
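One way to find whitespace "corridors" is to look for runs of columns that contain no ink at all; the same idea applies to rows. The sketch below assumes the page is already a 0/1 grid (1 = ink). The function name and the `min_width` parameter are hypothetical, and this is not necessarily how AMI segments a diagram.

```python
def whitespace_corridors(grid, min_width=2):
    """Return inclusive (start, end) column ranges in which every row is blank (0)."""
    n_cols = len(grid[0])
    blank = [all(row[c] == 0 for row in grid) for c in range(n_cols)]
    corridors, start = [], None
    # The trailing False is a sentinel that closes a run reaching the last column.
    for c, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = c
        elif not is_blank and start is not None:
            if c - start >= min_width:
                corridors.append((start, c - 1))
            start = None
    return corridors

page = [
    [1, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 0],
]
corridors = whitespace_corridors(page)
```

Splitting the page at each corridor gives candidate segments, each with its own bounding box.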
Chemical Computer Vision
Raw mobile photo; problems: shadows, contrast, noise, skew, clipping
Binarization (pixels = 0,1)
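A minimal sketch of binarization with a single global threshold, assuming a greyscale image held as nested lists of 0-255 values. The mean-value threshold is an assumption chosen for brevity; the slides do not say which thresholding method AMI uses.

```python
def binarize(gray, threshold=None):
    """Map grey values (0-255) to 1 (ink, darker than threshold) or 0 (background)."""
    if threshold is None:
        flat = [value for row in gray for value in row]
        threshold = sum(flat) / len(flat)  # global mean as a crude default
    return [[1 if value < threshold else 0 for value in row] for row in gray]

photo = [
    [250, 240, 30, 245],
    [235, 20, 25, 230],
    [240, 238, 35, 120],  # 120 is a shadowed background pixel
]
binary = binarize(photo)
```

With the mean threshold the shadowed pixel (120) is classified as ink, which is one way shadows lead to irregular edges; an explicit lower threshold such as `binarize(photo, threshold=100)` avoids it in this toy case.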
Irregular edges
Posterisation
Extracted since unique posterized colour
Note jaggy and broken pixels
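Posterisation reduces each colour channel to a few levels, so that an object drawn in one colour ends up with a single posterised colour and can be pulled out as a mask. The sketch assumes RGB tuples in nested lists; the function names and the choice of four levels are hypothetical.

```python
from collections import Counter

def posterize(pixels, levels=4):
    """Quantise each 0-255 RGB channel down to `levels` evenly spaced values."""
    step = 256 // levels
    return [[tuple((c // step) * step for c in px) for px in row] for row in pixels]

def extract_colour(poster, colour):
    """Binary mask of the pixels that have exactly the given posterised colour."""
    return [[1 if px == colour else 0 for px in row] for row in poster]

image = [
    [(250, 10, 12), (245, 5, 20), (10, 12, 240)],
    [(252, 8, 15), (8, 20, 250), (12, 9, 245)],
]
poster = posterize(image)
# Count how many pixels fall on each posterised colour.
counts = Counter(px for row in poster for px in row)
red_mask = extract_colour(poster, (192, 0, 0))
```

The mask is pixel-exact, so edges stay jaggy and thin strokes can break; any smoothing has to happen in a later step.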
NEW Bacteria must have a phylogenetic tree
Length, Weight, Binomial Name, Culture/Strain, GENBANK ID
Evolution Rate
Supertree for 924 species
Tree
UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!! 2000+ points
VECTOR PDF
Dumb PDF
CSV
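Getting from a vector plot to CSV comes down to calibrating each axis from two labelled ticks and then mapping every point's page coordinates to data values. The sketch assumes linear axes; the tick positions, point coordinates and the name `make_axis_map` are made up for illustration.

```python
import csv
import io

def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Build a linear map from page coordinate to data value using two labelled ticks."""
    def to_value(pixel):
        return value_a + (pixel - pixel_a) * (value_b - value_a) / (pixel_b - pixel_a)
    return to_value

# Hypothetical ticks: x = 0 at page x 100, x = 10 at page x 500.
x_map = make_axis_map(100.0, 0.0, 500.0, 10.0)
# Page y grows downward here, so y = 0 sits at page y 400 and y = 50 at page y 200.
y_map = make_axis_map(400.0, 0.0, 200.0, 50.0)

page_points = [(100.0, 400.0), (300.0, 300.0), (500.0, 200.0)]
data_points = [(x_map(px), y_map(py)) for px, py in page_points]

# Write the recovered data values as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["x", "y"])
writer.writerows(data_points)
csv_text = buffer.getvalue()
```

The units and quantity names read from the axis titles would become the column headers in a fuller version.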
Semantic Spectrum
2nd Derivative
Smoothing Gaussian Filter
Automatic extraction
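The smoothing and second-derivative steps can be sketched as follows: smooth the trace with a Gaussian kernel, take a finite-difference second derivative, and report the points where it has a sufficiently deep negative minimum. The spectrum here is synthetic, and the parameter values (`sigma`, `radius`, `min_depth`) are assumptions, not AMI's settings.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma=1.0, radius=3):
    """Convolve with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def second_derivative(signal):
    """Central finite difference; the two endpoints are left at 0."""
    d2 = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        d2[i] = signal[i - 1] - 2 * signal[i] + signal[i + 1]
    return d2

def find_peaks(signal, sigma=1.0, min_depth=0.01):
    """Indices where the smoothed second derivative has a local minimum below -min_depth."""
    d2 = second_derivative(smooth(signal, sigma))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < -min_depth and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]]

# Synthetic spectrum: two Gaussian peaks centred at x = 20 and x = 50.
spectrum = [math.exp(-((x - 20) ** 2) / 18) + 0.6 * math.exp(-((x - 50) ** 2) / 18)
            for x in range(80)]
peaks = find_peaks(spectrum)
```

Smoothing first matters because differentiation amplifies noise; `min_depth` then discards the tiny wiggles that remain in the flat regions.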
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
After AMI2 processing…
… AMI2 has detected a square
Search on publicly accessible papers on “Zika”
https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html
“… simulated by 21cmFAST is in principle independent”
“it is a feature of the 21cmFAST code, and is explained in §3.1.”
SciCodes[1]: Searching for software in arXiv[2]
[1] Proposal to LJ Arnold Foundation (Alice Allen ASCL and PMR)
Using the semi-numerical simulation, 21cmFAST,
[2] arxiv.org: the physics/maths/astronomy... preprint server
The language identifies the software!
arXiv has >500 mentions of “21cmFAST”
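The three quoted phrases suggest context patterns that flag a software name without knowing it in advance. Below is a minimal sketch using regular expressions built from exactly those phrasings; the pattern list and the function name are illustrative assumptions, not the SciCodes method.

```python
import re
from collections import Counter

# A candidate name: starts with a letter or digit, then word characters, '.', '+' or '-'.
NAME = r"(?P<name>[A-Za-z0-9][\w.+-]*)"

# Context patterns taken from the phrasings quoted on this slide.
CONTEXT_PATTERNS = [
    re.compile(r"simulated by " + NAME),
    re.compile(r"the " + NAME + r" code\b"),
    re.compile(r"simulation, " + NAME + r","),
]

def find_software_mentions(text):
    """Return the candidate software names found by any context pattern."""
    names = []
    for pattern in CONTEXT_PATTERNS:
        for match in pattern.finditer(text):
            # Drop a sentence-final full stop captured with the name.
            names.append(match.group("name").rstrip("."))
    return names

snippets = [
    "simulated by 21cmFAST is in principle independent",
    "it is a feature of the 21cmFAST code, and is explained in §3.1.",
    "Using the semi-numerical simulation, 21cmFAST,",
]
mentions = Counter(name for s in snippets for name in find_software_mentions(s))
```

Counting names across a corpus then separates frequently mentioned codes from one-off false positives.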
Questions and comments
Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation
PM-R has offered to mentor an MSc project this summer for anyone interested.
contentmine.org