why contentmining is useful

The Right to Read is the Right to Mine

http://contentmine.org

http://contentmine.org/

Background• Contentmine aims to make large areas of scientific fact OPEN (100 million

facts/year)• We’re working with WellcomeTrust, Europe PubMedCentral, etc.• A politically “hot” area (Hargreaves legislation, EU activity)• 2015 WellcomeTrust workshop on TDM and Neuroscience; “rough

consensus” on what was needed.• Day workshop at Cochrane, UK (Amy Price, Anna Noel Storr, Ben Goldacre)• 2-day workshop at Edinburgh on Systematic Reviews of Animal Test

publications• In the last few months we’ve prototyped a unique Open starting point,

continuously released.• Now actively building communities (plants, clinical, animals, psychology,

crystallography, HEPhys)

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be

prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)Vera Mussah (director of county health services)

Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing



catalogue

getpapers

query

DailyCrawl

EuPMC, arXivCORE , HAL,(UNIV repos)

ToCservices

PDF HTMLDOC ePUB TeX XML

PNGEPS CSV

XLSURLsDOIs

crawl

quickscrape

normaNormalizerStructurerSemanticTagger

Text

DataFigures

ami

UNIVRepos

search

LookupCONTENTMINING

Chem

Phylo

Trials

CrystalPlants

COMMUNITY

plugins

Visualizationand Analysis

PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…

Publisher Sites

scrapersqueries

taggers

abstract

methods

references

CaptionedFigures

Fig. 1

HTML tables

30, 000 pages/day Semantic ScholarlyHTML

Facts

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRYTEXT

MATH

contentmine.org tackles these

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF

What is “Content”?

How “data” are published in the 21st C

ContentMine Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Workshops (1-hour -> full day or more)

2014-May->Nov• Budapest/Shuttleworth• Leicester Univ• Electronic Theses and Dissertations• Austrian Science Fund AT• OKFest DE• Eur. Bioinformatics Institute• Open Science Rio de Janeiro BR• Sci DataCon , Delhi IN• Univ of Chicago US• OpenCon 2014, Wash DC. US• JISC , London

Upcoming• LIBER • Cochrane• BL• Wellcome Trust (April)• WHO

Collaborators

• Wikimedia/Wikidata• Mozilla• Open Knowledge• LIBER (European Research Libraries)• British Library• Wellcome Trust• EBI (Eur. Bioinf. Inst.)• JISC• Open Access Button• SPARC• Creative Commons• CORE• EuropePubmedCentral

Facts Marked by “non-scientists” in ContentMine workshops

With Wikipedia everyone can be a scientist

Linked Open Data – the world’s knowledge

very little physical science and THESES?? http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png

DBPedia

BIO

Comp

Lib

PDB

Ontologies

GOV

GOV.uk

Music,ArtLiterature

Social

Knowledgebases

RDF triples

http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter?Indexed by CAT

https://en.wikipedia.org/wiki/Irrigation

https://en.wikipedia.org/wiki/Irrigation

http://creativecommons.org/licenses/by-sa/3.0

PLoSONE BMC1

BMC2

Closed1 Closed2Hybrid

CATalog

Enhanced annotated articles

FACTSFACTS

Daily Crawl

Crawl … Scrape … Normalize … Mine

Linked OpenData

Semantic Scientific Objects

2000-5000 Articles

quickscrapeCrawlFeed Norma Index &

Transform

TXTXML

URL

DOI

Scientificliterature

Repositories DOC

CSV

sHTML

PluginsRegex

SequencesSpecies

Bespoke

ScrapersXPathPer-Journal

TaggersPer- Journal

MetadataChemistry

Phylogenetics Farming

AMI

BadHTML

OCR

Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index

PDF

AMI-plugins• BagOfWords, Stemming and Regular Expressions• Species• Biological Sequences• Chemical compounds & reactions

• Crystallography * (Saulius Grazulis, COD)• Clinical Trials * (Amy Price)

• Phylogenetics * (Ross Mounce)

• Phytochemistry * (Chris Steinbeck, PMR)• Psychology * (Chris Hartgerink)• HighEnergyPhysics (Durham) * subcommunities

Questions we can tackle• How to we find (mentions of) clinical/animal trials?• Is a document a trial?• What is the subject of the trial?• What is the methodology used?• Does the design and practice conform to

CONSORT/ARRIVE?• What are the outcomes?• Can we extract specific re-usable information?• Who are involved? (researchers, sponsors, patients?)• Has a proposed trial been completed and reported?

Text-based plugins

• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)

• https://en.wikipedia.org/wiki/Tf%E2%80%93idf (Term-frequency, inverse document frequency)• Templates and regexes (regular expressions).

https://en.wikipedia.org/wiki/Bag-of-words_model

https://en.wikipedia.org/wiki/Bag-of-words_model

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

“Bag of Words”

Three fulltext articles from trialsjournal.com

Regular Expressions for Systematic Reviews of Animal Tests

Preceding TextFollowing Text

Extracted term

http://chemicaltagger.ch.cam.ac.uk/

• Typical

Typical chemical synthesis

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Ln Bacterial load per fly

11.5

11.0

10.5

10.0

9.5

9.0

6.5

6.0

Days post—infection

0 1 2 3 4 5

Bitmap Image and Tesseract OCR

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-fram animation showing how the 12 reactions lead into each other

CLICK HERE FOR ANIMATION

(may be browser dependent)

http://dx.doi.org/10.3390/metabo2010100

https://bytebucket.org/petermr/xhtml2stm/wiki/animation.svg

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction

Thinning Topology

Serialization

Newick

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/




Phytochemistry extraction

O. dayi

“volatile composition of “

A.sibeiri

A. judaica

Displayed by CAT (CottageLabs)

What we can do

• Recognize and promote autonomous sub-communities

• Engage Early Career Researchers, including undergraduates and let THEM BUILD the systems.

• COMMUNALLY build tools for data checking• Insist on semantic data input, even if it costs

submissions

contentmine.org team