mining scientific diagrams for facts

Post on 14-Feb-2017


Mining Scientific Images

Peter Murray-Rust,

Dept of Chemistry and The ContentMine

DAMTP, Cambridge, UK, 2016-01-27

contentmine.org is supported by a grant to PMR as a

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Output of scholarly publishing

[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

• 586,364 Crossref DOIs per month (201507) [1]
• 2.5 million (papers + supplemental data) per year [citation needed]*
• Each 3 mm thick: 4500 m high per year [2]

* Most is not publicly readable

[1] http://www.crossref.org/01company/crossref_indicators.html

Most Publishers destroy structured information (LaTeX, Word) into PDF …

• Characters (NOT words or higher structure): "WORD" is simply 4 characters, with no space characters
• Paths (NOT circles, squares …): “Vectors”
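Because a PDF stores positioned characters rather than words, word boundaries have to be inferred from geometry. The following is a minimal sketch of one way to do that, not the ContentMine implementation: the `(text, x_left, x_right)` tuple layout, the function name, and the `gap_factor` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: one character on a single text line as (text, x_left, x_right).
Char = Tuple[str, float, float]


def group_into_words(chars: List[Char], gap_factor: float = 0.3) -> List[str]:
    """Join characters into words, starting a new word at each wide horizontal gap."""
    if not chars:
        return []
    chars = sorted(chars, key=lambda c: c[1])          # left-to-right order
    mean_width = sum(right - left for _, left, right in chars) / len(chars)
    words = [chars[0][0]]
    for prev, cur in zip(chars, chars[1:]):
        gap = cur[1] - prev[2]                          # space between neighbouring boxes
        if gap > gap_factor * mean_width:
            words.append(cur[0])                        # wide gap: start a new word
        else:
            words[-1] += cur[0]                         # narrow gap: same word
    return words


# Six characters with no space characters between them, only coordinates.
example_chars = [("W", 0, 10), ("O", 11, 20), ("R", 21, 30), ("D", 31, 40),
                 ("i", 50, 53), ("s", 54, 60)]
print(group_into_words(example_chars))
```

A real page would also need line detection, font-size changes, and kerning handled; the sketch covers only the gap test.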

… APIs then destroy it further into Pixels (e.g. PNG or JPG )

ContentMine will read 10,000 PNGs a day and try to recover the science.

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

PMR is collaborating with the European Bioinformatics Institute to liberate all metabolic information from journals

Examples of plots

Multisegment diagram


Whitespace “corridors”

Superpixel

Bounding box

Semantic labels

Chemistry in Patents

Obfuscation?

Chemical Computer Vision

Raw mobile photo; problems: shadows, contrast, noise, skew, clipping

BoofCV Operations

Low Level Image Processing
• Blur: Different operations for smoothing/blurring images.
• Derivatives: Shows the first and second order image derivatives.
• Contour: Detects the contour/edges of objects inside an image.
• Denoising: Ways to remove noise from images, e.g. wavelet and blur filters.
• Interpolation: Shows different interpolation algorithms scaling up an image.
• Binary Operations: Different basic binary image operations.
• Remove lens distortion
• Lines
• Orientation
• Shape Fitting
• Superpixels

Boofcv.org Open Source Java Library

Binarization (pixels = 0,1)

Irregular edges

Antialiased Original

Binarization
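As a rough illustration of binarization, the sketch below maps a greyscale image to 0/1 with a single fixed threshold. The slides do not say which thresholding method was used, so the global threshold of 128 and the function name are assumptions; the point is that antialiased grey edge pixels are forced to one side or the other, which is where irregular edges come from.

```python
def binarize(gray, threshold=128):
    """Map a greyscale image (rows of 0..255 ints) to 0/1, where 1 = dark (ink)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# One row crossing an antialiased edge: white, light grey, mid grey, dark grey, black.
edge_row = [[255, 200, 120, 30, 0]]
print(binarize(edge_row))
```

Moving the threshold shifts the edge by a pixel in either direction, so line widths in the binary image depend on this choice.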

Colours – antialiasing and posterisation

Posterisation

Extracted since unique posterized colour
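One way to picture posterisation followed by colour-based extraction is sketched below. It assumes a simple uniform quantisation of each RGB channel (snapping to bin centres); the slides do not specify the scheme, and both function names are hypothetical.

```python
def posterize(pixel, levels=4):
    """Snap each RGB channel (0..255) to the centre of one of `levels` equal bins."""
    step = 256 // levels
    return tuple(min(c // step, levels - 1) * step + step // 2 for c in pixel)


def extract_colour(image, target, levels=4):
    """Return (x, y) positions whose posterized colour equals `target`."""
    return [(x, y)
            for y, row in enumerate(image)
            for x, pixel in enumerate(row)
            if posterize(pixel, levels) == target]


# Two slightly different reds (as antialiasing or JPEG noise might produce),
# one white pixel and one blue pixel.
image = [[(250, 10, 10), (255, 255, 255)],
         [(240, 20, 5), (10, 10, 250)]]
red = posterize((250, 10, 10))
print(red, extract_colour(image, red))
```

Once nearby shades collapse to the same posterized value, a line or symbol drawn in a colour that nothing else uses can be pulled out by a plain equality test.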

Canny edge detection

Erosion and Dilation

• https://en.wikipedia.org/wiki/Mathematical_morphology

Erosion Opening

Dilation Closing

Dilation followed by erosion can remove small breaks, etc.
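The four operations can be sketched in a few lines of plain Python. This is an illustrative version with a 3x3 square structuring element, not BoofCV's implementation; pixels outside the image are simply ignored, which is an assumption about border handling.

```python
def _morph(img, combine):
    """Apply `combine` (min or max) over each pixel's 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = combine(window)
    return out


def dilate(img):
    return _morph(img, max)   # a pixel becomes 1 if any neighbour is 1


def erode(img):
    return _morph(img, min)   # a pixel stays 1 only if all neighbours are 1


def closing(img):
    return erode(dilate(img))  # dilation followed by erosion


def opening(img):
    return dilate(erode(img))  # erosion followed by dilation


# A one-pixel-thick horizontal line with a one-pixel break in the middle.
broken_line = [[0] * 7, [0] * 7, [1, 1, 1, 0, 1, 1, 1], [0] * 7, [0] * 7]
for row in closing(broken_line):
    print(row)
```

In this example the closing bridges the break and restores the line to its original thickness, whereas erosion alone would delete the line entirely.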

http://homepages.inf.ed.ac.uk/rbf/HIPR2/thin.htm

http://rosettacode.org/wiki/Zhang-Suen_thinning_algorithm

Algorithm

Assume black pixels are one and white pixels zero, and that the input image is a rectangular N by M array of ones and zeroes. The algorithm operates on all black pixels P1 that can have eight neighbours. The neighbours are, in order, arranged as:

P9 P2 P3
P8 P1 P4
P7 P6 P5

Obviously the boundary pixels of the image cannot have the full eight neighbours.

Define A(P1) = the number of transitions from white to black (0 -> 1) in the sequence P2, P3, P4, P5, P6, P7, P8, P9, P2. (Note the extra P2 at the end: it is circular.)

Define B(P1) = the number of black pixel neighbours of P1 ( = sum(P2 .. P9) ).

Step 1: All pixels are tested and pixels satisfying all the following conditions (simultaneously) are just noted at this stage.
(0) The pixel is black and has eight neighbours
(1) 2 <= B(P1) <= 6
(2) A(P1) = 1
(3) At least one of P2 and P4 and P6 is white
(4) At least one of P4 and P6 and P8 is white
After iterating over the image and collecting all the pixels satisfying all step 1 conditions, all these condition-satisfying pixels are set to white.

Step 2: All pixels are again tested and pixels satisfying all the following conditions are just noted at this stage.
(0) The pixel is black and has eight neighbours
(1) 2 <= B(P1) <= 6
(2) A(P1) = 1
(3) At least one of P2 and P4 and P8 is white
(4) At least one of P2 and P6 and P8 is white
After iterating over the image and collecting all the pixels satisfying all step 2 conditions, all these condition-satisfying pixels are again set to white.

Iteration: If any pixels were set in this round of either step 1 or step 2 then all steps are repeated until no image pixels are so changed.

Zhang-Suen Thinning

Thinning: thick lines to 1-pixel
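The Zhang-Suen steps quoted above translate fairly directly into code. The following is a plain-Python sketch written from that description (it is not the BoofCV or Norma code); the function names and the list-of-rows image format are assumptions.

```python
def zhang_suen(image):
    """Thin a binary image (list of rows, 1 = black, 0 = white); returns a new image."""
    img = [row[:] for row in image]
    h, w = len(img), len(img[0])

    def neighbours(y, x):
        # P2..P9, clockwise, starting with the pixel directly above P1.
        return [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1], img[y + 1][x + 1],
                img[y + 1][x], img[y + 1][x - 1], img[y][x - 1], img[y - 1][x - 1]]

    def thin_pass(step):
        marked = []
        # Condition (0): only pixels with all eight neighbours are considered.
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                if img[y][x] != 1:
                    continue
                n = neighbours(y, x)
                p2, _, p4, _, p6, _, p8, _ = n
                b = sum(n)                                   # B(P1)
                a = sum(1 for i in range(8)                  # A(P1), circular
                        if n[i] == 0 and n[(i + 1) % 8] == 1)
                if step == 1:
                    ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                else:
                    ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                if 2 <= b <= 6 and a == 1 and ok:
                    marked.append((y, x))
        for y, x in marked:                                  # whiten only after the scan
            img[y][x] = 0
        return bool(marked)

    changed = True
    while changed:
        changed_1 = thin_pass(1)
        changed_2 = thin_pass(2)
        changed = changed_1 or changed_2
    return img


# A bar three pixels thick and six pixels long inside a 7 x 10 image.
bar = [[1 if 2 <= y <= 4 and 2 <= x <= 7 else 0 for x in range(10)] for y in range(7)]
for row in zhang_suen(bar):
    print("".join("#" if v else "." for v in row))
```

Pixels are collected during the scan and whitened only afterwards, which is what the description means by conditions being tested "simultaneously"; whitening during the scan would change later neighbourhood tests and give a different skeleton.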

Chemical Optical Character Recognition

A small alphabet, clean typefaces, and clear boundaries make this relatively tractable. Problem characters include “I”, “O”, etc.

[Figure: plot of "Ln bacterial load per fly" (y axis, tick labels from 11.5 down to 6.0) against "Days post-infection" (x axis, tick labels 0 to 5)]

Bitmap Image and Tesseract OCR

http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014

Ross Mounce (Bath), Panton Fellow

• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:

Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)

4300 images

Note Jaggy and broken pixels

NEW Bacteria must have a phylogenetic tree

Figure labels: Length, Weight, Binomial Name, Culture/Strain, GenBank ID, Evolution Rate

IJSEM phylotrees

• International Journal of Systematic and Evolutionary Microbiology

• All new microorganisms are expected to be published there

• Consistent (though primitive) approach to trees

“Root”

OCR (Tesseract)

Norma (imageanalysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)

Automatic Open Notebook of computations

Everything is posted to Github before being analyzed

Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens

• [Identifier in Wikidata] • Missing = not found with Wikidata API

20 commonest organisms (in > 30 papers) in trees from IJSEM*

Half do not appear to be in Wikidata

Can the Wikipedia Scientists comment?

*Int. J. Syst. Evol. Microbiol.

Display your own tree
• Cut and paste…
• ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),

((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));

• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or
• http://www.trex.uqam.ca/index.php?action=newick&project=trex
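The trees above are in Newick format, which is simple enough to read with a short recursive parser. The sketch below is an illustration only (it is not the code used to build the supertree): the dictionary layout and function names are assumptions, and it handles just names, branch lengths, and nesting, without quoted labels or comments.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with 'name', 'length', 'children'."""
    text = "".join(text.split())           # drop whitespace and line breaks
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_until(stop_chars):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":                   # internal node: comma-separated children
            pos += 1
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        name = read_until(",():;")
        length = None
        if peek() == ":":                   # optional branch length
            pos += 1
            length = float(read_until(",();"))
        return {"name": name, "length": length, "children": children}

    return node()


def leaf_names(tree):
    """List the leaf labels from left to right."""
    if not tree["children"]:
        return [tree["name"]]
    return [name for child in tree["children"] for name in leaf_names(child)]


small_tree = parse_newick("((A:1,B:2):3,(C:4,D:5):6);")
print(leaf_names(small_tree))
```

Once a tree is in this form, leaf names can be looked up against an external vocabulary or compared across papers.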

Supertree for 924 species

Tree

Supertree created from 4300 papers

Plots

To be extracted:
• Symbol (x, y)
• Error bar (y+, y-)
• Line

Y axis
• Extent
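Once the axis extent and two tick labels are known, symbol positions in pixels can be converted to data values by linear interpolation. This is a minimal sketch under the assumption of linear (not logarithmic) axes; the pixel positions and the function name are invented for illustration.

```python
def axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function converting a pixel coordinate to a data value on a linear axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical tick positions: the y axis runs downwards in image coordinates.
to_x = axis_mapper(50, 0.0, 550, 5.0)      # x ticks "0" at pixel 50, "5" at pixel 550
to_y = axis_mapper(400, 6.0, 70, 11.5)     # y ticks "6.0" at pixel 400, "11.5" at pixel 70

symbol_pixel = (150, 235)                  # centre of a detected plot symbol
print(to_x(symbol_pixel[0]), to_y(symbol_pixel[1]))
```

The same mapping applied to the top and bottom of an error bar gives y+ and y-.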

Neuroscience spike traces

Typical PDF with vectors - hyperlink

But we can now turn PDFs into

Science

We can’t turn a hamburger into a cow

Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points

VECTOR PDF

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter
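The two operations named on this slide can be sketched as follows. The slide gives only the labels, so smoothing before differentiating, the kernel radius of about 3 sigma, and the edge clamping are all assumptions, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian weights out to roughly 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma=1.0):
    """Convolve with a Gaussian, repeating the end values at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)   # clamp index into range
            acc += w * signal[j]
        out.append(acc)
    return out


def second_derivative(signal):
    """Central second difference; the result is two samples shorter than the input."""
    return [signal[i - 1] - 2 * signal[i] + signal[i + 1] for i in range(1, len(signal) - 1)]


spectrum = [0, 0, 0, 0, 1, 4, 1, 0, 0, 0, 0]        # a single peak centred at index 5
curvature = second_derivative(smooth(spectrum))
peak_index = curvature.index(min(curvature)) + 1    # +1 undoes the shortening
print(peak_index)
```

A peak shows up as a strongly negative second derivative, and smoothing first keeps noise from dominating the differences.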

Automatic extraction

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

After AMI2 processing…

… AMI2 has detected a square

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

Precision + Recall for ImageAnalysis?

• Chemical Patents (obfuscation) ca 25% PR
• Binomial names from text > 99% PR
• Binomial from images (lookup) 95%+
• Trees from images (pred.)
• Molecules: image ca 90% SVG >
• Analysis massively hampered by Copyright
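For reference, precision and recall for an extraction task can be computed by comparing the extracted items with a hand-checked reference set. The sketch below shows only the definitions; the species lists are made-up illustration data and do not reproduce the figures quoted above.

```python
def precision_recall(extracted, reference):
    """Precision = correct / extracted; recall = correct / reference."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Illustration data only: one OCR misspelling and two missed names.
extracted_names = ["Escherichia coli", "Bacillus subtilis", "Bacillus subtilus"]
reference_names = ["Escherichia coli", "Bacillus subtilis",
                   "Formosa algae", "Pseudomonas aeruginosa"]
precision, recall = precision_recall(extracted_names, reference_names)
print(f"precision={precision:.2f} recall={recall:.2f}")
```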

Software Availability and collaboration

• All software OSI-compliant (non-GPL): Apache2, MIT, BSD
• http://bitbucket.org/wwmm (euclid, Jumbo6, svg, pdf2svg)
• http://bitbucket.org/petermr (svgbuilder, xhtml2stm, imageanalysis, diagramanalyzer)
• http://bitbucket.org/AndyHowlett/ami2-poc
• http://github.com/petermr/ami-plugin
• http://github.com/ContentMine
• http://boofcv.org
• Collaboration with PDFBox, TabulaPDF, JailbreakingThePDF

• Extracted data CC0

Questions and comments

Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation

PM-R has offered to mentor an MSc project this summer for anyone interested.

contentmine.org
