Mining Scientific Diagrams for facts
![Page 1: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/1.jpg)
Mining Scientific Images
Peter Murray-Rust,
Dept of Chemistry and TheContentMine
DAMTP, Cambridge, UK, 2016-01-27
contentmine.org is supported by a grant to PMR as a
![Page 2: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/2.jpg)
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
![Page 3: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/3.jpg)
Output of scholarly publishing
586,364 Crossref DOIs per month (201507) [1]
2.5 million (papers + supplemental data) per year [citation needed]*
each 3 mm thick: 4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
![Page 4: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/4.jpg)
Most Publishers destroy structured information (LaTeX, Word) into PDF …
• Characters (NOT words or higher structure): WORD is simply 4 characters, no space chars
• Paths (NOT circles, squares …): “Vectors”
… APIs then destroy it further into Pixels (e.g. PNG or JPG )
ContentMine will read 10,000 PNGs a day and try to recover the science.
![Page 5: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/5.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
![Page 6: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/6.jpg)
PMR is collaborating with the European Bioinformatics Institute to liberate all metabolic information from journals
![Page 7: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/7.jpg)
Examples of plots
![Page 8: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/8.jpg)
Multisegment diagram
![Page 9: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/9.jpg)
Multisegment diagram
Whitespace “corridors”
Superpixel
Bounding box
Semantic labels
![Page 10: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/10.jpg)
![Page 11: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/11.jpg)
![Page 12: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/12.jpg)
Chemistry in Patents
Obfuscation?
![Page 13: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/13.jpg)
Chemical Computer Vision
Raw mobile photo; problems: shadows, contrast, noise, skew, clipping
![Page 14: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/14.jpg)
BoofCV Operations
Low Level Image Processing
• Blur: Different operations for smoothing/blurring images.
• Derivatives: Shows the first and second order image derivatives.
• Contour: Detects the contour/edges of objects inside an image.
• Denoising: Ways to remove noise from images, e.g. wavelet and blur filters.
• Interpolation: Shows different interpolation algorithms scaling up an image.
• Binary Operations: Different basic binary image operations.
• Remove lens distortion
• Lines
• Orientation
• Shape Fitting
• Superpixels
Boofcv.org Open Source Java Library
![Page 15: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/15.jpg)
https://en.wikipedia.org/wiki/Otsu's_method
Thresholding(Binarization)
https://en.wikipedia.org/wiki/Thresholding_%28image_processing%29
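The slide does not say which thresholding variant ContentMine uses, so the following is only a minimal sketch of Otsu's method as described on the linked Wikipedia page. It assumes an 8-bit greyscale image given as plain Python lists; the function names `otsu_threshold` and `binarize` are illustrative, not part of BoofCV or any ContentMine tool.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form one class, all brighter pixels the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0     # number of pixels at or below t
    sum0 = 0   # sum of grey values at or below t
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image, t):
    """Map dark pixels (<= t) to 1 (ink) and bright pixels to 0 (paper)."""
    return [[1 if p <= t else 0 for p in row] for row in image]
```

Mapping dark pixels to 1 follows the convention of the thinning slide later in the deck, where black pixels are one and white pixels zero.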
![Page 16: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/16.jpg)
Binarization (pixels = 0,1)
Irregular edges
![Page 17: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/17.jpg)
Antialiased Original
Binarization
![Page 18: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/18.jpg)
Colours – antialiasing and posterisation
![Page 19: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/19.jpg)
![Page 20: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/20.jpg)
Posterisation
Extracted since unique posterized colour
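One way to read "extracted since unique posterized colour" is sketched below: each RGB channel is snapped to a small number of levels so that antialiased shades of one colour collapse together, and the pixels carrying a chosen posterised colour are then pulled out as a mask. The number of levels and the function names are assumptions for illustration, not the deck's actual settings.

```python
def posterize(pixel, levels=4):
    """Snap each RGB channel (0..255) to the centre of one of `levels` bins."""
    step = 256 // levels
    return tuple(min(c // step, levels - 1) * step + step // 2 for c in pixel)


def extract_colour(image, target, levels=4):
    """Return a 0/1 mask of the pixels whose posterised colour equals `target`."""
    return [[1 if posterize(px, levels) == target else 0 for px in row]
            for row in image]


# Two slightly different reds (as antialiasing would produce) and a white pixel.
row = [(250, 10, 5), (240, 30, 20), (255, 255, 255)]
red = posterize(row[0])
mask = extract_colour([row], red)
```

With four levels per channel both reds fall into the same bin, so the mask selects them and leaves the white background out.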
![Page 21: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/21.jpg)
Canny edge detection
![Page 22: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/22.jpg)
Erosion and Dilation
• https://en.wikipedia.org/wiki/Mathematical_morphology
Erosion Opening
Dilation Closing
Dilation followed by erosion can remove small breaks, etc.
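The four operations named on this slide can be sketched on a binary image as follows. This assumes a 3x3 square structuring element and considers only in-bounds neighbours at the image border; both choices are assumptions, and libraries such as BoofCV offer their own variants.

```python
def _window(img, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped to the image."""
    rows, cols = len(img), len(img[0])
    return [img[rr][cc]
            for rr in range(max(r - 1, 0), min(r + 2, rows))
            for cc in range(max(c - 1, 0), min(c + 2, cols))]


def erode(img):
    """A pixel stays 1 only if every pixel in its window is 1."""
    return [[1 if all(_window(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


def dilate(img):
    """A pixel becomes 1 if any pixel in its window is 1."""
    return [[1 if any(_window(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


def closing(img):
    """Dilation followed by erosion: bridges small breaks."""
    return erode(dilate(img))


def opening(img):
    """Erosion followed by dilation: removes small specks and thin lines."""
    return dilate(erode(img))


# A one-pixel-thick horizontal line with a one-pixel break in the middle.
line = [[0] * 7,
        [0] * 7,
        [1, 1, 1, 0, 1, 1, 1],
        [0] * 7,
        [0] * 7]
repaired = closing(line)
```

On this input, closing fills the break, while opening would erase the line altogether because it is thinner than the structuring element.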
![Page 23: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/23.jpg)
Zhang-Suen Thinning
http://homepages.inf.ed.ac.uk/rbf/HIPR2/thin.htm
http://rosettacode.org/wiki/Zhang-Suen_thinning_algorithm
Algorithm
Assume black pixels are one and white pixels zero, and that the input image is a rectangular N by M array of ones and zeroes. The algorithm operates on all black pixels P1 that can have eight neighbours. The neighbours are, in order, arranged as:
P9 P2 P3
P8 P1 P4
P7 P6 P5
Obviously the boundary pixels of the image cannot have the full eight neighbours.
Define A(P1) = the number of transitions from white to black (0 -> 1) in the sequence P2, P3, P4, P5, P6, P7, P8, P9, P2. (Note the extra P2 at the end: it is circular.)
Define B(P1) = the number of black pixel neighbours of P1 ( = sum(P2 .. P9) ).
Step 1
All pixels are tested and pixels satisfying all the following conditions (simultaneously) are just noted at this stage.
(0) The pixel is black and has eight neighbours
(1) 2 <= B(P1) <= 6
(2) A(P1) = 1
(3) At least one of P2 and P4 and P6 is white
(4) At least one of P4 and P6 and P8 is white
After iterating over the image and collecting all the pixels satisfying all step 1 conditions, all these condition-satisfying pixels are set to white.
Step 2
All pixels are again tested and pixels satisfying all the following conditions are just noted at this stage.
(0) The pixel is black and has eight neighbours
(1) 2 <= B(P1) <= 6
(2) A(P1) = 1
(3) At least one of P2 and P4 and P8 is white
(4) At least one of P2 and P6 and P8 is white
After iterating over the image and collecting all the pixels satisfying all step 2 conditions, all these condition-satisfying pixels are again set to white.
Iteration
If any pixels were set in this round of either step 1 or step 2 then all steps are repeated until no image pixels are so changed.
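The steps quoted above translate fairly directly into code. The following is a minimal sketch that follows the Rosetta Code description, with the image as a list of lists of 0/1; it is not the implementation used in the ContentMine tools, and the function names are illustrative.

```python
def neighbours(img, r, c):
    """Return [P2, P3, ..., P9]: the 8 neighbours clockwise, starting north."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]


def transitions(n):
    """A(P1): number of 0 -> 1 transitions in the circular sequence P2..P9, P2."""
    ring = n + n[:1]
    return sum(1 for a, b in zip(ring, ring[1:]) if (a, b) == (0, 1))


def zhang_suen(image):
    """Thin a binary image (1 = black) until no pixel changes."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (1, 2):
            to_clear = []
            # Condition (0): only pixels with all eight neighbours are tested.
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(img, r, c)
                    p2, p3, p4, p5, p6, p7, p8, p9 = n
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 1:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_clear.append((r, c))
            # Pixels are only noted during the scan, then cleared together.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A bar three pixels thick (rows 2..4, columns 1..8) in a 7 x 10 image.
bar = [[1 if 2 <= r <= 4 and 1 <= c <= 8 else 0 for c in range(10)]
       for r in range(7)]
skeleton = zhang_suen(bar)
```

For this bar the result is a single row of pixels along the middle; note that the line also gets shorter at both ends, which is a known side effect of this kind of thinning.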
![Page 24: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/24.jpg)
Thinning: thick lines to 1-pixel
![Page 25: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/25.jpg)
Vectorization of line segments
Nodes
Non-node
Segmentation of one edge into 4 lines
Douglas-Peucker segmentation algorithm
Fully thinned binary image
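As an illustration of how a thinned pixel chain can be reduced to a few straight segments, here is a minimal sketch of the Douglas-Peucker algorithm. The tolerance value and the pixel chain are made up for the example; the slide does not state which parameters are used.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)


def douglas_peucker(points, tolerance):
    """Keep the end points, and recursively keep the point farthest from the
    chord whenever it lies more than `tolerance` away from it."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], a, b)
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [a, b]
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right  # the split point appears in both halves


# An L-shaped chain of pixel coordinates (hypothetical data).
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
segments = douglas_peucker(chain, tolerance=0.5)
```

The seven pixels reduce to three points, that is, two line segments meeting at the corner.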
![Page 26: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/26.jpg)
Chemical Optical Character Recognition
Small alphabet, clean typefaces, clear boundaries make this relatively tractable. Problems are “I” “O” etc.
![Page 27: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/27.jpg)
Bitmap Image and Tesseract OCR
[Plot: y axis “Ln Bacterial load per fly” (ticks 6.0 to 11.5); x axis “Days post-infection” (ticks 0 to 5)]
![Page 28: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/28.jpg)
http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014
![Page 29: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/29.jpg)
![Page 30: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/30.jpg)
Ross Mounce (Bath), Panton Fellow
• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:
Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)
![Page 31: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/31.jpg)
4300 images
![Page 32: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/32.jpg)
Note Jaggy and broken pixels
NEW Bacteria must have a phylogenetic tree
Length_________Weight Binomial Name Culture/Strain GENBANK ID
EvolutionRate
![Page 33: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/33.jpg)
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
![Page 34: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/34.jpg)
IJSEM phylotrees
• International Journal of Systematic and Evolutionary Microbiology
• All new microorganisms are expected to be published there
• Consistent (though primitive) approach to trees
![Page 35: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/35.jpg)
“Root”
![Page 36: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/36.jpg)
OCR (Tesseract)
Norma (imageanalysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
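To show why Newick output is "computable", here is a minimal sketch of a recursive-descent reader for the simple subset of Newick used above (nested parentheses, names, and `:length` suffixes). It is an illustration only, not the ContentMine code, and it does not handle quoted labels or comments. The example string is a subtree copied from the slide's output, with a terminating semicolon added.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with the keys
    'name', 'length' and 'children'."""
    text = text.strip()
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos:pos + 1] == "(":
            pos += 1
            children.append(parse_node())
            while text[pos:pos + 1] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos:pos + 1] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos:pos + 1] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos:pos + 1] != ";":
        raise ValueError("Newick string must end with ';'")
    return root


def leaf_names(node):
    """List the names of all leaves, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


fragment = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
            "Synergistes_jonesii:301);")
tree = parse_newick(fragment)
```

Once parsed, the species names and branch lengths are ordinary data that can be looked up, counted, or merged with other trees.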
![Page 37: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/37.jpg)
![Page 38: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/38.jpg)
Automatic Open Notebook of computations
Everything is posted to Github before being analyzed
![Page 39: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/39.jpg)
Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens
• [Identifier in Wikidata]
• Missing = not found with Wikidata API
20 commonest organisms (in > 30 papers) in trees from IJSEM*
Half do not appear to be in Wikidata
Can the Wikipedia Scientists comment?
*Int. J. Syst. Evol. Microbiol.
![Page 40: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/40.jpg)
Display your own tree
• Cut and paste…
• ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),
((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));
• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or
• http://www.trex.uqam.ca/index.php?action=newick&project=trex
![Page 41: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/41.jpg)
Supertree for 924 species
Tree
![Page 42: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/42.jpg)
Supertree created from 4300 papers
![Page 43: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/43.jpg)
Plots
![Page 44: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/44.jpg)
![Page 45: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/45.jpg)
To be extracted:
• Symbol (x, y)
• Error bar (y+, y-)
• Line
Y axis
• Extent
![Page 46: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/46.jpg)
![Page 47: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/47.jpg)
![Page 48: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/48.jpg)
Neuroscience spike traces
![Page 49: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/49.jpg)
![Page 50: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/50.jpg)
![Page 51: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/51.jpg)
Typical PDF with vectors - hyperlink
![Page 52: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/52.jpg)
But we can now turn PDFs into
Science
We can’t turn a hamburger into a cow
Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE
![Page 53: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/53.jpg)
UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!! 2000+ points
VECTOR PDF
![Page 54: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/54.jpg)
Dumb PDF
CSV
Semantic Spectrum
2nd Derivative
Smoothing Gaussian Filter
Automatic extraction
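Once a spectrum has been recovered as a list of numbers (for example from the CSV), operations such as the Gaussian smoothing and second derivative named on this slide become straightforward. The sketch below assumes evenly spaced samples, clamps the signal at its ends, and uses a made-up peak; the kernel width is an arbitrary choice for illustration.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalised 1-D Gaussian weights over [-radius, radius]."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve with a Gaussian, repeating the end values at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def second_derivative(signal):
    """Central second difference; entry k belongs to signal index k + 1."""
    return [signal[i - 1] - 2 * signal[i] + signal[i + 1]
            for i in range(1, len(signal) - 1)]


# A single hypothetical peak centred on index 5.
peak = [0, 0, 1, 3, 7, 10, 7, 3, 1, 0, 0]
d2 = second_derivative(smooth(peak, sigma=1.0))
peak_index = d2.index(min(d2)) + 1  # most negative curvature marks the peak
```

The most negative second derivative sits at the peak centre, which is one common way to locate peaks in a smoothed trace.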
![Page 55: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/55.jpg)
![Page 56: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/56.jpg)
![Page 57: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/57.jpg)
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
![Page 58: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/58.jpg)
After AMI2 processing…..
… AMI2 has detected a square
![Page 59: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/59.jpg)
![Page 60: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/60.jpg)
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)
![Page 61: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/61.jpg)
Precision + Recall for ImageAnalysis?
• Chemical Patents (obfuscation) ca 25% PR
• Binomial names from text > 99% PR
• Binomial from images (lookup) 95%+
• Trees from images (pred.)
• Molecules: image ca 90% SVG >
• Analysis massively hampered by Copyright
![Page 62: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/62.jpg)
Software Availability and collaboration
• All software OSI-compliant (non-GPL): Apache2, MIT, BSD
• http://bitbucket.org/wwmm (euclid, Jumbo6, svg, pdf2svg)
• http://bitbucket.org/petermr (svgbuilder, xhtml2stm, imageanalysis, diagramanalyzer)
• http://bitbucket.org/AndyHowlett/ami2-poc
• http://github.com/petermr/ami-plugin
• http://github.com/ContentMine
• http://boofcv.org
• collaboration with PDFBox, TabulaPDF, JailbreakingThePDF
• Extracted data CC 0
![Page 63: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/63.jpg)
Questions and comments
Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation
PM-R has offered to mentor an MSc project this summer for anyone interested.
contentmine.org
![Page 64: Mining Scientific Diagrams for facts](https://reader034.vdocuments.us/reader034/viewer/2022042722/58a206311a28ab40098b4dcd/html5/thumbnails/64.jpg)