architecture of contentmine components contentmine.org

Architecture of TheContentMine

These slides are for enlightenment and presentations. Use http://discuss.contentmine.org/t/overall-architecture/142 for up-to-date info. Questions, comments and critiques welcome! All software is Open (BSD/Apache2).

Some diagrams are autogenerated from *.dot files, which are located in the projects (mainly Norma and AMI).

catalogue

getpapers

DailyCrawl

EuPMC, arXiv, CORE, HAL, (UNIV repos)

ToC services

PDF HTML DOC ePUB TeX XML

PNG EPS CSV

XLS URLs DOIs

quickscrape

norma: Normalizer, Structurer, Semantic Tagger

Data Figures

UNIV Repos

search

Lookup CONTENT MINING

Trials

Crystal Plants

COMMUNITY

plugins

Visualization and Analysis

PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…

Publisher Sites

scrapers queries

taggers

abstract

methods

references

Captioned Figures

Fig. 1

HTML tables

30,000 pages/day Semantic ScholarlyHTML

Latest 20150908

quickscrape Norma Index & Transform

Plugins

Sequences Species

Bespoke Scrapers XPath

Taggers

Per-Journal

Chemistry

Phylogenetics Plants

Bad HTML

Diagrams

CAT-alogue index

getpapers query

Titles + links

DailyCrawl/feed

Latest 20150908; limited in scope

Starting points for ingestion (getpapers/quickscrape/Norma)

• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CTree(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG|image): good

• PDF, XML, TXT, HTML -> Norma -> CTree(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR|TXT2HTML -> CTree(sHTML, TXT, SVG): variable
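The two ingestion routes above can be sketched as a toy model. This is only an illustration under stated assumptions: the function names (`quickscrape_stub`, `local_files_stub`, `norma_stub`) are hypothetical stand-ins, not the tools' real interfaces, and the rule for which outputs Norma can derive is a simplification of the conversions described on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class CTree:
    """Per-paper container; content kinds mirror the CTree(...) lists on the slide."""
    identifier: str
    contents: set = field(default_factory=set)


def quickscrape_stub(identifier: str) -> CTree:
    # Hypothetical stand-in for quickscrape: pretends every kind of file was fetched.
    return CTree(identifier, {"PDF", "HTML", "XML", "images/", "meta"})


def local_files_stub(identifier: str, kinds: set) -> CTree:
    # Hypothetical stand-in for the second route: files already on disk.
    return CTree(identifier, set(kinds))


def norma_stub(ctree: CTree) -> CTree:
    # Simplified assumption: sHTML can be derived from PDF, HTML or XML,
    # while TXT and SVG are derived from PDF only.
    derived = set()
    if ctree.contents & {"PDF", "HTML", "XML"}:
        derived.add("sHTML")
    if "PDF" in ctree.contents:
        derived |= {"TXT", "SVG"}
    return CTree(ctree.identifier, ctree.contents | derived)


if __name__ == "__main__":
    scraped = norma_stub(quickscrape_stub("example-id"))
    local = norma_stub(local_files_stub("local-paper", {"TXT"}))
    print(sorted(scraped.contents))
    print(sorted(local.contents))
```

In this model the scraped route always ends with sHTML, TXT and SVG available, while a route that starts from local files only yields what its inputs allow, which is one way to read the "good" versus "variable" labels.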

20150908

Norma Conversions

• Paper -> Scanned -> TIFF (avoid)

• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG: fast, variable

• PDF -> PDF2SVG-N -> sHTML, SVG, images/: slow, accurate-ish

• PDF -> PDF2TXT-N -> TXT: fast, variable

• PDF -> PDF2Image-N -> PNG: fast, accurate
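The conversion routes listed above can be held as a small lookup table. The sketch below only restates the slide's routes and their speed/accuracy labels. The `Conversion` type and the `routes` helper are hypothetical names introduced for illustration and are not part of Norma.

```python
from typing import NamedTuple


class Conversion(NamedTuple):
    tool: str          # converter name as written on the slide
    inputs: frozenset  # accepted source formats
    outputs: tuple     # produced formats
    speed: str
    accuracy: str


# Routes and labels transcribed from the slide; "Paper -> Scanned -> TIFF"
# is omitted because the slide marks it "avoid".
CONVERSIONS = [
    Conversion("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}), ("HTML", "SVG"), "fast", "variable"),
    Conversion("PDF2SVG-N", frozenset({"PDF"}), ("sHTML", "SVG", "images/"), "slow", "accurate-ish"),
    Conversion("PDF2TXT-N", frozenset({"PDF"}), ("TXT",), "fast", "variable"),
    Conversion("PDF2Image-N", frozenset({"PDF"}), ("PNG",), "fast", "accurate"),
]


def routes(source_format: str, wanted: str) -> list:
    """Return the tools that turn source_format into wanted, in slide order."""
    return [c.tool for c in CONVERSIONS
            if source_format in c.inputs and wanted in c.outputs]


if __name__ == "__main__":
    print(routes("PDF", "SVG"))
    print(routes("PNG", "SVG"))
```

Asking for SVG from a PDF returns two candidate tools, which makes the speed versus accuracy trade-off between them explicit.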

20150908

Norma End points

• Norma -> CTree(OpenSHTML-SVG) -> everything?

• Norma -> CTree(sHTML sections) -> AMI -> all text + species, chemText, sequences

• Norma -> CTree(TXT (unsectioned)) -> AMI -> bagOfWords, regex, IDs, species?

• Norma -> CTree(PNG) -> AMI -> phylo, bar/xy-plots,

• Norma -> CTree(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
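To see which normalized inputs can feed a given kind of extraction, the end points above can be inverted. This is a rough sketch: the dictionary restates the slide, the open-ended "everything?" route is left out, and the names `AMI_ENDPOINTS`, `TENTATIVE` and `inputs_for` are hypothetical, not AMI identifiers.

```python
# CTree content -> what AMI is shown extracting from it on the slide.
AMI_ENDPOINTS = {
    "sHTML sections": ["text", "species", "chemText", "sequences"],
    "TXT (unsectioned)": ["bagOfWords", "regex", "IDs", "species"],
    "PNG": ["phylo", "bar/xy-plots"],
    "SVG": ["phylo", "bar/xy-plots", "chemistry"],
}

# Pairs the slide marks with a question mark, so they are tentative.
TENTATIVE = {("TXT (unsectioned)", "species")}


def inputs_for(target: str, include_tentative: bool = True) -> list:
    """List the CTree contents from which `target` can be extracted."""
    return sorted(
        source
        for source, outputs in AMI_ENDPOINTS.items()
        if target in outputs
        and (include_tentative or (source, target) not in TENTATIVE)
    )


if __name__ == "__main__":
    print(inputs_for("phylo"))
    print(inputs_for("species", include_tentative=False))
```

Keeping the tentative pairs in a separate set preserves the slide's uncertainty rather than silently treating "species?" as supported.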

Pre/early Norma toolchain

Transforming PDF and PNG into higher value components

20150908. Diagram autogenerated from *.dot graph

getpapers/quickscrape/Norma workflow

Getpapers/quickscrape/Norma: commonest uses

AMI: inputs and outputs for common plugins

Earlier diagrams

Probably significantly out of date, but may contain useful info.

NORMALIZE

Norma: Convert PDF, XML to sHTML; Tag sections

Normalized Scientific Literature

AMI: Index, Transform, Extract, Search

PDF2SVG, XSL stylesheets, Taggers

normalization Parameters

“Permanent” Filestore

Temporary Filestore

Extracted facts, indexes

Plugins, Regex

PDF: Non-Unicode, Pixel glyphs, No words, No structures

ScholarlyHTML

High-level graphics

PDF2SVG

characters

Sentences, Paras, tables

PNG OCR

Tagged Sections

SVGBuilder

Captioned Figures

XSLT1/2

Raw HTML: Not well-formed, Bad character semantics

ScholarlyHTML

Well-formed XHTML

Tagged Sections

Captioned Figures

Tables

Captioned Tables

XML, HtmlTidy, Jsoup, HtmlUnit

XSLT1/2

Per-journal Stylesheets

RSU: Richard Smith-Unna

PMR: Peter Murray-Rust

CL: CottageLabs

Queues Repos

Scientific literature

Science Plugins

Science Volunteers

Collaboration with Open Access Button

quickscrape Crawl Feed Norma Index & Transform

TXT XML

Scientific literature

Repositories DOC

Plugins, Regex

Sequences Species

Bespoke

Scrapers XPath Per-Journal

Taggers Per-Journal

Metadata Chemistry

Phylogenetics Farming

Bad HTML

Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index
