ContentMine architecture

RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers Collaboration with Open Access

Upload: petermurrayrust

Post on 02-Aug-2015


Category: Software



TRANSCRIPT

RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs

Queues, Repos

Scientific literature

Science Plugins

Science Volunteers

Collaboration with Open Access Button

quickscrape, Crawl, Feed

Norma, Index & Transform

PDF

XML

URL

DOI

Scientific literature

Repositories DOC

CSV

sHTML

Plugins

Regex

Sequences, Species

Bespoke

Scrapers

XPath, Per-Journal

Taggers, Per-Journal

Metadata, Chemistry

Phylogenetics Farming

AMI

Bad HTML

OCR

Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index

Starting points

• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CMDir(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG): good

• PDF, XML, HTML -> Norma -> CMDir(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR -> CMDir(sHTML, TXT, SVG): variable
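The two starting points above both pass a per-paper directory (the CMDir) from one tool to the next. A minimal sketch of that idea follows. The file names and the helper functions `make_cmdir` and `stage` are assumptions made for illustration; the slides list only the kinds of content a CMDir holds, not its layout.

```python
from pathlib import Path

# Hypothetical file names: the slides name the content types
# (PDF, HTML, XML, images/, meta; then sHTML, TXT, SVG), not the files.
RAW_FILES = {"fulltext.pdf", "fulltext.html", "fulltext.xml"}
NORMALIZED_FILES = {"scholarly.html", "fulltext.txt", "fulltext.svg"}


def make_cmdir(root: Path, identifier: str) -> Path:
    """Create an empty per-paper directory with an images/ subdirectory."""
    # Identifiers such as DOIs contain '/', which cannot appear in a directory name.
    safe = identifier.replace("/", "_").replace(":", "_")
    cmdir = root / safe
    (cmdir / "images").mkdir(parents=True, exist_ok=True)
    return cmdir


def stage(cmdir: Path) -> str:
    """Report how far along the pipeline a CMDir is, judged by the files in it."""
    names = {p.name for p in cmdir.iterdir()}
    if names & NORMALIZED_FILES:
        return "normalized"  # Norma has written sHTML, TXT or SVG
    if names & RAW_FILES:
        return "scraped"     # quickscrape (or a manual copy) has added raw files
    return "empty"
```

Keeping the state in the directory itself means each tool can be run, rerun or replaced independently, which matches the arrow-by-arrow description on the slide.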

Conversions

• Paper -> Scanned -> TIFF (avoid)

• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG: fast, variable

• PDF -> PDF2SVG-N -> sHTML, SVG, images/: slow, accurate-ish

• PDF -> PDF2TXT-N -> TXT: fast, variable

• PDF -> PDF2Image-N -> PNG: fast, accurate
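The conversion routes above amount to a small lookup table: input formats, output formats, and a speed/accuracy rating. The sketch below records them as data and selects the routes that fit a given input and output. The ratings are copied from the slide; the `Converter` record and the `candidates` function are illustrative assumptions, not part of Norma.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Converter:
    name: str
    inputs: frozenset
    outputs: frozenset
    speed: str
    accuracy: str


# One entry per conversion route listed on the slide.
CONVERTERS = [
    Converter("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}),
              frozenset({"HTML", "SVG"}), "fast", "variable"),
    Converter("PDF2SVG-N", frozenset({"PDF"}),
              frozenset({"sHTML", "SVG", "images/"}), "slow", "accurate-ish"),
    Converter("PDF2TXT-N", frozenset({"PDF"}),
              frozenset({"TXT"}), "fast", "variable"),
    Converter("PDF2Image-N", frozenset({"PDF"}),
              frozenset({"PNG"}), "fast", "accurate"),
]


def candidates(source: str, target: str) -> list:
    """Return the names of converters that accept `source` and emit `target`."""
    return [c.name for c in CONVERTERS
            if source in c.inputs and target in c.outputs]
```

For example, `candidates("PDF", "SVG")` lists both the OCR route and the PDF2SVG route, leaving the speed/accuracy trade-off to the caller.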

Raw HTML: not well-formed, bad character semantics

ScholarlyHTML

Well-formed XHTML

PNG

Tagged sections

Captioned figures

Tables

Captioned tables

XML; HtmlTidy, Jsoup, HtmlUnit

XSLT1/2

XSLT1/2

NORMA

Per-journal stylesheets
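The first step in the diagram above, turning raw HTML that is not well-formed into well-formed XHTML, has to happen before any XSLT stylesheet can run. The slide names HtmlTidy, Jsoup and HtmlUnit for this; the sketch below is a stand-in that uses only the Python standard library, and the `Tidy` class and `tidy` function are assumptions for illustration. It closes unclosed elements, self-closes void elements, quotes attributes and escapes bare ampersands. It does not handle duplicate attributes or multiple root elements.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class Tidy(HTMLParser):
    """Re-serialize tag soup as well-formed XML, closing what was left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def _start(self, tag, attrs, closed):
        # A valueless attribute (e.g. <td nowrap>) gets its own name as value.
        text = "".join(' %s="%s"' % (k, escape(v if v is not None else k, quote=True))
                       for k, v in attrs)
        self.out.append("<%s%s%s>" % (tag, text, "/" if closed else ""))

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self._start(tag, attrs, closed=True)
        else:
            self._start(tag, attrs, closed=False)
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, closed=True)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        while self.open_tags:  # close anything left open inside this element
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def tidy(raw_html: str) -> str:
    parser = Tidy()
    parser.feed(raw_html)
    parser.close()
    while parser.open_tags:  # close whatever is still open at end of input
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)


def is_well_formed(markup: str) -> bool:
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

Once the output parses as XML, a per-journal stylesheet can address sections, figures and tables by path, which is what makes the later tagging steps possible.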

End points

• Norma -> CMDir(OpenSHTML-SVG)

• Norma -> CMDir(sHTML. sections) -> AMI -> (all text + species, chemistry, sequences)

• Norma -> CMDir(TXT (unsectioned)) -> AMI -> bagOfWords, regex

• Norma -> CMDir(PNG) -> AMI -> phylo, bar/xy-plots

• Norma -> CMDir(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
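The unsectioned-TXT end point above feeds two of the simplest AMI plugins, bagOfWords and regex. The sketch below shows what each one computes. It is not AMI's implementation: the stopword list, the `SEQUENCE` pattern and both function names are assumptions chosen for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")
# Hypothetical pattern: a run of ten or more nucleotide letters.
SEQUENCE = re.compile(r"\b[ACGT]{10,}\b")
STOPWORDS = frozenset({"the", "a", "of", "and", "in", "was", "is"})


def bag_of_words(text: str, stopwords=STOPWORDS) -> Counter:
    """Count lower-cased words, ignoring common stopwords."""
    words = (w.lower() for w in WORD.findall(text))
    return Counter(w for w in words if w not in stopwords)


def regex_hits(text: str, pattern: re.Pattern) -> list:
    """Return (offset, matched text) pairs so each hit can be traced to its source."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]
```

Recording the offset with each match keeps an extracted fact tied to its position in the normalized text, which is useful when facts are later indexed.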

PDF: non-Unicode, pixel glyphs, no words, no structures

ScholarlyHTML

SVG

High-level graphics

PDF2SVG

characters

Sentences, paras, tables

PNG OCR

Tagged sections

SVGBuilder

Captioned figures

NORMA

XSLT1/2

NORMALIZE

Norma: convert PDF, XML to sHTML; tag sections

Normalized Scientific Literature

AMI: Index, Transform, Extract, Search

PDF2SVG, XSL stylesheets, Taggers

Normalization parameters

“Permanent” Filestore

Temporary Filestore

Extracted facts, indexes

Plugins, Regex