Plazi: Prospects for Markup of Legacy and New Taxonomic Literature
Terry Catapano
TDWG, Fremantle, WA
October 21, 2008

NSF/DFG Grant (AMNH/University of Karlsruhe)
XML markup of taxonomic publications for extraction of:
• Treatments
• Scientific names
• Morphological characters
• Distribution data
• Collection locales/events
For:
• Open access
• Submission to databases
• Retrieval
• Ontology development

Markup Languages
• Provide a grammar to define document types
• Delineate and identify document elements (atoms) in text
• Syntax: structural relationships between elements (parent/child, cardinality, ordinality, id/idref, key/keyref)

Beyond the PDF
• TaxonX schema
• GoldenGate Editor
• 250 documents / 7,500 treatments
• DSpace-based digital object repository (handles)
• SRS
• TAPIR (specimen data)
• Species Profile Model/RDF (descriptive data)
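
The structural relationships a markup language expresses (parent/child nesting, id/idref links) can be illustrated with a small sketch. The element and attribute names below are hypothetical and are not taken from the TaxonX schema; the sample text is invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: names are illustrative, NOT the TaxonX schema.
SAMPLE = """
<document>
  <treatment id="t1">
    <name>Aus bus</name>
    <section type="description">Body length 3 mm.</section>
    <citation ref="b1"/>
  </treatment>
  <bibliography>
    <reference id="b1">Smith 1900. A revision of Aus.</reference>
  </bibliography>
</document>
"""


def resolve_citations(xml_text):
    """Return (treatment id, cited reference text) pairs by following ref -> id."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an id attribute (the "id" side of id/idref).
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    pairs = []
    for treatment in root.iter("treatment"):
        # Citations are children of a treatment (parent/child relationship).
        for cite in treatment.iter("citation"):
            target = by_id.get(cite.get("ref"))
            if target is not None:
                pairs.append((treatment.get("id"), target.text))
    return pairs


if __name__ == "__main__":
    print(resolve_citations(SAMPLE))
```

Because the links are explicit in the markup, a program can follow them without any knowledge of page layout, which is the practical sense of going "beyond the PDF".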

• Wildly heterogeneous
• Requires lax structuring of documents
• Need for regularization
• Requires editorial policy (reproduction: text of work or text of document)
• Defers much work of interoperability

Benefits
• Treatments, plus names, subsections, localities, bibliographic references
• Extraction and representation in other services

Costs
• GoldenGate configured for testbed: 3 minutes per page
• $5 per page (?)
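
As a rough illustration of what these per-page figures imply, the arithmetic can be sketched as below. The 3 minutes and $5 per page come from the slide (the dollar figure is itself marked as uncertain there); the page count of 1,000 is a hypothetical input, not a number from the project.

```python
def markup_estimate(pages, minutes_per_page=3.0, dollars_per_page=5.0):
    """Return (hours, dollars) for marking up `pages` pages.

    Defaults are the per-page figures quoted on the slide; the dollar
    figure is flagged as uncertain in the source.
    """
    hours = pages * minutes_per_page / 60.0
    dollars = pages * dollars_per_page
    return hours, dollars


if __name__ == "__main__":
    # 1000 pages is a hypothetical corpus size chosen for illustration.
    hours, dollars = markup_estimate(1000)
    print(f"{hours:.1f} hours, ${dollars:,.0f}")  # prints "50.0 hours, $5,000"
```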

New Literature
• Different markup activity
• Prospective, not retrospective
• More optimal cost/benefit ratio?
• Strict modeling for consistent documents/data
• Increased regularization
• Increased sharing, re-use
• Decreased costs (potentially):
  • Application
  • QC
  • Adoption
• TDWG Vocabularies supply many concepts

NLM Journal Archiving and Interchange Tag Suite
• DTDs for markup of journal articles
• Archiving, Publishing, Authoring; other modules possible
• Wide adoption by publishers and aggregators; LOC
• Actively maintained

Module for taxonomic treatments in Publishing
• Inherit generic features from existing Tag Set:
  • Bibliographic references
  • Tables
  • Linking supporting material/data (xlink)
  • Linking to graphic and media objects (xlink)
• Treatments
• Treatment sections
• Scientific names, geographic names, characters/states
• Specimens and other materials citations
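
To make the payoff of such a module concrete, here is a minimal sketch of pulling names, places, and materials citations out of a tagged treatment. All element names (`taxon-name`, `place-name`, `material-citation`, `treatment-sec`) are assumptions made for this example and should not be read as the module's actual vocabulary; the sample content is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged treatment; element names are illustrative assumptions.
SAMPLE = """
<treatment>
  <nomenclature><taxon-name>Aus bus</taxon-name> Smith, 1900</nomenclature>
  <treatment-sec type="distribution">
    <p>Known from <place-name>Madagascar</place-name> and
    <place-name>Mauritius</place-name>.</p>
  </treatment-sec>
  <treatment-sec type="materials">
    <p><material-citation>1 worker, <place-name>Madagascar</place-name>, 12 Mar 1999</material-citation></p>
  </treatment-sec>
</treatment>
"""


def extract(xml_text):
    """Collect the tagged entities of one treatment into a plain dictionary."""
    root = ET.fromstring(xml_text)

    def texts(tag):
        # Join all nested text and collapse runs of whitespace.
        return [" ".join("".join(el.itertext()).split()) for el in root.iter(tag)]

    return {
        "names": texts("taxon-name"),
        "places": sorted(set(texts("place-name"))),  # de-duplicated
        "materials": texts("material-citation"),
        "sections": [sec.get("type") for sec in root.iter("treatment-sec")],
    }


if __name__ == "__main__":
    for key, value in extract(SAMPLE).items():
        print(key, value)
```

The same few lines would work on any document that follows the schema, which is the argument for strict modeling of new literature.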

• Plazi: NLM conversion of Zootaxa and PLOS One articles
• Apply markup at earliest stage possible
• Develop tools to assist (probably easier than for “pure” legacy literature)
• Extend codes and structures to handle electronic publication

Shifts
• From “illustrated narrative” to complex digital objects
• METS, OAI-ORE, MPEG-21/DIDL
[Diagram labels: Text, Materials, Description, Treatment, Image, Data, Nomenclature]

Linked Data
• Machines > Documents > Data
• Open documents, free data
• Reduced costs of use/re-use (e.g., SPM for EOL)
• Broaden scope of application
• Accelerate velocity of information exchange
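
One way to picture "free data" is as statements that machines can exchange independently of the source document. The sketch below serializes a few facts about a treatment as N-Triples lines. Every URI uses the placeholder domain example.org and every property name is hypothetical; none of them come from Plazi, SPM, or any TDWG vocabulary.

```python
def to_ntriples(subject, properties):
    """Serialize (predicate, value) pairs about one subject as N-Triples lines.

    Values that look like http(s) URIs become resources; everything else
    becomes a plain string literal with basic escaping.
    """
    lines = []
    for predicate, value in properties:
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"
        else:
            escaped = (
                value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            )
            obj = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return lines


if __name__ == "__main__":
    # All URIs and property names below are hypothetical placeholders.
    triples = to_ntriples(
        "http://example.org/treatment/t1",
        [
            ("http://example.org/vocab/scientificName", "Aus bus"),
            ("http://example.org/vocab/publishedIn", "http://example.org/article/a1"),
        ],
    )
    print("\n".join(triples))
```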