Literature & interoperability: a working example using ants Donat Agosti, Terry Catapano, Guido Sautter, Christiana Klingenberg & Christie Stephenson TDWG 2007, Bratislava


Page 1

Literature & interoperability: a working example using ants

Donat Agosti, Terry Catapano, Guido Sautter, Christiana Klingenberg & Christie Stephenson

TDWG 2007, Bratislava

Page 2

Participating organization

Main support by US-NSF, German DFG

Page 3

Biodiversity monitoring, or what’s out there?

Measuring and monitoring biodiversity means standardized, repeated sampling. Access to taxonomic data is the main impediment to running successful surveys and to integrating surveys into mainstream conservation, which is potentially one of the biggest users of taxonomic data.

The question is: How can we provide this content in the fastest way? What is doable, and what is not?

Page 4

Literature & interoperability

http://www.blsalptransit.ch/en/frameset_e.htm

A report from a breakthrough in a long tunnel....

Page 5

Literature & interoperability

A report from a breakthrough in a long tunnel....

For the first time, the entire production chain is in place: OCR-ing, marking up, and adding all the GUIDs needed to produce a valid TaxonX document.

We can provide a stable set of encoded data and metadata that other applications can utilize (e.g. semant/iSpecies).

Page 6

Page 7

Literature & interoperability

Plazi.org
• Sandbox and data provider
• The principle: community involvement
• Develop tools and solutions to access literature, both retrospective and prospective
• Make content available through exporting data into dedicated databases
• Provide an example of an input facility for ZooBank
• Get around copyright by focusing on content by marking up documents
• Explore digital taxonomic literature "arXiv"
• Drupal based with underlying DSpace repository and handle server

Page 8

Literature & interoperability

Plazi workflow

Page 9

Literature & interoperability

Plazi products
• OCR-ed texts (dirty, clean)
• ABBYY training files for fonts
• ABBYY training files for journals
• ABBYY custom dictionary

Page 10

Literature & interoperability

GoldenGATE interactions

- Get GUID from Hymenoptera Name Server for names
- Add new names

Terminology follows ITIS; new names are currently uploaded into the Hymenoptera Name Server and queried via HTML.

Page 11

Literature & interoperability

GoldenGATE interactions

- Get GUID from Hymenoptera Name Server for names; ZooBank?
- Add new names
- Get bibliographic metadata from HNS (MODS)
- Get bibliographic GUIDs from bioguid
- Get geographic long/lat from geonames.org
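
The name-related steps in the list above can be illustrated with a minimal sketch. It is not GoldenGATE code: the in-memory dictionary `NAME_SERVER` stands in for a name-server lookup, the `urn:example:` identifiers are made up, and the regular expression is only a naive binomial matcher chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a name server; the identifiers are invented.
NAME_SERVER = {
    "Formica rufa": "urn:example:name:1",
    "Lasius niger": "urn:example:name:2",
}


@dataclass
class TaggedName:
    name: str
    guid: str
    is_new: bool  # True when the name had to be added to the server


def tag_names(text, server):
    """Find binomial-like strings and attach a GUID to each one.

    Names missing from `server` are registered with a fresh identifier,
    mirroring the "add new names" step.
    """
    # Naive pattern: capitalised genus followed by a lowercase epithet.
    pattern = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")
    tagged = []
    for match in pattern.finditer(text):
        name = match.group(0)
        is_new = name not in server
        if is_new:
            server[name] = f"urn:example:name:{len(server) + 1}"
        tagged.append(TaggedName(name, server[name], is_new))
    return tagged


if __name__ == "__main__":
    server = dict(NAME_SERVER)
    sentence = "Formica rufa and Lasius niger occur with Camponotus ligniperda."
    for entry in tag_names(sentence, server):
        print(entry)
```

A real pipeline would replace the dictionary with calls to the services named on the slide and would need a far more careful name finder; the sketch only shows the look-up-or-register pattern.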

Page 12

Literature & interoperability

• Products (1): documents

pdf, xslt-html, xml

Get one with pdf, xml

PDF (original or scanned)

HTML via XSLT

XML (TaxonX)

All documents carry GUIDs: at minimum for names and MODS; at most also for bibliographic references, specimens, and localities
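
The minimum and maximum GUID levels described above could be checked along the following lines. This is a sketch under stated assumptions: the element names (`name`, `mods`, `bibref`, `specimen`, `locality`), the `guid` attribute, and the sample document are all hypothetical and do not reproduce the actual TaxonX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the real TaxonX schema.
SAMPLE = """<document>
  <mods guid="urn:example:bib:1"/>
  <treatment>
    <name guid="urn:example:name:1">Formica rufa</name>
    <locality>Madagascar</locality>
  </treatment>
</document>"""

# Element types that must carry GUIDs at the minimal level,
# and the full set expected at the maximal level.
MINIMAL = ("name", "mods")
MAXIMAL = MINIMAL + ("bibref", "specimen", "locality")


def guid_coverage(xml_text):
    """Return {tag: (elements with a guid, total elements)} for each tag."""
    root = ET.fromstring(xml_text)
    coverage = {}
    for tag in MAXIMAL:
        elements = list(root.iter(tag))
        with_guid = [e for e in elements if e.get("guid")]
        coverage[tag] = (len(with_guid), len(elements))
    return coverage


def meets_minimum(coverage):
    """True when every minimal element type is present and fully tagged."""
    return all(
        total > 0 and tagged == total
        for tagged, total in (coverage[tag] for tag in MINIMAL)
    )


if __name__ == "__main__":
    report = guid_coverage(SAMPLE)
    print(report)
    print("meets minimum:", meets_minimum(report))
```

In the sample, the name and the MODS record carry identifiers while the locality does not, so the document reaches the minimal level but not the maximal one.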

Page 13

Literature & interoperability

Plazi workflow

Page 14

Literature & interoperability

• Products (2): Search and Retrieval Server

Page 15

Literature & interoperability

Search and Retrieval Server: Output

Page 16

Literature & interoperability

Search and Retrieval Server: Output

Page 17

Literature & interoperability

Search and Retrieval Server: Output

Page 18

Literature & interoperability

Search and Retrieval Server: Output

Page 19

Search and Retrieval Server: Output

Page 20

Literature & interoperability

Products: What content do we have in store?
• Gold standard: 120+ taxonomic publications from Madagascar, ranging from 1758 to 2007 (70% completed) (vertical)
• Recent publications continually added (horizontal standard)
• Series of publications describing elements of TaxonX, GoldenGATE, and name-finding algorithms (FindIT, FAT), and comparing approaches
• Increasing library of training files for ABBYY and analyzers for GoldenGATE

Page 21

Literature & interoperability

Additional products
• Training course for literature mark-up to get the community involved
• Creating a neotropical catalogue of the ants using the mark-up approach
• Development of metrics to measure mark-up production to optimize output for users (ecologists, taxonomists, etc.)

Page 22

Literature & interoperability

[Figure: bar chart for Ann. Soc. Entomol. Belg. Y-axis: min (0 to 7). X-axis: HNS ID (3961, 3967, 3956, 3954, 3855, 3686, 3920, 3923, 3712, 3953, 3786, 3723, 4001, 4018, 3715, 3940, 4022, 4026, 8070).]

Time in minutes to produce clean OCR using ABBYY; publications in chronological order

Producing metrics to measure effort and compare various approaches and algorithms

Page 23

Literature & interoperability

[Figure: markup time based on number of characters per document. Logarithmic axis from 1 to 1000. Series: time/character x 1000; characters/1000.]

Time used to mark up documents in TaxonX in comparison to the number of pages per volume, in chronological order

Producing metrics to measure effort and compare various approaches and algorithms to mark up documents
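
The two series named in the chart, time/character x 1000 and characters/1000, can be computed as in the sketch below. The record type, the field names, and the example values are hypothetical, and minutes are assumed as the time unit; the slide does not state it.

```python
from dataclasses import dataclass


@dataclass
class MarkupRecord:
    doc_id: str      # hypothetical document identifier
    characters: int  # number of characters in the document
    minutes: float   # mark-up time; the unit is an assumption


def markup_metrics(record):
    """Return the two quantities plotted in the chart for one document."""
    if record.characters <= 0:
        raise ValueError("document must contain at least one character")
    return {
        # Time per character, scaled by 1000 (time per 1000 characters).
        "time_per_char_x1000": record.minutes * 1000 / record.characters,
        # Document size in thousands of characters.
        "chars_per_1000": record.characters / 1000,
    }


if __name__ == "__main__":
    # Invented values, for illustration only.
    records = [
        MarkupRecord("doc-a", 20000, 50.0),
        MarkupRecord("doc-b", 4000, 30.0),
    ]
    for rec in records:
        print(rec.doc_id, markup_metrics(rec))
```

Scaling both quantities to similar magnitudes is what allows them to share one logarithmic axis, so effort per unit of text can be compared across documents of very different length.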

Page 24

Literature & interoperability

Additional products
• Training course for literature mark-up to get the community involved
• Creating a neotropical catalogue of the ants using the mark-up approach
• Development of metrics to measure mark-up production to optimize output for users (ecologists, taxonomists, etc.)
• Experience: mark-up is expensive....

Page 25

Literature & interoperability

How to best invest in the digitization of legacy publications?

[Figure: value for scientists plotted against costs for different digitization products. Products shown: print; print + catalogue; pdf; image; ocr dirty; ocr clean; pdf/ocr; struct. xml; semant. xml; semant. xml high; s-xml linked; database. Annotations: names marked up; treatments marked up; finer-grained mark-up; "?".]

Page 26

[Figure: publication workflow diagram. Manuscript labels: analysis & ms preparation; ms submission ("Taxon-x-version"); new ms alert; posting for review; edited ms; revised ms; accepted ms; new taxon alert; publication: pdf; publication: hard copy; publication database ("taxon-x-version"). Connected resources: ontology; bibliography; ZooBank / NS; Taxon DB; Character DB; Specimen DB; Description DB; Distribution DB; Char. Matrix DB; Phyl. Tree DB; Char-state Im.; Specimen Im.; Habitat Image; Leg. Publicat. Arrow labels: new data; feedback.]

….. to the Future of Publication: publication as a version control instrument