20090921 art databanken agosti final

Literature and XML: or How to Have More Time to Think Donat Agosti Plazi ArtDataBanken Stockholm, Sept 21, 2009

Upload: agosti

Post on 10-May-2015


DESCRIPTION

Lecture presented at the ArtDataBanken meeting in Stockholm on September 21, 2009

TRANSCRIPT

Page 1: 20090921 Art Databanken Agosti Final

Literature and XML: or How to Have More Time to Think

Donat Agosti, Plazi

ArtDataBanken Stockholm, Sept 21, 2009

Page 2: 20090921 Art Databanken Agosti Final

Who is this? What do I know about her?

Where does she live?

Who are you? What do you do?

Where are you from?

Page 3: 20090921 Art Databanken Agosti Final

The answers are in several hundred million pages of printed species descriptions in our libraries, including the descriptions and redescriptions of an estimated 1.8M species, and an estimated 50K new (re-)descriptions annually.

Page 4: 20090921 Art Databanken Agosti Final

Taxonomists at work ……

T. E. Lawrence: Seven Pillars of Wisdom – a triumph. 1st published for general circulation, 1935: p. 535

Page 5: 20090921 Art Databanken Agosti Final

The traditional flux of information …

…a more or less closed, intransient system

Page 6: 20090921 Art Databanken Agosti Final

What does this have to do with XML and semantic, enhanced documents?

Page 7: 20090921 Art Databanken Agosti Final

Access

Page 8: 20090921 Art Databanken Agosti Final

Scanning

Pdf-conversion

(WWW)

Page 9: 20090921 Art Databanken Agosti Final

Before antbase.org, Harvard's Museum of Comparative Zoology could claim to be the only location with a complete set of ant systematics publications from 1758 to the present.

Through antbase.org's digital library, this body of literature is accessible worldwide, and it is actively used (>10,000 visits in a single month).

Page 10: 20090921 Art Databanken Agosti Final

The Biodiversity Heritage Library is currently digitizing and making accessible >100 million pages, most of them out of copyright, i.e. older than 1925. ........ to be finished in 2048...

Page 11: 20090921 Art Databanken Agosti Final

Access to ant taxonomic publications through antbase.org / Smithsonian Institution, currently including the entire body of non-copyrighted publications since 1758 (>4,000 publications or 85,000 pages)

Page 12: 20090921 Art Databanken Agosti Final

Can taxonomic work be copyrighted?

Copyright legislation is national but is based on the Berne Convention for the Protection of Literary and Artistic Works which defines a minimal standard. This international copyright standard does not require the recognition of treatments, the building stones of taxonomic publications, as works.

Page 13: 20090921 Art Databanken Agosti Final

“work” does not mean “text”, does not mean “data”, does not mean “information”. A “work” is something more. That kind of something more has many different definitions in the various legislations, but it is always there: It may be called originality, individuality, creation, personal expression, creative shaping or anyhow else, but it is a condition for qualifying a product as a work: “Work” is an intellectual product that is in a certain sense particular, individual, original, new. (Egloff: EDIT IPR and Copyright, 2008)

Page 14: 20090921 Art Databanken Agosti Final

Taxonomic treatments are highly structured and homogeneous, part of a global >100 million page corpus growing at a rate of ca. 20,000 new species descriptions per year, not counting 5 times as many redescriptions. Their structure is tightly controlled by a peer review process that enforces standards and a domain-specific vocabulary; they are written not as poems or in flowery language but in scientific jargon.

Treatments do not qualify as work.

The publications including the treatments might. (Egloff: EDIT IPR and Copyright, 2008)

Page 15: 20090921 Art Databanken Agosti Final

It is about digesting millions of pages:

>>100 M pages taxonomic literature

25M scientific publications / year

25K journals

>1K with taxonomic descriptions

20K descriptions of new species / year

Page 16: 20090921 Art Databanken Agosti Final

Is this the access we need?!

Page 17: 20090921 Art Databanken Agosti Final

No, we need open access to content, not the PDF per se.

Page 18: 20090921 Art Databanken Agosti Final

It is about machines (not us) doing a great deal of the work for us: extracting data, formulating hypotheses, ....

Page 19: 20090921 Art Databanken Agosti Final

It is about data and information in context

Page 20: 20090921 Art Databanken Agosti Final

"Nothing makes sense in biology except in the light of treatments."

Page 21: 20090921 Art Databanken Agosti Final

An example from the Neurocommons text mining pilot:

• PubMed abstracts: > 16,000,000
• CNS classified abstracts: 874,727
• text mining recognized: 368,688
• text mining processed: 94,381
• extracted graph of 30,000+ relationships and 5,500 genes and proteins

“protein-protein interaction networks” John Wilbanks, Neurocommons

Page 22: 20090921 Art Databanken Agosti Final
Page 23: 20090921 Art Databanken Agosti Final

In a semantic Web environment (where machines talk to each other and do most of our work), data need to be able to talk to each other:

27,266 papers

4,563 papers

41,985 papers

10,365 papers

128,437 papers

“protein-protein interaction networks” John Wilbanks, Neurocommons

Page 24: 20090921 Art Databanken Agosti Final

Relational to Ontological Mapping

[Diagram. Classes: Drug, Neuron, PathologicalAgent, Receptor, Channel, Agent, NeuronalProperty, PathologicalChange, Compartment. Relations: inhibits, involves, has, is_located_in.]

Slide courtesy of Kei Chung, Yale

Page 25: 20090921 Art Databanken Agosti Final

It will open up scientific literature for data mining

“protein-protein interaction networks” John Wilbanks, Neurocommons

Page 26: 20090921 Art Databanken Agosti Final

TREATMENT Cremastogaster mimosae
Likely Diagnostically Related to: Cremastogaster tricolor
Likely Diagnostically Related to: Cremastogaster tricolor
Likely Diagnostically Related to: Cremastogaster amabilis
Likely Diagnostically Related to: Cremastogaster tricolor
Likely Diagnostically Related to: Cremastogaster amabilis
Associated with: Acacia sienocarpa
Living in: Mombasa
Living in: Tanga
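The relations mined from this treatment are, in effect, subject-predicate-object statements. A minimal sketch of that idea in Python, with repeated statements collapsed; the `Triple` type, the predicate names and the `objects_of` helper are illustrative assumptions, not part of any Plazi tool:

```python
from collections import namedtuple

# Hypothetical representation: one (subject, predicate, object) statement per relation.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

SUBJECT = "Cremastogaster mimosae"

# Statements taken from the slide; the predicate spellings are assumptions.
triples = [
    Triple(SUBJECT, "likely_diagnostically_related_to", "Cremastogaster tricolor"),
    Triple(SUBJECT, "likely_diagnostically_related_to", "Cremastogaster amabilis"),
    Triple(SUBJECT, "associated_with", "Acacia sienocarpa"),
    Triple(SUBJECT, "living_in", "Mombasa"),
    Triple(SUBJECT, "living_in", "Tanga"),
]


def objects_of(statements, subject, predicate):
    """Return all objects linked to `subject` by `predicate`, in input order."""
    return [t.obj for t in statements
            if t.subject == subject and t.predicate == predicate]


if __name__ == "__main__":
    print(objects_of(triples, SUBJECT, "living_in"))
```

Once statements have this shape, a question such as "where does this species live?" becomes a simple filter rather than a reading task.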

Page 27: 20090921 Art Databanken Agosti Final

It is more: it is about access to the original or source data

Page 28: 20090921 Art Databanken Agosti Final

The semantically enhanced treatments, extracted, stored on Plazi.org, and served in a human readable form, are linked to the underlying data: Fisher & Smith, 2008, PLoS ONE.

Page 29: 20090921 Art Databanken Agosti Final

Semantic, enhanced treatments do the job ...

Page 30: 20090921 Art Databanken Agosti Final

... and XML is one way to go.

Page 31: 20090921 Art Databanken Agosti Final

XML

XML stands for EXtensible Markup Language

XML is a markup language much like HTML

XML was designed to carry data, not to display data

XML tags are not predefined. You must define your own tags (schema)

XML is designed to be self-descriptive

XML is a W3C Recommendation
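To make "carry data" and "define your own tags" concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The tag names `specimen`, `genus` and `species` are invented for this illustration and belong to no particular schema:

```python
import xml.etree.ElementTree as ET

# Self-defined tags: XML itself predefines none of these names.
DOC = """<specimen>
  <genus>Azteca</genus>
  <species>instabilis</species>
</specimen>"""


def to_record(xml_text):
    """Turn each child element into a key/value pair."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


if __name__ == "__main__":
    print(to_record(DOC))
```

Because the tags say what each value is, a program can pick out the genus without knowing anything about how the text was laid out on a page.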

Page 32: 20090921 Art Databanken Agosti Final

XML

Being open and non-proprietary, XML is an optimal archival format for the treatment/publication

Being a stable and rich data format, XML can be repurposed for a variety of uses

Page 33: 20090921 Art Databanken Agosti Final

XML

XML application design is an art in itself .... and thus cannot be explained in 15 minutes

There are plenty of resources on the Web for diving into XML, e.g. http://www.w3schools.com/, etc.

Page 34: 20090921 Art Databanken Agosti Final

This means developing a schema that models the logical content (e.g. TaxonX) and inserting tags that define what a word means, so that a computer can understand it as well. To ensure that everybody is talking about the same species, the name can be linked to a reference name server.

Azteca instabilis

would then read like:

<tax:name>                              <!-- TaxonX schema -->
  <tax:xmldata>                         <!-- normalization of data -->
    <dc:Genus>Azteca</dc:Genus>         <!-- external schema -->
    <dc:Species>instabilis</dc:Species>
  </tax:xmldata>
  Azteca instabilis
</tax:name>
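A hedged sketch of how a program might read such a marked-up name, using Python's standard library. The slide does not show namespace declarations, so the two URIs below are placeholders chosen for this example; a real TaxonX document declares its own:

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions, not the real schema locations).
NS = {"tax": "http://example.org/taxonx", "dc": "http://example.org/dc"}

XML = """<tax:name xmlns:tax="http://example.org/taxonx"
          xmlns:dc="http://example.org/dc">
  <tax:xmldata>
    <dc:Genus>Azteca</dc:Genus>
    <dc:Species>instabilis</dc:Species>
  </tax:xmldata>
  Azteca instabilis
</tax:name>"""


def read_name(xml_text):
    """Return the normalized genus/species plus the verbatim name string."""
    name = ET.fromstring(xml_text)
    genus = name.findtext("tax:xmldata/dc:Genus", namespaces=NS)
    species = name.findtext("tax:xmldata/dc:Species", namespaces=NS)
    # The verbatim text follows the <tax:xmldata> child, so it is that child's tail.
    verbatim = (name.find("tax:xmldata", NS).tail or "").strip()
    return {"genus": genus, "species": species, "verbatim": verbatim}


if __name__ == "__main__":
    print(read_name(XML))
```

The normalized parts and the verbatim string are kept side by side, so the original wording of the publication is never lost.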

Page 35: 20090921 Art Databanken Agosti Final

This can also be applied to entire sections of text, such as the treatment of a species and its parts.

<tax:treatment>
  <tax:nomenclature>
    <tax:name>
      <tax:xid source="HNS" identifier="193329"/>
      <tax:xmldata>
        <dc:Genus>Mystrium</dc:Genus>
        <dc:Species>leonie</dc:Species>
      </tax:xmldata>
      Mystrium leonie
    </tax:name>
    <tax:status>n. sp.</tax:status>
    Fig 1 D - F
  </tax:nomenclature>
  <tax:div type="description">
    <tax:p>HOLOTYPE WORKER: TL 3.95, HL 1.02, HW 0.95, CI 93, SL 1.30, SI 137, PW 0.73, ML 0.38. Mandible outer margin strongly curving to a sharp apical tooth, the apex parallel to the anterior clypeal margin. (Holotype with material in mandibles, so mandibles and anterior clypeus described below from paratypes.) Median clypeus....</tax:p>
  </tax:div>
</tax:treatment>
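Once a description paragraph is isolated by markup, its contents can be mined further. As an illustration only, the sketch below pulls the holotype measurements out of the paragraph text with a regular expression; the pattern (two capital letters followed by a number) is an assumption that fits this example and is not a general rule for taxonomic descriptions:

```python
import re

DESCRIPTION = (
    "HOLOTYPE WORKER: TL 3.95, HL 1.02, HW 0.95, CI 93, SL 1.30, "
    "SI 137, PW 0.73, ML 0.38."
)

# Assumed pattern: a two-letter upper-case abbreviation, whitespace, then a number.
MEASUREMENT = re.compile(r"\b([A-Z]{2})\s+(\d+(?:\.\d+)?)")


def extract_measurements(text):
    """Map each measurement abbreviation to its numeric value."""
    return {abbr: float(value) for abbr, value in MEASUREMENT.findall(text)}


if __name__ == "__main__":
    for abbr, value in extract_measurements(DESCRIPTION).items():
        print(abbr, value)
```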

Page 36: 20090921 Art Databanken Agosti Final

Globally unique identifiers (e.g. LSIDs) link up data

Page 37: 20090921 Art Databanken Agosti Final

LSID for scientific publications
LSID for treatments
LSID for names (ZooBank / HNS ..)
LSID for specimens
LSID for DNA sequences / characters (ontologies)
LSID for repositories
GPS fixes for locations

Page 38: 20090921 Art Databanken Agosti Final

Azteca instabilis

would then read like:

<tax:name>
  <!-- link to external database -->
  <tax:xid source="LSID" identifier="urn:lsid:biosci.ohio-state.edu:osuc_concepts:13452"/>
  <!-- normalization of data -->
  <tax:xmldata>
    <dc:Genus>Azteca</dc:Genus>
    <dc:Species>instabilis</dc:Species>
  </tax:xmldata>
  Azteca instabilis
</tax:name>
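An LSID has the general shape `urn:lsid:<authority>:<namespace>:<object>` with an optional `:<revision>`. The following sketch splits one into its parts; the `LSID` type and `parse_lsid` function are illustrative names, and the example assumes the identifier on the slide is meant to have a colon between authority and namespace, as the LSID syntax requires:

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID into authority, namespace, object and optional revision."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:])
    )
    if not well_formed:
        raise ValueError(f"not a valid LSID: {text!r}")
    return LSID(*parts[2:])


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:biosci.ohio-state.edu:osuc_concepts:13452"))
```

The authority part tells a resolver which server to ask, which is what lets a name in a publication be linked to an external name server.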

Page 39: 20090921 Art Databanken Agosti Final

We need XML schemas and tools to convert and expose semantically enhanced documents.

Page 40: 20090921 Art Databanken Agosti Final

Plazi workflow: overview

Plazi deliverables

TaxonX XML schema

GoldenGate

DSpace application

eXist application

SRS

Exchange protocols (SPM, TAPIR, REST)

Page 41: 20090921 Art Databanken Agosti Final

Plazi workflow: GoldenGate mark-up as an example

- Get LSIDs for names from the Hymenoptera Name Server; ZooBank?
- Add new names
- Get bibliographic metadata from HNS (MODS)
- Get bibliographic GUIDs from bioguid (or EDIT?)
- Get geographic long/lat from geonames.org
- Get GUIDs for: CBOL, NCBI, specimens, images, .....

Page 42: 20090921 Art Databanken Agosti Final

Plazi Search and Retrieval Server: Access to data

[Diagram: the server exposes data via TAPIR and SPM to "you", whether human or machine.]

Page 43: 20090921 Art Databanken Agosti Final

Materials examined from literature in GBIF

Page 44: 20090921 Art Databanken Agosti Final

Plazi workflow: content

11,000 descriptions online
500 publications
4,500 publications

Handle, SPM and TAPIR services
Feeds into HNS and ZooBank (soon)
Is harvested by GBIF, EOL
Support from GBIF, EOL, US-NSF, DFG

Page 45: 20090921 Art Databanken Agosti Final

Does the retro mark-up process scale up to the millions of pages needed to be processed?

Only partially: Mark up takes about 5min/page: For 100 M pages = 700 man years (but it is only a first tool...)

Page 46: 20090921 Art Databanken Agosti Final

Does the mark-up process scale up to the millions of pages that need to be processed?

Only partially: Mark up takes about 5min/page: For 100 M pages = 700 man years (but it is only a first tool...); wizards can reduce the time by several factors

But: How much does it cost to digitize specimens, and what is its quality?
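How large the effort looks depends strongly on what counts as a year of work. The quick check below uses the slide's 5 minutes per page and 100 M pages; the two definitions of a year (round the clock, and an assumed 220 working days of 8 hours) are assumptions made for this sketch, not figures from the talk. Both come out above the 700 quoted on the slide, which presumably rests on different assumptions:

```python
PAGES = 100_000_000       # from the slide
MINUTES_PER_PAGE = 5      # from the slide


def person_years(pages, minutes_per_page, hours_per_year):
    """Total mark-up time expressed in years of `hours_per_year` hours each."""
    total_hours = pages * minutes_per_page / 60
    return total_hours / hours_per_year


# Assumption 1: a year of uninterrupted work (24 h x 365 days = 8,760 h).
round_the_clock = person_years(PAGES, MINUTES_PER_PAGE, 24 * 365)

# Assumption 2: a working year of 220 days at 8 h (1,760 h).
working_year = person_years(PAGES, MINUTES_PER_PAGE, 8 * 220)

if __name__ == "__main__":
    print(f"round the clock: {round_the_clock:,.0f} years")
    print(f"working years:   {working_year:,.0f} years")
```

Either way the order of magnitude supports the slide's point: manual retro mark-up alone does not scale, and tooling has to cut the per-page time.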

Page 47: 20090921 Art Databanken Agosti Final

The cost of converting legacy publications can be avoided by producing marked-up publications up-front

Page 48: 20090921 Art Databanken Agosti Final

The NLM/TaxonX schema allows publishers to maintain richly encoded articles whose data can be distributed and presented in multiple formats for a variety of uses.
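The idea of one encoded source feeding several presentations can be sketched in a few lines. The element names below are simplified stand-ins invented for this example, not the actual NLM/TaxonX vocabulary, and the two output functions are only meant to show the principle:

```python
import json
import xml.etree.ElementTree as ET
from html import escape

# Simplified, hypothetical markup; a real article would use the NLM/TaxonX tags.
SOURCE = """<treatment>
  <name><genus>Mystrium</genus><species>leonie</species></name>
  <status>n. sp.</status>
</treatment>"""


def to_record(xml_text):
    """Read the single XML source into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "genus": root.findtext("name/genus"),
        "species": root.findtext("name/species"),
        "status": root.findtext("status"),
    }


def to_html(record):
    """Human-readable presentation."""
    name = escape(f"{record['genus']} {record['species']}")
    return f"<p><i>{name}</i> {escape(record['status'])}</p>"


def to_json(record):
    """Machine-readable presentation."""
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    record = to_record(SOURCE)
    print(to_html(record))
    print(to_json(record))
```

Both outputs are derived from the same source, so a correction made once in the XML shows up in every format.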

Page 49: 20090921 Art Databanken Agosti Final

NLM/Taxonx XML Document

Print

Page 50: 20090921 Art Databanken Agosti Final

NLM/Taxonx XML Document

PDF

Print

Page 51: 20090921 Art Databanken Agosti Final

NLM/Taxonx XML Document

HTML

PDF

Print

Page 52: 20090921 Art Databanken Agosti Final

NLM/TaxonX XML Document

Outputs: HTML, SPM/RDF, PDF, Print

Page 53: 20090921 Art Databanken Agosti Final

NLM/TaxonX XML Document

Outputs: HTML, SPM/RDF, PDF, Print, Database, HTML/species page (e.g. EOL, Scratchpads)

Page 54: 20090921 Art Databanken Agosti Final

NLM/TaxonX XML Document

Outputs: HTML, SPM/RDF, PDF, Print, Database, HTML/species page (e.g. EOL, Scratchpads), LSID resolver

Page 55: 20090921 Art Databanken Agosti Final

NLM/TaxonX XML Document

Outputs: HTML, SPM/RDF, PDF, Print, Database, HTML/species page (e.g. EOL, Scratchpads), Google, data mining, ..., LSID resolver

Page 56: 20090921 Art Databanken Agosti Final

Semi-automatically generated semantic, enhanced e-publications are the only way to describe the missing 10 M species, and to deal with an increasing flood of data.

Page 57: 20090921 Art Databanken Agosti Final

The future of publications: the publication semi-automatically generated

[Workflow diagram. Manuscript stages: analysis & ms preparation; ms submission ("Taxon-x-version"); new ms alert; posting for review; edited ms; revised ms; accepted ms; new taxon alert; publication as PDF and hard copy; publication database ("taxon-x-version"); feedback; new data.]

[Connected resources: ontology; bibliography; ZooBank / NS; Taxon DB; Character DB; Specimen DB; Description DB; Distribution DB; Char. Matrix DB; Phyl. Tree DB; char-state images; specimen images; habitat images; legacy publications.]

Page 58: 20090921 Art Databanken Agosti Final

Journal authoring and production workflow

[Diagram. Elements: Word MS, DB, input forms, export, convert, NLM TaxPub, InDesign. Roles: author, publisher.]

Ctd.

Page 59: 20090921 Art Databanken Agosti Final

NLM/TaxonX XML Document

Outputs: HTML, SPM/RDF, PDF, Print, Database, HTML/species page (e.g. EOL, Scratchpads), Google, data mining, ..., LSID resolver

Ctd.

Page 60: 20090921 Art Databanken Agosti Final

Journal authoring and production workflow: What is missing?

[Diagram. Elements: Word MS, DB, input forms, export, convert, NLM TaxPub, InDesign. Roles: author, publisher.]

Legend: available, prototypes, to be developed

Page 61: 20090921 Art Databanken Agosti Final

Where do we stand?

2008

LSIDs, external links

Page 62: 20090921 Art Databanken Agosti Final

Where do we stand?

2008

LSIDs, external links, XML

Page 63: 20090921 Art Databanken Agosti Final

Where do we stand?

2008

Page 64: 20090921 Art Databanken Agosti Final

Where do we stand?

2009

LSIDs, external links, external data via doi, export services

Page 65: 20090921 Art Databanken Agosti Final

Where do we stand?

2009

LSIDs, external links

Page 66: 20090921 Art Databanken Agosti Final

Recommendations:

Individual level
• Ensure that all you do is open access
• Understand copyright; do not be afraid of it
• Self-archive (the Green Road)
• Create content for the Web

Page 67: 20090921 Art Databanken Agosti Final

Self archive (the Green Road): UNIZ as one of the global leaders in self archiving

Page 68: 20090921 Art Databanken Agosti Final

Recommendations:

Individual level
• Ensure that all you do is open access
• Understand copyright; do not be afraid of it
• Self-archive (the Green Road)
• Don't sign any contracts giving away rights
• Talk to your scientific societies and museum to adopt a policy that at least allows self-archiving
• Demonstrate the power of access through innovative research projects and data: research will be the only motivation to change law and build up infrastructure

Page 69: 20090921 Art Databanken Agosti Final

OECD Declaration for Access to Research Data from Public Funding

(Spring 2007)

How to implement this?

Page 70: 20090921 Art Databanken Agosti Final

Recommendations:

• Ensure that all you do is open access
• Understand, adopt and propagate an adequate copyright policy
• Talk to your scientific societies and museum to adopt a policy that at least allows self-archiving
• Talk to your publishers about moving to XML publishing
• Support the emergence of standards and transfer protocols

Page 71: 20090921 Art Databanken Agosti Final

Recommendations (ctd.):
• Science policy has to change to build and maintain the necessary cyberinfrastructure, much as libraries were built
• Prospective publications must be structured so that machines can read and understand them.
• Copyright must be adjusted to accommodate ingenious and best usage of our shared knowledge, for example by using Creative Commons licences or applying the principles of the Conservation Commons.
• Sharing data has to become standard practice among scientists

Page 72: 20090921 Art Databanken Agosti Final


antbase.org: Free access as a foundation…

Page 73: 20090921 Art Databanken Agosti Final

http://plazi.org

Thank you very much!

Donat Agosti
