TRANSCRIPT
An iDigBio Perspective on Darwin Core Archives
Alex Thompson, Andréa Matsunaga, José Fortes
Supported by NSF Award EF-1115210
2013 TDWG Conference
http://www.idigbio.org
Advanced Computing and Information Systems laboratory
iDigBio’s Use Case
iDigBio is a distributed, schema-less database comprised largely of specimen and media records. This is a level of abstraction away from most existing databases, which seek to track the specimen or media itself. This may seem like a minor distinction, but it means that, for example, iDigBio doesn’t use DwC’s occurrenceID or Audubon Core’s dcterms:identifier field as our primary identifier. We either use a provided record identifier, or construct one from provided information (ex. datasetID+occurrenceID).
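The identifier rule above can be sketched in a few lines of Python. This is a minimal illustration, not iDigBio's actual code: the `recordID` field name, the `make_record_id` function, and the literal `+` separator (borrowed from the slide's notation) are all assumptions.

```python
def make_record_id(record, dataset_id):
    """Prefer a provided record identifier; otherwise build datasetID+occurrenceID.

    `recordID` is a hypothetical field name used only for this illustration.
    """
    provided = (record.get("recordID") or "").strip()
    if provided:
        return provided
    occurrence_id = (record.get("occurrenceID") or "").strip()
    if not occurrence_id:
        raise ValueError("record has neither a record identifier nor an occurrenceID")
    # The "+" separator mirrors the slide's notation and is an assumption.
    return f"{dataset_id}+{occurrence_id}"
```

Because the constructed form is scoped by the dataset, two providers can reuse the same occurrenceID without their records colliding.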
http://icons.iconarchive.com/, http://www.cagrid.org, http://en.wikipedia.org
iDigBio’s Use Case (cont.)
In practice, this means:
• iDigBio can collect information from a variety of sources about the same occurrenceID (or other identifier)
• Eventually, we will be integrating all of the available information about all of the identifiers we know about into a single view
• Darwin Core Archives, as most people use them (specifically as generated by IPT), are somewhat cumbersome to use for our ideal use case
• There is a non-trivial chance that the id field of the core file could be non-unique (due to the use case, or due to database mergers)… which would be fine in the iDigBio data model, but leaves us no effective way to communicate record identifiers (normally done with resource relationships)
DwC-A as a Format
• Pros:
  • Provides a fairly space-efficient way to transmit data
  • Provides descriptive metadata for both the dataset (via EML) and for the data files themselves
  • Properly declares file encodings – a huge issue for pure-text formats
  • Designed to be extensible
• Cons:
  • Uses multiple standards to represent information
    • zip, xml, tsv/csv, a variety of character encodings
    • A minor issue, but does increase the number of dependencies and programming complexity
  • Perhaps not as prescriptive as it could be
    • Meta.xml file’s location within the zip file is not specified
    • Allows paths within the zip file for data files, as well as other options like URLs for data files
    • Somewhat needlessly complicates building fully compliant implementations
    • Separator/format for fields with lists is loosely defined
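To illustrate the "not as prescriptive as it could be" point, here is a rough Python sketch of what a reader has to do when meta.xml may sit at any depth in the zip. The function names are hypothetical. The sketch assumes the usual Darwin Core text namespace, resolves the core file relative to wherever meta.xml was found, and does not handle data files given as URLs.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by DwC-A meta.xml descriptors.
DWCA_NS = "{http://rs.tdwg.org/dwc/text/}"


def find_meta_xml(archive):
    """Return the name of the first meta.xml entry at any depth, or None."""
    for name in archive.namelist():
        if posixpath.basename(name).lower() == "meta.xml":
            return name
    return None


def core_file_location(archive):
    """Return the zip entry name of the core data file declared in meta.xml."""
    meta_name = find_meta_xml(archive)
    if meta_name is None:
        raise ValueError("no meta.xml found in archive")
    root = ET.fromstring(archive.read(meta_name))
    location = root.find(f"{DWCA_NS}core/{DWCA_NS}files/{DWCA_NS}location")
    if location is None or not location.text:
        raise ValueError("meta.xml declares no core file location")
    # Assumption: the location is a path relative to meta.xml, not a URL.
    return posixpath.join(posixpath.dirname(meta_name), location.text.strip())
```

A reader would open the archive with `zipfile.ZipFile(path)` and pass it to `core_file_location`. A stricter specification would let the search loop be replaced by a single fixed path.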
Tool Support for DwC-A
• Pros:
  • Strong support from GBIF with IPT, Validator, Harvesting Toolkit, and Java libraries
  • Is gaining acceptance to the point where most tool vendors have at least some path for getting data out of a database in DwC-A (at least guidance on how to use IPT for export)
  • Standard is simple enough that files can be generated by hand from other data sources
  • A handful of other open-source implementations – but none are as complete as GBIF’s
    • GNA has a Ruby gem (dwc-archive)
    • Belgian Biodiversity Platform has a Python reader (https://github.com/BelgianBiodiversityPlatform/python-dwca-reader)
• Cons:
  • IPT is strongly tied to GBIF’s use case
    • Only supports Taxon and Occurrence as core
    • Extensions must be hosted by GBIF
    • Alternative, sometimes competing extensions
  • Only one full implementation of the DwC-A spec
DwC-A Standards
• Pros:
  • Active standards bodies (TDWG!) working to maintain and improve core information standards on which the format relies
  • Open and well-documented standards process, clear lines of communication with standards maintainers for implementation guidance
• Cons:
  • Most current standards activities are focused on semantic definitions, not content definitions (defined types other than strings).
Challenges
• Media-Only Collections
  • No direct support from IPT
  • Can create a stub specimen record to link to
• Multiple Specimens Per Image
  • No direct support from IPT
  • Can use resource relationship
  • Further compounded with many-to-many relationships
• Record identifiers with non-existent or duplicate occurrenceIDs
  • If not using IPT, can use a non-standard field
  • Could also potentially use dynamicProperties
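As a rough illustration of the dynamicProperties option, the sketch below stores a record identifier inside that field, assuming the field holds a JSON object. The `add_record_id` helper and the `idigbioRecordID` key are hypothetical names chosen for this example, not an agreed convention.

```python
import json


def add_record_id(dynamic_properties, record_id, key="idigbioRecordID"):
    """Return a dynamicProperties JSON string with record_id stored under key.

    The key name is an assumption made for this sketch.
    """
    props = json.loads(dynamic_properties) if dynamic_properties.strip() else {}
    if not isinstance(props, dict):
        raise ValueError("dynamicProperties must hold a JSON object")
    props[key] = record_id
    return json.dumps(props, sort_keys=True)
```

Existing keys in the field are preserved, so a provider that already uses dynamicProperties for other values does not lose them.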
Challenges Cont.
The MISC Data Model
Broader Barriers – Identifiers
• The lack of strong standards for occurrenceIDs presents difficulties at every step. The same is true for other data types.
• Progress is being made, though
  • Specify 6.5 added strong identifiers to everything
  • iDigBio is working with the EMu user group to get identifiers into EMu
  • Symbiota has added record identifiers to all their collections and is working on getting the specimen identifiers from collections that have them.
• Tool providers and developers should start pushing for strong identifiers whenever possible
Broader Barriers – Formatting
• Much like identifiers, the lack of defined data formats can seriously hinder data use.
• Where possible, standards bodies should reference well-defined standards for fields
• Even without action from standards bodies, tool providers should at least incorporate the ability to reference known standards for formats
  • ISO 3166-1 alpha-3 for countries
  • ISO 8601 for dates
  • ISO 639-2 for languages
  • JSON for hashes or array fields
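Two of the formats listed above can be illustrated with a short Python sketch: normalizing dates to ISO 8601 and encoding list-valued fields as JSON arrays. The accepted input formats and the `|` separator are assumptions picked for the example. Real data would need a provider-specific list, since a value such as 03/04/2013 is ambiguous without one.

```python
import json
from datetime import datetime

# Assumed input formats; month-first is a guess that must be confirmed per provider.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def to_iso8601_date(text):
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def list_field_to_json(text, separator="|"):
    """Split a delimited list field and return it as a JSON array string."""
    items = [item.strip() for item in text.split(separator) if item.strip()]
    return json.dumps(items)
```

Once a field is declared as a JSON array, a consumer no longer has to guess the separator, which is the gap noted earlier for list fields.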
What iDigBio Does Now
• Specimens and Media
  • Specimens as the Core
  • Media in an Audubon Core extension
    • Linked via coreid
  • Record IDs associated via resource relationship
• Media Only
  • CSV files with Audubon Core fields
  • No structure description
  • No EML
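The coreid linkage can be sketched as a simple join in Python. This is an illustration only: it assumes both files are CSV with header rows named `id` and `coreid`, whereas a real archive declares column positions in meta.xml. The `group_media_by_coreid` name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_media_by_coreid(core_text, media_text):
    """Map each core id to (core row, list of media rows whose coreid matches)."""
    core_rows = {row["id"]: row for row in csv.DictReader(io.StringIO(core_text))}
    media = defaultdict(list)
    for row in csv.DictReader(io.StringIO(media_text)):
        media[row["coreid"]].append(row)
    return {core_id: (row, media.get(core_id, [])) for core_id, row in core_rows.items()}
```

Note that building `core_rows` as a dictionary silently keeps only the last row for a repeated id, which is one concrete way a non-unique core id loses information.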
iDigBio’s near future plans
• Expand the use of resource relationship, measurement or fact, and dynamic properties to provide more ways to specify non-standard properties as first-order properties.
  • Probably move to using dynamicProperties to provide record ids (only fixes Darwin Core though).
• Build a full Python implementation of a reader and writer that supports an arbitrary core type and any number of extensions.
  • Possibly ship core & extension schemas, when not available on the internet already, in the DwC-A
  • Add an option to enforce, or warn about violations of, a uniqueness constraint on the core id field
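The "enforce or warn" option might look roughly like the following Python sketch. The `check_unique_ids` function and its `mode` parameter are hypothetical and do not describe any existing reader.

```python
import warnings
from collections import Counter


def check_unique_ids(ids, mode="warn"):
    """Return the sorted duplicated ids; warn or raise depending on mode.

    mode="warn" emits a warning, mode="enforce" raises ValueError.
    """
    if mode not in ("warn", "enforce"):
        raise ValueError(f"unknown mode: {mode!r}")
    duplicates = sorted(value for value, count in Counter(ids).items() if count > 1)
    if duplicates:
        message = f"{len(duplicates)} duplicated core id value(s), e.g. {duplicates[:5]}"
        if mode == "enforce":
            raise ValueError(message)
        warnings.warn(message)
    return duplicates
```

Warning by default keeps archives with duplicate ids loadable, which fits a data model that tolerates them, while enforce mode suits pipelines that rely on the id as a key.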
iDigBio’s suggestions for DwC-A
• Ditch / Minimize Core
  • Replace the core concept with a static backbone schema that only includes
    • dcterms:identifier (enforced to be locally unique, recommended to be globally unique)
    • dcterms:modified (in ISO 8601)
    • dcterms:type
    • A deleted flag (to enable explicit delete signaling)
  • Move all inter-type relationships to a relationships extension like resource relationship with both ends of the relationship pointing at dcterms:identifier in the core (tools should enforce referential integrity)
    • Can do many-to-many with a groups extension and membership relationships.
  • All data is now an extension to a minimal core
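A minimal Python sketch of the proposed backbone follows. The class and attribute names are illustrative assumptions, not part of any standard. It shows a four-field core record and a referential-integrity check over a relationships extension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreRecord:
    identifier: str        # dcterms:identifier, locally unique
    modified: str          # dcterms:modified, as an ISO 8601 string
    record_type: str       # dcterms:type
    deleted: bool = False  # explicit delete signal


@dataclass(frozen=True)
class Relationship:
    subject_id: str  # must match a core identifier
    predicate: str   # relationship name; the vocabulary is left open here
    object_id: str   # must match a core identifier


def dangling_relationships(core, relationships):
    """Return relationships whose subject or object is missing from the core."""
    known = {record.identifier for record in core}
    return [
        rel for rel in relationships
        if rel.subject_id not in known or rel.object_id not in known
    ]
```

With every relationship pointing at core identifiers on both ends, a many-to-many link such as one image showing several specimens becomes several ordinary rows in the relationships extension.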
DwC-A Implementation WG
• Form a multi-institution implementation working group with the goal of taking the current or a new standard and building full implementations (readers and writers) in multiple languages
  • Common test specifications, reference files, etc.
  • Release all the implementations on a common GitHub (or Google Code) repository so that they can be easily reused
Thanks for listening
• Special thanks to GBIF, TDWG, and the entire community for laying a great foundation to build on.
• Also thanks to Tim Robertson, Aaron Steele, and the other attendees of iDigBio’s IT standards workshop for all the valuable advice and steering us in the right directions.