An iDigBio Perspective on Darwin Core Archives
Alex Thompson, Andréa Matsunaga, José Fortes
Supported by NSF Award EF-1115210
2013 TDWG Conference
http://www.idigbio.org


Author: Alex Thompson. Created Date: 10/30/2013 3:16:50 PM



Advanced Computing and Information Systems laboratory

iDigBio’s Use Case

iDigBio is a distributed, schema-less database composed largely of specimen and media records. This is a level of abstraction away from most existing databases, which seek to track the specimen or media itself. This may seem like a minor distinction, but it means that, for example, iDigBio doesn’t use DwC’s occurrenceID or Audubon Core’s dcterms:identifier field as its primary identifier. We either use a provided record identifier or construct one from provided information (e.g. datasetID + occurrenceID).
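As a rough illustration of the identifier rule described above, the following Python sketch prefers a provided record identifier and otherwise builds one from datasetID and occurrenceID. The field name "recordID", the ":" separator, and the function name are assumptions made for this example only; they are not taken from iDigBio's actual implementation.

```python
from typing import Mapping, Optional


def record_identifier(row: Mapping[str, str]) -> Optional[str]:
    """Return a record identifier for one row of provider data, or None."""
    # Prefer an identifier the provider supplied explicitly.
    # "recordID" is a hypothetical field name used only in this sketch.
    provided = (row.get("recordID") or "").strip()
    if provided:
        return provided

    # Otherwise construct one from provided information
    # (datasetID + occurrenceID); the ":" separator is an assumption.
    dataset_id = (row.get("datasetID") or "").strip()
    occurrence_id = (row.get("occurrenceID") or "").strip()
    if dataset_id and occurrence_id:
        return f"{dataset_id}:{occurrence_id}"

    # Not enough information to build an identifier.
    return None
```

Because the identifier names the record rather than the specimen, two providers can describe the same occurrenceID and still end up with distinct record identifiers.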

http://icons.iconarchive.com/, http://www.cagrid.org, http://en.wikipedia.org


iDigBio’s Use Case (cont.)

In practice, this means:
• iDigBio can collect information from a variety of sources about the same occurrenceID (or other identifier)
    • Eventually, we will be integrating all of the available information about all of the identifiers we know about into a single view
• Darwin Core Archives, as most people use them (specifically as generated by IPT), are somewhat cumbersome to use for our ideal use case
    • There is a non-trivial chance that the id field of the core file could be non-unique (due to the use case, or due to database mergers), which would be fine in the iDigBio data model, but leaves us no effective way to communicate record identifiers (normally done with resource relationships)


DwC-A as a Format
• Pros:
    • Provides a fairly space-efficient way to transmit data
    • Provides descriptive metadata for both the dataset (via EML) and the data files themselves
    • Properly declares file encodings, a huge issue for pure-text formats
    • Designed to be extensible
• Cons:
    • Uses multiple standards to represent information
        • zip, xml, tsv/csv, a variety of character encodings
        • A minor issue, but it does increase the number of dependencies and programming complexity
    • Perhaps not as prescriptive as it could be
        • The meta.xml file’s location within the zip file is not specified
        • Allows paths within the zip file for data files, as well as other options like URLs for data files
        • Somewhat needlessly complicates building fully compliant implementations
        • Separators/formats for fields with lists are loosely defined
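To make the "multiple standards" and "location not specified" points concrete, here is a minimal Python sketch that searches a zip for meta.xml wherever it sits and lists the data files and encodings it declares. It uses only the standard library. The helper names and the "prefer the shallowest match" rule are assumptions for illustration, and the sketch ignores options such as URL locations.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Tuple

# XML namespace used by the DwC-A text guidelines for meta.xml elements.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def find_meta_xml(archive: zipfile.ZipFile) -> str:
    """Return the path of meta.xml inside the zip, wherever it sits."""
    candidates = [name for name in archive.namelist()
                  if name.rsplit("/", 1)[-1].lower() == "meta.xml"]
    if not candidates:
        raise ValueError("no meta.xml found in archive")
    # Assumption for this sketch: prefer the shallowest match.
    return min(candidates, key=lambda name: name.count("/"))


def declared_files(archive: zipfile.ZipFile) -> List[Tuple[str, str, str]]:
    """List (kind, location, encoding) for the core and each extension."""
    root = ET.fromstring(archive.read(find_meta_xml(archive)))
    result = []
    for kind in ("core", "extension"):
        for element in root.findall(f"{DWC_TEXT_NS}{kind}"):
            encoding = element.get("encoding", "")
            path = f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location"
            for location in element.findall(path):
                result.append((kind, (location.text or "").strip(), encoding))
    return result
```

Even this small reader already depends on a zip library and an XML parser, and a complete one would also need a delimited-text parser and per-file encoding handling.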


Tool Support for DwC-A
• Pros:
    • Strong support from GBIF with IPT, Validator, Harvesting Toolkit, and Java libraries
    • Is gaining acceptance to the point where most tool vendors have at least some path for getting data out of a database in DwC-A (at least guidance on how to use IPT for export)
    • Standard is simple enough that files can be generated by hand from other data sources
    • A handful of other open-source implementations, but none are as complete as GBIF’s
        • GNA has a Ruby gem (dwc-archive)
        • Belgian Biodiversity Platform has a Python reader (https://github.com/BelgianBiodiversityPlatform/python-dwca-reader)
• Cons:
    • IPT is strongly tied to GBIF’s use case
        • Only supports Taxon and Occurrence as core
        • Extensions must be hosted by GBIF
        • Alternative, sometimes competing extensions
    • Only one full implementation of the DwC-A spec


DwC-A Standards
• Pros:
    • Active standards bodies (TDWG!) working to maintain and improve the core information standards on which the format relies
    • Open and well-documented standards process, with clear lines of communication with standards maintainers for implementation guidance
• Cons:
    • Most current standards activities are focused on semantic definitions, not content definitions (defined types other than strings)


Challenges
• Media-Only Collections
    • No direct support from IPT
    • Can create a stub specimen record to link to
• Multiple Specimens Per Image
    • No direct support from IPT
    • Can use resource relationship
    • Further compounded by many-to-many relationships
• Record identifiers with non-existent or duplicate occurrenceIDs
    • If not using IPT, can use a non-standard field
    • Could also potentially use dynamicProperties


Challenges (cont.)

[Figure: The MISC Data Model]


Broader Barriers – Identifiers
• The lack of strong standards for occurrenceIDs presents difficulties at every step. The same is true for other data types.
• Progress is being made, though
    • Specify 6.5 added strong identifiers to everything
    • iDigBio is working with the EMu user group to get identifiers into EMu
    • Symbiota has added record identifiers to all their collections and is working on getting the specimen identifiers from collections that have them
• Tool providers and developers should start pushing for strong identifiers whenever possible


Broader Barriers – Formatting
• Much like identifiers, the lack of defined data formats can seriously hinder data use.
• Where possible, standards bodies should reference well-defined standards for fields
• Even without action from standards bodies, tool providers should at least incorporate the ability to reference known standards for formats
    • ISO 3166-1 alpha-3 for countries
    • ISO 8601 for dates
    • ISO 639-2 for languages
    • JSON for hashes or array fields
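A small Python sketch of what referencing known formats could look like in a tool: dates are accepted only as ISO 8601 calendar dates, and list-valued fields are carried as JSON arrays rather than with an ad hoc separator. The function names are hypothetical, and only the YYYY-MM-DD form of ISO 8601 is exercised here; the full standard covers far more than this.

```python
import json
from datetime import date
from typing import List


def parse_iso_date(value: str) -> date:
    """Parse an ISO 8601 calendar date such as 2013-10-30.

    Raises ValueError for strings that are not ISO 8601 dates.
    """
    return date.fromisoformat(value)


def encode_list_field(values: List[str]) -> str:
    """Serialize a list-valued field as a JSON array."""
    return json.dumps(values)


def decode_list_field(text: str) -> List[str]:
    """Read a list-valued field back, rejecting anything that is not an array."""
    parsed = json.loads(text)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array")
    return parsed
```

With JSON, values that contain commas or pipes survive a round trip unchanged, which a loosely defined separator cannot guarantee.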


What iDigBio Does Now
• Specimens and Media
    • Specimens as the Core
    • Media in an Audubon Core extension
        • Linked via coreid
    • Record IDs associated via resource relationship
• Media Only
    • CSV files with Audubon Core fields
    • No structure description
    • No EML


iDigBio’s near future plans
• Expand the use of resource relationship, measurement or fact, and dynamic properties to provide more ways to specify non-standard properties as first-order properties.
    • Probably move to using dynamicProperties to provide record IDs (only fixes Darwin Core, though).
• Build a full Python implementation of a reader and writer that supports an arbitrary core type and any number of extensions.
    • Possibly ship core and extension schemas in the DwC-A when they are not already available on the internet
    • Add an option to enforce, or warn about, a uniqueness constraint on the core id field
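The enforce-or-warn option mentioned in these plans might look roughly like the Python sketch below. It assumes the core file has already been parsed into rows and that the id column's position is known (in a real archive it is declared in meta.xml). The function name and the `enforce` parameter are hypothetical and do not describe an existing iDigBio API.

```python
import warnings
from collections import Counter
from typing import Iterable, List, Sequence


def check_core_ids(rows: Iterable[Sequence[str]], id_index: int = 0,
                   enforce: bool = False) -> List[str]:
    """Return the core ids that occur more than once.

    With enforce=True a duplicate raises ValueError; otherwise it only
    triggers a warning, so archives with non-unique ids can still load.
    """
    counts = Counter(row[id_index] for row in rows)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    if duplicates:
        message = f"core id is not unique: {duplicates}"
        if enforce:
            raise ValueError(message)
        warnings.warn(message)
    return duplicates
```

Keeping the check optional matches the point made earlier in the talk: non-unique core ids are acceptable in the iDigBio data model, but other consumers may want to reject them.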


iDigBio’s suggestions for DwC-A
• Ditch / Minimize Core
    • Replace the core concept with a static backbone schema that only includes
        • dcterms:identifier (enforced to be locally unique, recommended to be globally unique)
        • dcterms:modified (in ISO 8601)
        • dcterms:type
        • A deleted flag (to enable explicit delete signaling)
    • Move all inter-type relationships to a relationships extension like resource relationship, with both ends of the relationship pointing at dcterms:identifier in the core (tools should enforce referential integrity)
        • Can do many-to-many with a groups extension and membership relationships
    • All data is now an extension to a minimal core
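One way to picture the suggested minimal core is the Python sketch below: a backbone record holding only the four proposed fields, a relationships table whose two ends both point at core identifiers, and the two checks a tool would run (local uniqueness and referential integrity). All class, field, and predicate names are illustrative assumptions; the slides propose the idea, not this layout.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CoreRecord:
    identifier: str        # dcterms:identifier, unique within the archive
    modified: str          # dcterms:modified, as ISO 8601 text
    record_type: str       # dcterms:type
    deleted: bool = False  # explicit delete signal


@dataclass(frozen=True)
class Relationship:
    subject_id: str        # dcterms:identifier of one end
    predicate: str         # e.g. "memberOf" or "depicts" (illustrative)
    object_id: str         # dcterms:identifier of the other end


def index_core(records: Iterable[CoreRecord]) -> Dict[str, CoreRecord]:
    """Index core records by identifier, enforcing local uniqueness."""
    index: Dict[str, CoreRecord] = {}
    for record in records:
        if record.identifier in index:
            raise ValueError(f"duplicate identifier: {record.identifier}")
        index[record.identifier] = record
    return index


def dangling(relationships: Iterable[Relationship],
             core: Dict[str, CoreRecord]) -> List[Relationship]:
    """Return relationships with an end that is missing from the core."""
    return [rel for rel in relationships
            if rel.subject_id not in core or rel.object_id not in core]
```

Many-to-many cases, such as one image showing several specimens, can then be expressed by adding a group record to the core and pointing membership relationships at it.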


DwC-A Implementation WG
• Form a multi-institution implementation working group with the goal of taking the current or a new standard and building full implementations (readers and writers) in multiple languages
    • Common test specifications, reference files, etc.
    • Release all the implementations on a common GitHub (or Google Code) repository so that they can be easily reused


Thanks for listening
• Special thanks to GBIF, TDWG, and the entire community for laying a great foundation to build on.
• Also thanks to Tim Robertson, Aaron Steele, and the other attendees of iDigBio’s IT standards workshop for all the valuable advice and for steering us in the right direction.
