Making your data work for you: Scratchpads, publishing & the Biodiversity Data Journal

Making your data work for you: Scratchpads, publishing & the Biodiversity Data Journal
Vince Smith¹, Dave Roberts¹ & Lyubomir Penev²
1. Natural History Museum, London
2. Pensoft Publishers, Sofia, Bulgaria
[email protected]
EBI, UK
25 September, 2012

Upload: vincent-smith

Post on 10-May-2015


DESCRIPTION

This is a derivative of a talk I gave at the Linnean Society on 20th Sept. 2012. This version was given at the i4Life Environmental Genomics workshop on 25th Sept. and refocused to look at the dark taxa problem and at developing published descriptions of molecular sequence clusters.

TRANSCRIPT

Page 1: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Making your data work for you: Scratchpads, publishing & the Biodiversity Data Journal

Vince Smith¹, Dave Roberts¹ & Lyubomir Penev²

1. Natural History Museum, London
2. Pensoft Publishers, Sofia, Bulgaria

[email protected]

EBI, UK
25 September, 2012

Page 2: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Our informatics grand challenge…

“Link together evolutionary data… by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses”

Cyndy Parr, Rob Guralnick, Nico Cellinese and Rod Page. TREE. doi:10.1016/j.tree.2011.11.001

Page 3: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Our informatics grand challenge…

Cyndy Parr, Rob Guralnick, Nico Cellinese and Rod Page. TREE. doi:10.1016/j.tree.2011.11.001

This requires data, information & knowledge to be…

• Digital: not printed paper

• Openly accessible: not behind barriers

• Linked-up: not in silos

“Link together evolutionary data… by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses”

Page 4: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

• 15-20k new spp. described annually (2M total)¹

• 30k nomenclatural acts (12M total)¹

• 20k phylogenies (750k total)²

• 31k taxa sequenced (360k taxa total)³

• 800k BioMed papers (40M total pp. of taxonomy)⁴

• Countless specimens, images, maps, keys…

Most of our output is not digital, open or linked

Typically generated by small communities for “local” research projects

Figures from 1) Zhang, Zootaxa 2011 4, 1-4; 2) Web-of-Science; 3) Genbank and 4) PubMed.

Page 5: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Scratchpad Virtual Research Environments

Making taxonomy digital, open & linked

Page 6: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

What is a Scratchpad?
A website for you & your community

1. Your data
2. Uploaded & tagged
3. “Published” & reviewed on your site

Fast, intuitive, fit for use

Page 7: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Scratchpads
• EDIT (07-11), ViBRANT / eMonocot (11-13)
• Hosted websites for taxonomists
• Taxonomic, regional or societal
• Research & publication platform
• Supports the taxonomic workflow
• Modular (Drupal) & flexible
• Two full time developers
• Ecosystem of communities (~450)

http://scratchpads.eu

Page 8: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Categories of Scratchpads

Taxa (classifications, taxon profiles, specimens, literature, images, maps, phenotypic, genotypic & morphometric datasets, keys, phylogenies)

Projects, Conservation, Regions, Societies

Page 9: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Summary of what Scratchpads can do

• Taxon pages, generated from tagged content (plant/animal)
• Bibliography management
• Character matrixes
• Specimen records
• Distribution maps (from specimens and regional)
• Images, video and sound (bulk import)
• Excel spreadsheet import (dynamically generated)
• Darwin Core Archive export
• Tabular data editing
• Custom content
• User management
• Custom webforms
• EOL data import (taxonomy, species information)
• GBIF Map integration

Page 10: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Scratchpad v.1 usage (2007 - Mar. 2012)

Nodes: 430,948
Sites: 326
Users: 6809
Active users: 5733 (273 w / 759 m)

[Chart of Sites and Users; labels: ViBRANT, SP 2]

• Prof. scientists
• Amateur naturalists
• Citizen scientists

Range: 1-1049
Mean: 15
Mode: 1

Page 11: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Scratchpad 2 – the new version of Scratchpads

• More professional
• Easier to…
- configure (workflows)
- navigate (facets)
- & populate (MS Excel templates)
• Greater standardisation
• Still highly flexible
• Project profiles (eMonocot)
• Framework for integration

• Launched March 2012
• 120 sites to date
• EOL Fellows
• SP1 migration ongoing

e.g. http://ihs.myspecies.info/

Page 12: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Getting data in and out of Scratchpads 2

Page 13: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Online community revision

Freeloader flies
http://milichiidae.info

• Taxonomy is in perpetual beta
- Constantly evolving
- Changing contributors
- Small granular contributions

• Sustainability
- A permanent space to work
- Guaranteed access (2016)
- Easy ways to get the data out

• Open science
- Beyond Open Access
- New ways of working
- Data management plans

• Need incentives to use
- More efficient (functions & reuse)
- Attribution & provenance
- Credit via citation

• New forms of publication

Page 14: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Publishing observations & taxon data

Specimen records & species pages on Scratchpads

Pushed to GBIF & EOL (requires site registration with GBIF & EOL)

>19k specimen records, >122k species pages

>377M specimen records in GBIF, >1M species pages in EOL

http://scratchpads.eu > http://gbif.org & http://eol.org

Darwin Core Archive (DwCA)
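A Darwin Core Archive is a zip file holding one or more delimited data files plus a meta.xml descriptor. As a rough illustration of the data-file part only, the sketch below maps specimen records onto standard Darwin Core terms and writes them as a tab-delimited occurrence table. The input field names (`record_id`, `taxon`, `lat`, `lon`, `collected`), the function name and the example record are assumptions made for this illustration; they are not taken from Scratchpads, and the sketch does not build meta.xml or the zip.

```python
import csv
import io

# Hypothetical site-side field names (left) mapped to real Darwin Core terms (right).
FIELD_TO_DWC = {
    "record_id": "occurrenceID",
    "taxon": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "collected": "eventDate",
}


def to_dwc_tsv(records):
    """Serialise specimen records as a tab-delimited Darwin Core occurrence table."""
    buffer = io.StringIO()
    columns = list(FIELD_TO_DWC.values()) + ["basisOfRecord"]
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells rather than raising an error.
        row = {dwc: record.get(local, "") for local, dwc in FIELD_TO_DWC.items()}
        # Assumption: every record here describes a preserved specimen.
        row["basisOfRecord"] = "PreservedSpecimen"
        writer.writerow(row)
    return buffer.getvalue()


# Made-up example record, for illustration only.
example_records = [
    {
        "record_id": "example-1",
        "taxon": "Aus bus",
        "lat": 51.4967,
        "lon": -0.1764,
        "collected": "2012-09-25",
    },
]
dwc_text = to_dwc_tsv(example_records)
print(dwc_text)
```

Keeping the mapping in one dictionary means a site with different internal field names only has to change `FIELD_TO_DWC`, not the export logic.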

Page 15: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Experiments with article publishing

Paper assembled from Scratchpad database

XML submission, peer review & marked-up publication by Pensoft

5-step workflow for selecting data, adding metadata & previewing

Published in ZooKeys & PhytoKeys (worldwide coverage)

PDF, HTML, XML

http://scratchpads.eu > http://pensoft.net

doi:10.3897/zookeys.50.539

Page 16: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Example papers via Scratchpads…

Blagoderov V, Hippa H, Nel A (2010). ZooKeys 50: 79–90. doi: 10.3897/zookeys.50.506

Faulwetter S, Chatzigeorgiou G, Galil BS, Nicolaidou A, Arvanitidis C (2011). ZooKeys 150: 327–345. doi: 10.3897/zookeys.150.1877

Brake I, von Tschirnhaus M (2010). ZooKeys 50: 91–96. doi: 10.3897/zookeys.50.505

Live (updated) versions of these papers:
http://milichiidae.info/node/14995
http://polychaetes.marbigen.org/node/35
http://sciaroidea.info/node/44428

Page 17: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

BDJ
The Biodiversity Data Journal

Making small data big!

Page 18: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Why do we need another new journal!!!
Taxonomy needs less fragmentation, not more!

BUT…
• We need to encourage taxonomists to mobilize & describe their data
• This takes considerable effort (e.g. Scratchpads)
• “Arguably” this is best rewarded through credit
• This means papers and citations
• Process must be very easy for authors
• Process must facilitate data reuse
• Meet “Open Data” policy commitments

• The Biodiversity Data Journal is very different…

Page 19: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Biodiversity Data Journal (BDJ)

• All data matters: No lower or upper limit of manuscript size!
• Multiple publishing routes (not just Scratchpads)
• ALL within a single online collaborative platform, including the writing of the manuscript!
• New collaborative article authoring tool
• Community peer review with “open” & “public” options
• This is in addition to conventional peer-review
• Online editorial process and version control
• Standards-compliant (Darwin Core, Dublin Core, NLM etc.)
• Pre-defined Code-compliant article templates

Page 20: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

BDJ publication & dissemination workflow

Page 21: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Pensoft manuscript writing tool

• Collaborative online editing
• Rich text capabilities
• Various templates for taxon treatments
• Identification keys builder
• Assembling plates from single figures
• References import (CrossRef, PubMed Central, etc.)
• Species occurrence data import (Darwin Core compliant)
• Smart citation for figures, tables, references & automated positioning

Page 22: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Testing screenshots of the writing tool

ID Key preview

Multi-figure plates

Plate layout

ID Key builder

Manuscript preview

Page 23: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Why publish in the BDJ?

• Joining (small) data into a large data pool
• Open-access, archiving and re-using your data through data aggregators
• Providing citation record and creditability for data in the form of peer-reviewed publications
• Facilitating online article authoring and editorial process for authors, reviewers and editors
• Using a truly innovative dissemination of atomized content
• Very low-cost. Free in the launch phase, thereafter at a fee that anyone can afford!

Page 24: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

What will BDJ publish?

• Single taxon treatments and nomenclatural acts
• Local or regional checklists
• Sampling reports and occasional inventories
• Habitat-based checklists and inventories
• Ecological and biological observations of species and communities?
• Single identification keys
• ANY KIND of biodiversity-related database, including genomic, ecological and environmental data (data papers)
• Biodiversity-related software tools

Starting late 2012, early 2013
Recruiting editors now

Page 25: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

BDJ
Barcoding, genomic & environmental sequence papers
Making small data big!

Page 26: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Mammal taxa added to Genbank annually

Proper Linnaean names

Aus sp. = “dark taxa”: taxa (specimens) that aren't identified to a known species
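The split between proper Linnaean names and "Aus sp."-style names can be approximated with a simple name check. The sketch below is a heuristic only: the list of open-nomenclature markers, the rule that any digit signals a provisional label, and the treatment of genus-only names as dark are all assumptions made for this illustration, not the method used to produce the Genbank charts.

```python
import re

# Assumed markers of an unidentified or provisional name (open nomenclature).
PLACEHOLDER_TOKENS = {"sp", "spp", "cf", "aff", "nr", "indet"}

# Genus + species epithet, with an optional third (subspecific) epithet.
LINNAEAN = re.compile(r"^[A-Z][a-z]+ [a-z-]+( [a-z-]+)?$")


def is_dark_taxon(name):
    """Heuristically decide whether a name is NOT a proper Linnaean species name."""
    name = name.strip()
    tokens = [token.rstrip(".").lower() for token in name.split()]
    # "Aus sp.", "Aus cf. bus" and similar carry a placeholder after the genus.
    if any(token in PLACEHOLDER_TOKENS for token in tokens[1:]):
        return True
    # Voucher codes and numbered morphospecies ("Aus sp. 2") contain digits.
    if any(char.isdigit() for char in name):
        return True
    # Anything else that is not a plain binomial or trinomial counts as dark.
    return LINNAEAN.match(name) is None


def dark_fraction(names):
    """Proportion of names in the list that are flagged as dark taxa."""
    names = list(names)
    if not names:
        return 0.0
    return sum(is_dark_taxon(name) for name in names) / len(names)


# Placeholder names in the style of the slide, for illustration only.
sample = ["Aus bus", "Aus sp.", "Aus sp. 2", "Cus dus"]
print(dark_fraction(sample))
```

A real analysis would need to handle authorship strings, hybrids and other name forms that this pattern would misclassify.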

Page 27: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Proportion of mammal dark taxa in Genbank

Proper Linnaean names

Aus sp.

Page 28: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

BOLD

Proportion of invert. dark taxa in Genbank

Page 29: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Dark taxa are the norm for bacteria

Page 30: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

A lesson in principles for dealing with dark taxa

Roth v. Wikipedia

http://www.newyorker.com/online/blogs/books/2012/09/an-open-letter-to-wikipedia.html

Page 31: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

But Wikipedia said “no”

“I understand your point that the author is the greatest authority on their own work,” writes the Wikipedia Administrator—“but we require secondary sources.”

Page 32: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

But Wikipedia said “no”

One of Wikipedia’s core principles, along with things like neutrality, is verifiability: a reader must be able to look at a statement in a Wikipedia article and find out where it comes from.

http://quominus.org/archives/981

Page 33: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Lessons for taxonomy & dark taxa…

http://quominus.org/archives/981

Taxonomic statements should be verifiable

Literature is the evidence base for taxonomy

Literature should be the evidence base for dark taxa

Page 34: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Example templates & dissemination

BIODIVERSITY MANUSCRIPT

Occurrence data
“Dark” taxon data
Image galleries
Morphometric data
Environmental sequence data
Genome descriptions
Any other data

XML MARK UP

Structured text (data!)

ARTICLES

Occurrence data
Taxon names
Taxon treatments

Plazi
BHL
Wiki
COL

Bibliographies

Page 35: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Example template & data fields

Page 36: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Workflow describing “Dark Taxa”

PWT – COLLABORATIVE ARTICLE AUTHORING TOOL

Dark taxon sequenced

BDJ – PEER-REVIEW

Automated submission to Pensoft Writing Tool

MANUSCRIPT PUBLISHED

Metadata: voucher specimen, images, locality, etc.

MANUSCRIPT FINALISATION & SUBMISSION

Automated update of bibliographic metadata, taxon name, Zoobank record, etc.

Page 37: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Data published

Descriptions

Images

Occurrences

Nomenclature

Literature

Plazi

Page 38: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

“Dark Taxon” papers

• Should contain…
- The scope of the taxonomic, ecological & geographic coverage
- The sources of voucher specimens
- The sampling & lab. protocols used
- The process used to ID taxa to which vouchers belong

• Possible data fields include…
- Average no. of records per taxon
- Range of records per taxon (min-max)
- Average, min. and max. sequence length
- Range of intraspecific variation
- Median variation within taxon (X%)
- Range of divergence to closest known taxon pairs (min & max?)
- Median divergence between closest taxon pair
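The data fields listed above are all simple summaries over sequence clusters, so they could in principle be generated automatically. The sketch below shows one way to compute them, assuming each taxon is a list of pre-aligned, equal-length sequences and using the uncorrected p-distance (proportion of differing sites) as the divergence measure. That distance, the function names and the toy sequences are choices made for this illustration; the slides do not specify how the values would be calculated.

```python
from itertools import combinations
from statistics import mean, median


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def spread(values):
    """Return (min, max, median), or None when there is nothing to summarise."""
    if not values:
        return None
    return (min(values), max(values), median(values))


def summarise(clusters):
    """clusters maps a taxon label to its list of aligned sequences."""
    counts = [len(seqs) for seqs in clusters.values()]
    # Sequence length is counted without alignment gaps.
    lengths = [len(s.replace("-", "")) for seqs in clusters.values() for s in seqs]
    # Intraspecific variation: all pairwise distances inside each taxon.
    within = [
        p_distance(a, b)
        for seqs in clusters.values()
        for a, b in combinations(seqs, 2)
    ]
    # Divergence to the closest other taxon: smallest distance to any outside sequence.
    nearest = []
    for label, seqs in clusters.items():
        others = [s for other, group in clusters.items() if other != label for s in group]
        if others:
            nearest.append(min(p_distance(a, b) for a in seqs for b in others))
    return {
        "records_per_taxon": (mean(counts), min(counts), max(counts)),
        "sequence_length": (mean(lengths), min(lengths), max(lengths)),
        "within_taxon_divergence": spread(within),
        "nearest_taxon_divergence": spread(nearest),
    }


# Toy aligned sequences, invented for illustration only.
toy_clusters = {
    "Aus sp. 1": ["ACGTACGT", "ACGTACGA"],
    "Aus sp. 2": ["TCGTTCGT", "TCGTTCGT", "TCGATCGT"],
}
summary = summarise(toy_clusters)
print(summary)
```

Each tuple is ordered (mean, min, max) for the counts and lengths, and (min, max, median) for the two divergence fields, which lines up with the average, range and median fields in the list.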

Page 39: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Possible discussion points…

• The concept…
- Is it a good approach to incentivize data publishing & good metadata practices?
- The suitability for “Dark Taxa”, new genomes and env. sequence data
- Is this more suitable for some data papers (e.g. dark taxa) than others?

• The practicalities…
- The fit to existing systems (both for data collection and dissemination)
- The data fields (“Dark Taxa”, new genomes and env. sequence data)
- Next steps in developing this concept

Page 40: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Acknowledgements

• Scratchpad technical development
- Simon Rycroft, Ben Scott, Ed Baker, Alice Heaton, Katherine Boulton

• Scratchpad outreach
- Irina Brake, Laurence Livermore, Dimitris Koureas

• eMonocot
- Paul Wilkin & the Kew team, Charles Godfray & the Oxford team

• ViBRANT
- Dave Roberts, Lucy Reeve & many many more

• Pensoft
- Lyubomir Penev, Teodor Georgiev & colleagues

• Our 7,000+ users

Page 44: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Why we need new methods of publishing…

Primary data

Drawings: Slavena Peneva

Publishing and sharing of primary data

RE-USE of CONTENT

Page 45: Making your data work for you: Scratchpads, publishing & the biodiversity data journal

Source: Wikipedia