Building data infrastructures for science
Vince Smith
Informatics Horizons, London, 24 July 2013
Overview
1. (my) Background
• Lice to data infrastructures!
• Why data infrastructures at the NHM
2. Building data infrastructures
• Recent core investment in NHM infrastructures
• Leveraging external investment in NHM infrastructures
• Infrastructure design principles & coordination
3. NHM 5-year data infrastructure horizons
• Collections digitisation
• Large-scale use of collections data
• New approaches to biodiversity discovery
4. Decadal community infrastructure challenges
• The long view – science data strategies
• Data modeling and real time monitoring as a unifying theme
1. (my) Background
Lice to data infrastructures!
Systematics (circa 1998)
- No high level keys
- Poor high level taxonomy
- Just one phylogeny
- Few living experts!
Circa 5,000 spp. Mammals & birds
12,000 associations 15,000 potential hosts
My data infrastructure (circa 1998)
Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.
142 pieces of “raw” data in 4 of 54 pages, in 1 of 9,110 taxonomic papers on lice
- Taxonomic names
- Authorities (name concepts)
- Citations
- Collection data
- Morphological characters
- Textual descriptions
- Diagnostic keys
- Illustrations
- Photographs
“The bane of my existence is doing things that I know the computer could do for me”
-- Dan Connolly, The XML Revolution (Nature, 1998)
http://darwin.zoology.gla.ac.uk/~rpage/LouseBase/2/
LouseBASE
Specimens Images
(SID)
http://darwin.zoology.gla.ac.uk/~SID/
Literature
PHPBib http://myphpbib.sourceforge.net/
Lab Notebook
http://www2.flmnh.ufl.edu/pdb/
Host-Parasite Checklists
http://www2.flmnh.ufl.edu/adb/
Glasgow version at:
My data infrastructures (circa 2004)
My publications in 2004 (enabled by these infrastructures)
Science; Proc. R. Soc. B; Syst. Biol.; Mol. Phyl. Evol.; Zoo. Scripta; Biol. Letters; PLoS Biology; Grzimek’s Ency.; Ent. Abh.
Images, Literature, Specimens, Checklists, Lab Notebooks
Making louse research more efficient, more collaborative and more productive
Why data infrastructures at the NHM: lots of potential
Card indices, Archives, Library, Staff, Labels, Frozen Tissue, Slides, Dry, Spirit
2. Building data infrastructures
Recent NHM investment in science data infrastructures
1. KE EMu (collections data)
• Improved interface (speed, complexity, data quality, support)
• Rapid Data Entry Web-Interface
• Improved import & export functionality (CLD & data portal)
2. DAMS (multimedia)?
• Review (Digital Strategy Group)
3. NHM Virtual Library (literature)
• Integrated search & discovery of NHM resources
• Better integration with external resources
4. NHM Data Portal (access, citation & archival)
• Discovery & visualisation of collections data on the Web
• Web exposure & archival of NHM research datasets
• Sub-portals for collaborative projects
• As strategically important as the Web in 3 years' time!
Enabling the NHM mission?
Collections Public Engagement Research
What are Scratchpads? (http://scratchpads.eu)
External investment in science data infrastructures
1. ViBRANT (EU FP7 Infrastructures, 17 partners, €4.75M)
• Virtual Biodiversity Research & Access Network for Taxonomy
• Building & integrating tools supporting biodiversity research communities (publishing, literature & vocabulary management, ID keys, conservation assessments, mapping & visualisation tools, citizen science support)
2. e-Monocot (NERC Consortium; Kew, Oxford & NHM, £2.38M)
• Sustainable, integrated resource on Monocot plants
• Content and supporting digital infrastructure (complete family level keys & taxon pages; generic keys & pages for 8 families; select species-level resources from European Monocots, Red-list species and Slipper orchids)
3. SYNTHESYS 1, 2 & 3 (EU FP5/6/7 Infrastructures, 18 partners, €10M)
• Support for physical access to participating collections
• JRA: Research into mass collections digitisation (image analysis, segmentation, transcription & crowdsourcing)
4. Others
• Open-UP
• BHL-EUROPE
Scratchpad VRE: foundation for ViBRANT & eMonocot
Taxa (classifications, taxon profiles, specimens, literature, images, maps, phenotypic, genotypic & morphometric datasets, keys, phylogenies)
Conservation Projects Regions Societies
Impact: Scratchpad usage (July 2013)
• 525 Scratchpad Communities by 6,550 active registered users, covering 73,444 taxa in 535,317 pages
• 81 paper citations in 2012
• 65,000 unique visitors/month; in total more than 1,300,000 visitors
• 119 NHM staff, 83 sites
[Chart: unique visitors per month to Scratchpad sites]
3. Our near-term infrastructure horizons
Digital Ambition: NHM Science Strategy 2013-2017
A New Voyage of Discovery
Three Focal Areas
1. Scientific discovery
2. Scientific infrastructure
3. Scientific engagement
Five Challenges
1. The digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills
Resources & funding
Measuring success
Digital Ambition: NHM Science Strategy 2013-2017
Collections digitisation
Large-scale use of collections data
New approaches to biodiversity discovery
Collections digitisation (data mobilisation)
Target
• 20M specimens available digitally in 5 years
Challenges
• Current fragmented efforts
• Heterogeneity of process
• Existing data (2.8M lots; 400k geo.; 120k images)
• Scale of operation (iCollections, 130k in 1 year)
• Transcription (Citizen Sci. / crowdsourcing)
• Data quality, annotation & feedback
Resources & funding
• Expensive (£20-£60M @ £1-3 per specimen)
• Linked to our public offer
Next steps (Sept. 2013)
• Coll. descriptions & protocols
• Greater coordination of effort
• Programme group with project portfolio?
• Planning of digital access via NHM Data Portal
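The cost and scale figures on this slide can be checked with simple arithmetic. The sketch below uses only the numbers quoted above (20M specimens, £1-3 per specimen, 130k specimens digitised by iCollections in one year). The function names are illustrative, and the throughput calculation assumes, purely for the sake of argument, a single workflow running at a constant iCollections-like rate.

```python
# Figures quoted on the slide.
TARGET_SPECIMENS = 20_000_000          # 20M specimens in 5 years
UNIT_COST_GBP = (1, 3)                 # £1-3 per specimen
ICOLLECTIONS_PER_YEAR = 130_000        # iCollections: 130k in 1 year


def cost_range_gbp(specimens, unit_cost_range):
    """Return the (low, high) total cost for a per-specimen cost range."""
    low, high = unit_cost_range
    return specimens * low, specimens * high


def years_at_rate(specimens, rate_per_year):
    """Years needed at a constant rate (an assumption, not a forecast)."""
    return specimens / rate_per_year


low, high = cost_range_gbp(TARGET_SPECIMENS, UNIT_COST_GBP)
years = years_at_rate(TARGET_SPECIMENS, ICOLLECTIONS_PER_YEAR)

print(f"Total cost: GBP {low / 1e6:.0f}M to {high / 1e6:.0f}M")
print(f"Years at the iCollections rate: {years:.0f}")
```

The cost range reproduces the £20-£60M quoted on the slide. The throughput figure suggests why scale of operation is listed as a challenge: one project at that rate would fall far short of a 5-year target.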
Large-scale use of collections data (or why digitise)
Data applications help set digitisation priorities
Potential applications for NHM data
• Invasive alien species
• Impacts of climate change
• Species conservation & protected areas
• Impacts of human development
• Biodiversity & human health
• Food, farming & biofuels
Sustainable delivery of data
• NHM Data Portal
• Promote access & reuse of data
• Sub-portals for specific themes
• Delivering content to third parties (e.g. GBIF)
Next steps (requirements)
• Storage (access, backup & archival)
• Citation, linking & measuring impact (identifiers)
• Data layering & visualisation
• H.P.C. (ecol. niche modeling & analysis)
NHM Data Portal
Data visualisation
[Figure: Crop Wild Relatives, plotted by plant family on a scale of 0 to 1000. Families shown: Poaceae, Leguminosae, Brassicaceae, Rosaceae, Solanaceae, Compositae, Rubiaceae, Vitaceae, Anacardiaceae, Araceae, Arecaceae, Moraceae, Malvaceae, Musaceae, Cucurbitaceae, Amaryllidaceae, Grossulariaceae, Amaranthaceae, Aquifoliaceae, Theaceae, Juglandaceae, Euphorbiaceae, Apiaceae, Caricaceae, Asparagaceae, Dioscoreaceae, Pedaliaceae, Rutaceae, Lauraceae, Betulaceae, Convolvulaceae, Myrtaceae, Oleaceae, Zingiberaceae, Bromeliaceae, Piperaceae, Lecythidaceae]
New approaches to biodiversity discovery (new types of data)
Take home messages from NHM Tropical Biodiversity Symposium
Molecular approaches
• Molecular detection & monitoring of organisms is routine
• Metagenomics (env. sequencing) commonplace
• Whole genomes are normal
• The primary route to understanding biodiversity for many
Ecological observatories
• Automated biodiversity detection
• Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
• Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
• Monitoring human activity
• Supplements field research, fills in gaps & scales
Digital infrastructure requirements
• Very large quantities of data (2.5-10TB per researcher per yr.)
• Doesn’t map to existing NHM collections infrastructures
• Challenges current networking & storage capacity
• Digital and physical collections become equally important?
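To get a feel for what 2.5-10TB per researcher per year means for storage planning, the per-researcher range can be scaled by headcount. This is a rough sketch only: the slides do not say how many researchers would generate data at this rate, so the figure of 50 below is a purely hypothetical placeholder, as is the function name.

```python
# Range quoted on the slide: 2.5-10 TB per researcher per year.
TB_PER_RESEARCHER_PER_YEAR = (2.5, 10.0)


def annual_storage_tb(researchers, per_researcher_tb=TB_PER_RESEARCHER_PER_YEAR):
    """Return the (low, high) new data volume per year, in TB."""
    low, high = per_researcher_tb
    return researchers * low, researchers * high


# Hypothetical headcount for illustration only; not a figure from the talk.
HYPOTHETICAL_RESEARCHERS = 50

low_tb, high_tb = annual_storage_tb(HYPOTHETICAL_RESEARCHERS)
print(f"{HYPOTHETICAL_RESEARCHERS} researchers: "
      f"{low_tb:.0f} to {high_tb:.0f} TB of new data per year")
```

Replacing the placeholder with a real headcount gives a first-order estimate of annual growth, before backup and archival copies are counted.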
3-4 June 2013, NHM
22 July, 2013
4. Community decadal challenges
The long view: community informatics challenges
GBIF GBIC Report (coming soon)
EU Biodiversity Strategy (2011)
Biodiv. Inf. Challenges (2013)
Modeling the biosphere: a (the) 30 year goal?
Nature 2013, doi:10.1038/493295a
A clear, singular long-term vision that NHM data can contribute to
QUESTIONS
Infrastructure design principles*
1. Start with needs - focus on real user needs (not just the ‘official process’)
2. Do less - if someone else is doing it, link to it or use it
3. Design with data - prototype and test with real users on the live website
4. Do the hard work to make it simple - let the computer take the strain
5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable
6. Build for inclusion – it’s easier in the long run
7. Understand context - we are designing for people, not a screen or a brand
8. Build digital services, not websites - there is life beyond the website
9. Be consistent, not uniform - every circumstance is different
10. Make things open: it makes things better - it’s more sustainable
= experience from 7 years with the Scratchpads
= lessons for building NHM data infrastructures?
*https://www.gov.uk/designprinciples
Better NHM digital coordination from 2013
Digital Strategy Group
Developing common vision
High level strategy
Director level engagement (Science, PEG & Corp. Services)
Digital Design Group
Delivering & leading digital activities
Fund raising (internal & external)
Prioritisation
Administrative support
Resource management
Analysis of impact
Digital Programme Group