Knowledge Sharing and Collaborative Problem Solving in Biodiversity Informatics Andrew C. Jones Cardiff University, UK



Knowledge Sharing and Collaborative Problem Solving in Biodiversity Informatics

Andrew C. Jones

Cardiff University, UK


The Species 2000 vision

• To enumerate all known species of plants, animals, fungi and microbes on Earth as the baseline dataset for studies of global biodiversity

• To provide a simple access point enabling users to link from Species 2000 to other data systems for all groups of organisms, using direct species-links

• To enable users worldwide to verify the scientific name, status and classification of any known species through species checklist data drawn from an array of participating databases

• (More recently) to provide a “synonymy server” for use as a service by other applications needing to obtain suitable scientific names, e.g. for querying biological data sets


Need for a catalogue

• Suppose we wished to retrieve all locations where specimens of Caragana arborescens have been collected, from various specimen distribution databases.

• A taxonomic checklist might include:
  Caragana arborescens Lam. [accepted name]
  Caragana sibirica Medikus [synonym]

• Classification of organisms is based on opinion regarding
– what the groups are
– identification of individuals

• So we need to use both these names as search terms

• In practice the problem might be far worse
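The need to search under every name for a taxon can be sketched in Python. This is a minimal illustration only: the dictionary layout and the function name `search_terms` are assumptions made for the example, not part of SPICE; only the two Caragana names come from the slide.

```python
# Hypothetical checklist layout: accepted name -> list of synonyms.
# The two names are the ones quoted on the slide.
CHECKLIST = {
    "Caragana arborescens Lam.": ["Caragana sibirica Medikus"],
}


def search_terms(name, checklist):
    """Return every name that should be used as a search term for `name`.

    `name` may be an accepted name or one of its synonyms.
    """
    for accepted, synonyms in checklist.items():
        if name == accepted or name in synonyms:
            return [accepted, *synonyms]
    # Unknown name: fall back to searching for the name exactly as given.
    return [name]


terms = search_terms("Caragana sibirica Medikus", CHECKLIST)
print(terms)
```

A specimen database would then be queried once per returned term and the results combined.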


SPICE for Species 2000: Meeting the Computing challenges

• The SPICE for Species 2000 project aimed to:
– build a federated ‘registry’ of scientific names organised by taxon (species, etc.)
– accommodate GSD (Global Species Database) heterogeneity
– accommodate GSD autonomy & instability
– ensure scalability

• Funding:
– SPICE was funded by the UK BBSRC/EPSRC Bioinformatics panel
– EuroCat – new EU-funded project to augment SPICE catalogue of life & develop/maintain SPICE software


SPICE Project Staff

Cardiff – Prof. Alex Gray, Dr. Andrew Jones, Prof. Nick Fiddian, Dr. Xuebiao Xu, (Mr. Nick Pittas).

Object and Knowledge-based Systems Group, Department of Computer Science, Cardiff University, PO Box 916, Cardiff CF24 3XF

Email: {W.A.Gray|Andrew.C.Jones|N.Fiddian|X.Xu|N.Pittas}@cs.cf.ac.uk

Telephone +44 (0)29 2087 4812

Reading – Prof. Frank Bisby, Prof. Sir Ghillean Prance and Dr. Sue Brandt.

Centre for Plant Diversity & Systematics, The University of Reading, Reading RG6 6AS

Email: {F.A.Bisby|S.M.Brandt}@reading.ac.uk

Telephone +44 (0)118 378 6437

Southampton – Dr. Richard White and Mr. John Robinson.

Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX

Email: {R.J.White|J.S.Robinson}@soton.ac.uk

Telephone +44 (0)23 8059 2021

Royal Botanic Gardens, Kew - Prof. Peter Crane, Dr. Don Kirkup, Ms. Sally Hinchcliffe, Mr. Graham Christian and others

Natural History Museum, London - Prof. Paul Henderson, Mr. Charles Hussey and others

BIOSIS UK - Mr. Michael Dadd, Ms. Judith Howcroft and others


Interactive use of SPICE …


Basic uses for the catalogue

• User wishes to check taxonomy of some organisms interactively; or

• User wishes to access or store data (observations, gene sequences, …) associated with a given species:
– Catalogue gives information about accepted name/synonyms
– Can use all names for retrieval, for example
– May well want to use the accepted name provided by SPICE for storing new data.


The “standard data”

• Comprises the information about a species which Species 2000 wishes to provide:
– AVCNameWithRefs
– SynonymWithRefs
– CommonNameWithRefs
– Family
– Comment
– Scrutiny
– DataLink
– Geography

• Minimalistic CDM devised:
– The basic information needed for a catalogue of life;
– If GSD can’t be wrapped to conform, probably doesn’t contain required information
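As a rough illustration, the “standard data” fields could be held in a record type like the one below. The field names follow the slide; the Python types, the defaults and the class name `StandardData` are assumptions for this sketch, since the slide does not give the CDM's actual structure. The example names are taken from the Type 1 response extract elsewhere in the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StandardData:
    """One species record. Field names mirror the slide; types are assumed."""

    avc_name_with_refs: str                                        # AVCNameWithRefs
    synonyms_with_refs: List[str] = field(default_factory=list)    # SynonymWithRefs
    common_names_with_refs: List[str] = field(default_factory=list)  # CommonNameWithRefs
    family: Optional[str] = None                                   # Family
    comment: Optional[str] = None                                  # Comment
    scrutiny: Optional[str] = None                                 # Scrutiny
    data_link: Optional[str] = None                                # DataLink
    geography: Optional[str] = None                                # Geography


# Only the name fields are filled in; the rest stay empty because the
# slides do not supply values for them.
record = StandardData(
    avc_name_with_refs="Abrus precatorius L.",
    synonyms_with_refs=["Abrus abrus (L.) Wright"],
)
print(record)
```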


Request Types 0-5

• Again, a fairly simple set of operations is required:
– Type 0: Get CDM version compliance for a GSD
– Type 1: Search for a name in a GSD
– Type 2: Fetch “standard data” about a chosen species
– Type 3: Get information about a GSD
– Type 4: Move up the taxonomic hierarchy
– Type 5: Move down the taxonomic hierarchy
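The six request types map naturally onto an enumeration. The sketch below is illustrative only: the numeric codes 0-5 come from the slide, while the member names and the class name `RequestType` are invented for the example.

```python
from enum import IntEnum


class RequestType(IntEnum):
    """Request types 0-5 from the slide; member names are hypothetical."""

    CDM_VERSION = 0          # Get CDM version compliance for a GSD
    SEARCH_NAME = 1          # Search for a name in a GSD
    FETCH_STANDARD_DATA = 2  # Fetch "standard data" about a chosen species
    GSD_INFO = 3             # Get information about a GSD
    MOVE_UP = 4              # Move up the taxonomic hierarchy
    MOVE_DOWN = 5            # Move down the taxonomic hierarchy


for request in RequestType:
    print(int(request), request.name)
```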


Type 1 response (XML) extract

<type1result>
  <SPECIESNAME>
    <SYNONYMWITHAVC>
      <SYNONYM>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>abrus</SPECIES>
          <AUTHORITY>(L.) Wright</AUTHORITY>
        </FULLNAME>
        <INFRASPECIFICPORTION> </INFRASPECIFICPORTION>
        <SYNONYMSTATUS>synonym</SYNONYMSTATUS>
      </SYNONYM>
      <AVCNAME>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>precatorius</SPECIES>
          <AUTHORITY>L.</AUTHORITY>
        </FULLNAME>
        <AVCSTAT>accepted</AVCSTAT>
        <IDL>1571</IDL>
      </AVCNAME>
    </SYNONYMWITHAVC>
  </SPECIESNAME>
  <SPECIESNAME> …
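A client could read such a response with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The element names are those in the extract; the helper names `full_name` and `synonym_pairs` are invented here, and the closing `</type1result>` tag is added to the sample only so that the truncated fragment parses.

```python
import xml.etree.ElementTree as ET

# The first SPECIESNAME element from the slide, closed off so it is well-formed.
SAMPLE = """<type1result>
  <SPECIESNAME>
    <SYNONYMWITHAVC>
      <SYNONYM>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>abrus</SPECIES>
          <AUTHORITY>(L.) Wright</AUTHORITY>
        </FULLNAME>
        <INFRASPECIFICPORTION> </INFRASPECIFICPORTION>
        <SYNONYMSTATUS>synonym</SYNONYMSTATUS>
      </SYNONYM>
      <AVCNAME>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>precatorius</SPECIES>
          <AUTHORITY>L.</AUTHORITY>
        </FULLNAME>
        <AVCSTAT>accepted</AVCSTAT>
        <IDL>1571</IDL>
      </AVCNAME>
    </SYNONYMWITHAVC>
  </SPECIESNAME>
</type1result>"""


def full_name(element):
    """Join GENUS, SPECIES and AUTHORITY from the FULLNAME child of `element`."""
    name = element.find("FULLNAME")
    parts = [name.findtext(tag, default="").strip()
             for tag in ("GENUS", "SPECIES", "AUTHORITY")]
    return " ".join(part for part in parts if part)


def synonym_pairs(xml_text):
    """Return (synonym, accepted name) pairs found in a Type 1 response."""
    root = ET.fromstring(xml_text)
    return [(full_name(item.find("SYNONYM")), full_name(item.find("AVCNAME")))
            for item in root.iter("SYNONYMWITHAVC")]


print(synonym_pairs(SAMPLE))
```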


SPICE architecture

[Figure: SPICE architecture diagram. Recoverable labels: User (Web browser), shown twice; Common Access System (CAS), comprising the User Server module (HTTP), the ‘Query’ co-ordinator and the CAS knowledge repository (taxonomic hierarchy, annual checklist, genus and other caches, ...); CORBA; XML/CGI; Internal wrapper; External wrapper; (in some cases, generic) CORBA ‘wrapper’ element of GSD Wrapper; Wrapper (e.g. JDBC); Wrapper (e.g. CGI/XML + ODBC); GSD, GSD.]
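The federated idea behind this architecture, a co-ordinator sending one query to several wrapped GSDs and tolerating an unavailable one, can be sketched as follows. This is not the SPICE implementation (which the slides describe as using CORBA and XML/CGI); every class, function and GSD name below is hypothetical.

```python
from typing import Dict, List, Protocol


class Wrapper(Protocol):
    """Anything that can search one GSD for a name (hypothetical interface)."""

    def search(self, name: str) -> List[str]: ...


class InMemoryWrapper:
    """Stand-in for a wrapped GSD, backed by a plain list of names."""

    def __init__(self, names):
        self.names = list(names)

    def search(self, name):
        # Case-insensitive substring match on the stored names.
        return [n for n in self.names if name.lower() in n.lower()]


class UnavailableWrapper:
    """Stand-in for a GSD that is currently unreachable."""

    def search(self, name):
        raise ConnectionError("GSD unavailable")


def query_all(wrappers: Dict[str, Wrapper], name: str) -> Dict[str, List[str]]:
    """Send the same search to every wrapper; an unreachable GSD yields no hits."""
    results = {}
    for gsd, wrapper in wrappers.items():
        try:
            results[gsd] = wrapper.search(name)
        except ConnectionError:
            results[gsd] = []  # tolerate an unstable GSD instead of failing
    return results


wrappers = {
    "gsd_one": InMemoryWrapper(["Abrus precatorius L.", "Caragana arborescens Lam."]),
    "gsd_two": UnavailableWrapper(),
}
print(query_all(wrappers, "abrus"))
```

Keeping each GSD behind a uniform `search` interface is what lets the co-ordinator stay ignorant of how any individual database is stored or accessed.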


Why a federation of autonomous, heterogeneous GSDs?

• Taxonomists have specialist knowledge of a limited range of organisms, and want to make their data available in various ways

• So
– the hierarchy is divided into sectors, with an individual or group of scientists responsible for each
– scientists are given control over their databases
– we accommodate existing heterogeneous GSDs; also new ones built for various purposes

• This helps assure taxonomic data quality (peer review of GSDs is also used)


Specialist GSDs mean better data quality than non-specialist ones …

• … but data quality problems still arise:
– “Non-overlapping” sectors may, in fact, overlap
– GSDs may be inconsistent taxonomically
– GSDs may be formed by merging two or more other databases, mutually inconsistent


LITCHI Project

A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases


Project Staff

Suzanne Embury, Alex Gray, Andrew Jones, Iain Sutherland
Object and Knowledge-based Systems Group, Department of Computer Science, University of Wales, Cardiff, PO Box 916, Cardiff CF24 3XF

Frank Bisby, Sue Brandt
Centre for Plant Diversity and Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS

John Robinson, Richard White
Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX


Summary

• We modelled the knowledge integrity rules in a taxonomic treatment

• The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon (examples later)

• Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases


Example 1

Checklist A

• Caragana arborescens Lam. [accepted name]
  Caragana sibirica Medikus [synonym]

Checklist B

• Caragana sibirica Medikus [accepted name]
  Caragana arborescens Lam. [synonym]


Example 2

In the case of the species Cytisus scoparius:

Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)

Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)

Treatment A recognises one genus, Cytisus:
  Genus Cytisus: Cytisus scoparius, Cytisus striatus, Cytisus multiflorus, Cytisus praecox

Treatment B recognises two genera, Cytisus and Sarothamnus:
  Genus Sarothamnus: Sarothamnus scoparius, Sarothamnus striatus
  Genus Cytisus: Cytisus multiflorus, Cytisus praecox


Example of a rule

• In each of the 2 examples, merging the checklists would lead to violation of:
– “A full name which is not a pro-parte name may not appear as both an accepted name and a synonym in the same checklist”

• (Violations of other rules help user to distinguish the taxonomic causes; various options to repair this violation)

$$\forall n, a, l, c_1, c_2, t_1, t_2 :\ \mathit{accepted\_name}(n, a, c_1, l, t_1) \land \mathit{synonym}(n, a, c_2, l, t_2) \Rightarrow \mathit{pro\_parte}(c_1) \land \mathit{pro\_parte}(c_2)$$

violation :- accepted_name(N,A,C1,L,T1), synonym(N,A,C2,L,T2), (\+pro_parte(C1); \+pro_parte(C2)).
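The same check can be sketched in Python for readers who do not use Prolog. This is an illustration, not LITCHI code. The five record fields mirror the predicate arguments (N, A, C, L, T); reading them as name, author, record code, checklist and taxon is an assumption, as are the record codes and the single merged checklist built from Example 1.

```python
from collections import namedtuple

# Fields mirror (N, A, C, L, T) in the Prolog rule; their meanings are assumed.
Record = namedtuple("Record", "name author code checklist taxon")


def violations(accepted, synonyms, pro_parte_codes):
    """Return (accepted, synonym) pairs that break the rule.

    A pair is a violation when the same name and author appear as both an
    accepted name and a synonym in the same checklist, unless both records
    are marked pro parte.
    """
    found = []
    for acc in accepted:
        for syn in synonyms:
            same_full_name = (acc.name, acc.author, acc.checklist) == \
                             (syn.name, syn.author, syn.checklist)
            both_pro_parte = (acc.code in pro_parte_codes
                              and syn.code in pro_parte_codes)
            if same_full_name and not both_pro_parte:
                found.append((acc, syn))
    return found


# Checklists A and B from Example 1, merged into one hypothetical checklist.
accepted = [
    Record("Caragana arborescens", "Lam.", "a1", "merged", "t1"),
    Record("Caragana sibirica", "Medikus", "b1", "merged", "t2"),
]
synonyms = [
    Record("Caragana sibirica", "Medikus", "a2", "merged", "t1"),
    Record("Caragana arborescens", "Lam.", "b2", "merged", "t2"),
]

for acc, syn in violations(accepted, synonyms, pro_parte_codes=set()):
    print(f"{acc.name} {acc.author}: accepted ({acc.code}) and synonym ({syn.code})")
```

With no records marked pro parte, both names are reported, which matches the conflict the slide describes for the merged checklists.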


Conflict display


LITCHI: current status

• Good selection of rules (for botanical nomenclature)

• A research project, now in need of re-engineering:
– Implemented in Prolog & Visual Basic; not portable
– Uses XDF file format for data import/export


Some future developments of LITCHI

• BiodiversityWorld
– BiodiversityWorld is not funded to develop LITCHI at all, but will be able to take advantage of LITCHI developments for ‘taxonomically intelligent navigation’

• EuroCat
– Re-engineer LITCHI, to work with GSDs wrapped to SPICE CDM 1.2
– Use for
  • Intra- and inter-GSD consistency checking
  • Navigation between resources organised according to differing taxonomies, e.g. for access to regional hubs
– Use in conjunction with, and for generating, ‘cross-maps’


LITCHI in (future) use

[Figure: flow diagram of LITCHI in future use. Recoverable labels: Checklist A; Checklist B; Read into system; Rules; Conflict detection; Conflict description; Conflict display; Possible repairs; Conflict repair (not necessarily used in this context); Write; Cross-map; Taxonomic intelligence.]


BiodiversityWorld

• Problem solving environment for biodiversity informatics on the GRID

• UK BBSRC-funded

• Universities of Reading, Cardiff & Southampton, and The Natural History Museum, London


BiodiversityWorld – The Challenge

Some difficult Biodiversity questions

• How should conservation efforts be concentrated?
– (example of Biodiversity Richness & Conservation Evaluation)

• Where might a species be expected to occur, under present or predicted climatic conditions?
– (example of Bioclimatic modelling and Climate Change)

• Is geography a good predictor of relationship between lineages? (e.g. are the more closely related species found near each other?)
– (example of Phylogenetic Analysis & Biogeography)


Some relevant resource types

• Data sources:
– Catalogue of life
– Species Information Sources (SISs)
  • Species geography
  • Descriptive data
  • Specimen distribution
– Geographical
  • Boundaries of geographical & political units
  • Climate surfaces
– Genetic sequences

• Analytic tools:
– Biodiversity richness assessment – various metrics
– Bioclimatic modelling – bioclimatic ‘envelope’ generation
– Phylogenetic analysis (generation of phylogenetic trees)


Some challenges …

• Finding the resources

• Knowing how to use these heterogeneous resources
– Originally constructed for various reasons
– Often little thought was given to standards or interoperability

• One important specific issue: using appropriate scientific name for SIS queries (hence SPICE for Species 2000)


Our vision

• Biodiversity Problem Solving Environment:
– Heterogeneous diverse resources
– Flexible workflows
– Main challenges centre around metadata, interoperability, etc.
– High-performance computing secondary (though relevant)

• Our previous GRAB demonstrator illustrates some Bioclimatic Modelling elements, with a fixed workflow …


Typical GRAB display

[Figure: screenshot of a typical GRAB display. Callouts: Web browser ‘front-end’ to the GRAB server; applet monitoring communication between GRAB server and GRAB databases.]


Why the GRID for BiodiversityWorld (or even GRAB?)

• HPC; mobility of data & programs

• Resource discovery

• OGSA (Open Grid Services Architecture) – not Globus-specific – gives Web Services & life cycle management, etc.

• Workflow for orchestrating resources, etc.


BiodiversityWorld architecture

[Figure: BiodiversityWorld architecture diagram. Recoverable labels: User; Local tools; Problem Solving Environment User Interface; Problem Solving Environment: Broker agents, Facilitator agents, Presentation agents; BioD-GRID Ontology: Metadata, Intelligent links, Resource & Analytic tool descriptions, Maintenance tools; Proxy (several); Taxonomic index (SPICE Catalogue of Life) with four GSDs; Thematic Data source; Abiotic Data source; Analytic tool (two).]


Bioclimatic modelling Case Study - Leucaena leucocephala

• Leucaena leucocephala (Lam.) De Wit

• Native of Central America

• Widely introduced around the tropics

• Widely utilised around the globe for:
– Wood
– Forage
– Soil enrichment and erosion control

• Regarded as an invasive weed in some areas


Point data from various herbaria


Distribution data from ILDIS database


GARP prediction of climatic suitability


Workflow

• Our PSE should provide flexible support for development of complex workflows for:
– experimental design of in silico biodiversity-related experiments
– repeatability
– modification of experiments


Typical workflow

[Figure: typical workflow diagram. Recoverable labels: START; STAGE 1; STAGE 2; STAGE 3; Enquiry name(s); Species 2000 Catalogue of Life; Distributed Array of GSD’s; Returns list of accepted taxa, synonyms and common names; Enquiry: select ‘data’ for ‘taxon set’; Distributed array of thematic data sources; Return dataset composed of homologous responses from multiple thematic data sources; Analytical Toolbox; Reference to Abiotic datasets; Presentation and storage of results.]


Initial test workflow

[Figure: initial test workflow diagram. Recoverable node labels: SPICE; Localities; Climate Space Model; Base Maps; Climate; Climate; Prediction. Annotations: Submit scientific name; retrieve accepted name & synonyms for species. Retrieve distribution maps for species of interest. Climate surfaces. Model of climatic conditions where species is currently found. Possibly different climate surfaces (e.g. predicted climate). World or regional maps. Prediction of suitable regions for species of interest.]
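The modelling and prediction steps of this workflow can be illustrated with a very simple bioclimatic envelope. Note the substitution: the talk shows predictions made with GARP, whereas the sketch below uses a plain rectangular (min/max) envelope because it fits in a few lines. All cell identifiers, climate values and function names are invented toy data, not results from the project.

```python
def fit_envelope(localities, climate):
    """Model step: min/max of each climate variable where the species occurs.

    `climate` maps a grid-cell id to a tuple of climate values; `localities`
    lists the cells where the species has been recorded.
    """
    observed = [climate[cell] for cell in localities if cell in climate]
    n_vars = len(observed[0])
    return [(min(v[i] for v in observed), max(v[i] for v in observed))
            for i in range(n_vars)]


def predict(envelope, climate):
    """Prediction step: cells whose every climate value lies inside the envelope."""
    return {cell for cell, values in climate.items()
            if all(low <= x <= high for x, (low, high) in zip(values, envelope))}


# Toy climate surfaces: (mean temperature in degrees C, annual rainfall in mm).
present_climate = {
    "A": (25.0, 1200.0),
    "B": (27.0, 900.0),
    "C": (10.0, 600.0),
    "D": (26.0, 1000.0),
}
# A possibly different climate surface, e.g. a predicted climate.
predicted_climate = {
    "A": (29.0, 1100.0),
    "C": (26.0, 950.0),
}

localities = ["A", "B"]  # cells where the species is currently found
envelope = fit_envelope(localities, present_climate)

print(sorted(predict(envelope, present_climate)))
print(sorted(predict(envelope, predicted_climate)))
```

Fitting on one climate surface and predicting on another corresponds to the “possibly different climate surfaces” branch of the diagram.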


BiodiversityWorld – much more complex than SPICE

• Much more heterogeneity
– diverse kinds of databases and tools

• Much greater range of data quality and terminology problems, e.g.
– accuracy of “point data”
– country names
– …


Role/use of metadata

• Descriptive

• Create electronic book for user

• Create workflows
– necessary transformations
– provenances
– interoperability

• Locate appropriate elements

• Rerun processing (possibly with modifications)


Conclusion

• The field of biodiversity informatics presents various challenges including:
– taxonomic/naming
– heterogeneity & autonomy
– data quality
– need for extensive metadata