http://www.systematics.rdg.ac.uk/spice/
Common Data Models and Protocols
Richard White, Cardiff University
Talk given at “Making Species Databases Interoperable”, Reading, 15 July 2004
SPICE for Species 2000
Funded in the UK by the BBSRC/EPSRC Bioinformatics Initiative
Universities of Cardiff & Reading
Species 2000
The story so far ...
Species 2000 is an international collaborative project to create and provide access to an authoritative and up-to-date checklist and index to all the world’s species.
How is it going to do this?
Species 2000 services to users
– Dynamic Checklist
– Annual Checklist
– Web site, including database links submitted by users or producers
– Distribution media, including downloaded data
– Index to species information (hyperlinks to SISs)
– Packaged functions providing services to other software
Species 2000 organisation
– Taxonomic hierarchy (or hierarchies)
– Species
– Global species databases (GSDs) and interim checklists: the species index
– Species information sources (SISs): regional faunas and floras, specialist or sectoral databases, web pages etc.
Merging & Linking
Merging: the original databases are physically copied into a new combined database.
Linking: the original databases remain separate, but are accessed through a single system.
Merging
1. The original databases are physically copied into a new combined database.
2. The user interacts with the new combined database.
[Diagram: databases “Plants of Europe”, “Plants of Africa” and “Plants of the World”, with arrows numbered 1 and 2 matching the steps above.]
Linking
1. The user interacts with an access system which does not itself contain data.
2. When the user requests data, it is fetched from the appropriate database.
[Diagram: databases “Plants of Europe”, “Plants of Africa” and “Plants of the World”, with arrows numbered 1 and 2 matching the steps above.]
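The difference between the two approaches can be shown with a toy sketch. The dictionaries and function names below are invented for illustration and are not part of Spice; each "database" is just a mapping from species name to region.

```python
# Hypothetical toy data: each "database" maps a species name to a region.
plants_of_europe = {"Cytisus scoparius": "Europe"}
plants_of_africa = {"Caesalpinia bonduc": "Africa"}

# Merging: copy every record into one new combined database (step 1),
# then answer queries from that copy (step 2).
merged = {**plants_of_europe, **plants_of_africa}

def query_merged(name):
    return merged.get(name)

# Linking: the access system holds no data of its own (step 1);
# each request is forwarded to the source databases at query time (step 2).
SOURCES = [plants_of_europe, plants_of_africa]

def query_linked(name):
    for db in SOURCES:
        if name in db:
            return db[name]
    return None
```

If a record is later added to one of the source databases, `query_linked` sees it immediately, whereas `query_merged` does not until the copy is rebuilt.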
Architecture of Species 2000
[Diagram: User interface; Data collector; CAS (Common Access System), or “harness”; Protocol; distributed array of databases, each a GSD behind its own Wrapper (three shown).]
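A minimal sketch of this arrangement, assuming in-memory lists in place of real GSDs and ordinary method calls in place of the protocol. The class names, GSD identifiers and `search` method are hypothetical, not the Spice API.

```python
class ListWrapper:
    """Hypothetical wrapper around an in-memory list standing in for a GSD."""

    def __init__(self, gsd_id, names):
        self.gsd_id = gsd_id
        self._names = names

    def search(self, text):
        # Case-insensitive substring match against the GSD's names.
        return [n for n in self._names if text.lower() in n.lower()]


class CommonAccessSystem:
    """Holds no species data; forwards each request to every wrapper."""

    def __init__(self, wrappers):
        self._wrappers = list(wrappers)

    def search(self, text):
        # Collect (GSD identifier, name) pairs so the caller knows the source.
        return [(w.gsd_id, name) for w in self._wrappers for name in w.search(text)]


cas = CommonAccessSystem([
    ListWrapper("legumes", ["Cytisus scoparius", "Caesalpinia crista"]),
    ListWrapper("fishes", ["Salmo trutta"]),
])
```

A user interface would call only `cas.search(...)`; it never needs to know how many GSDs exist or how each one stores its data.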
Need for communication
Different people are building the various components of the system:
– GSDs
– wrappers
– CAS
– user interface
We need to ensure they all have a common understanding of the data, to avoid embarrassing mistakes.
Database wrappers
Only the interface to the CAS needs to speak CORBA.
Wrappers must:
– Translate CAS requests into a form suitable for the GSD (e.g. SQL) and translate responses back
– Deal with other kinds of heterogeneity, including schema heterogeneity
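The two wrapper duties can be sketched as follows, assuming a GSD held in SQLite. The table name, column names and output field names are invented for illustration; a real wrapper would map onto the fields defined in the CDM.

```python
import sqlite3

# Hypothetical GSD with its own local schema (table and column names are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxa (tax_id INTEGER, latin_name TEXT)")
conn.executemany(
    "INSERT INTO taxa VALUES (?, ?)",
    [(1, "Cytisus scoparius"), (2, "Sarothamnus scoparius")],
)

def search_names(search_string, limit=10):
    """Translate a name-search request into SQL and map rows back to common field names."""
    rows = conn.execute(
        "SELECT tax_id, latin_name FROM taxa "
        "WHERE latin_name LIKE ? ORDER BY tax_id LIMIT ?",
        (search_string + "%", limit),
    ).fetchall()
    # Schema heterogeneity: rename local columns to the names the CAS expects.
    return [{"Identifier": str(tax_id), "SpeciesName": name} for tax_id, name in rows]
```

The CAS sees only `Identifier` and `SpeciesName`; the local names `tax_id` and `latin_name` never leave the wrapper.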
Data flow through a wrapper
[Diagram: CAS; Wrapper interface; GSD; a divided wrapper and an external wrapper; data passed as XML or as strings, e.g. CGI.]
Common Data Model
We need a Common Data Model (CDM):
– A definition of the information being passed to and fro
– Human-readable, not machine-readable
– This is used as a reference when creating specific implementations for CGI/XML (DTD, XML Schema), Web Services, etc.
What does the CDM look like?
It defines the input (“request”) and output (“response”) for six fundamental operations which the system needs to be able to carry out
Request Types 0-5
– Type 0: Get version of the CDM with which the GSD complies
– Type 3: Get information about the GSD
– Type 1: Search for a name in the GSD
– Type 2: Fetch “standard data” about a chosen species
– Type 4: Move up the taxonomic hierarchy
– Type 5: Move down the taxonomic hierarchy
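One way to carry these request types in code is an enumeration plus a dispatch table. This is only a sketch: the member names and the `handle` helper are hypothetical, and only the type numbers and the field name `CDMVersion` come from the CDM.

```python
from enum import IntEnum

class RequestType(IntEnum):
    """The six request types, numbered as in the CDM (member names are invented)."""
    GET_CDM_VERSION = 0
    SEARCH_NAME = 1
    FETCH_STANDARD_DATA = 2
    GET_GSD_INFO = 3
    MOVE_UP_HIERARCHY = 4
    MOVE_DOWN_HIERARCHY = 5

def handle(request_type, handlers, **params):
    """Look up the handler for a request type and call it with the request fields."""
    return handlers[RequestType(request_type)](**params)

# A wrapper would register one handler per type; only Type 0 is shown here.
handlers = {RequestType.GET_CDM_VERSION: lambda: {"CDMVersion": "1.20"}}
```

Calling `handle(0, handlers)` returns the version response; an unknown type number raises `ValueError` from the enumeration before any handler runs.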
Type 0 Request
Request:
– (nothing)
Response:
– CDMVersion
Type 3 Request
Request:
– GSDIdentifier
Response:
– GSDInfo (a set of fields including its name, date of last editing, etc.)
Type 1 Request
Request:
– SearchString, SearchType (scientific name, common name, unknown), SearchLimit (including higher taxon, maximum number of names to return)
Response:
– Number, SpeciesName[0:N]
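The Type 1 request and response might be represented as below. The attribute names, the default limit of 50 and the prefix-matching rule are assumptions, and `Number` is assumed here to be the count of names returned. The CDM itself defines only the fields listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class SearchType(Enum):
    SCIENTIFIC_NAME = "scientific"
    COMMON_NAME = "common"
    UNKNOWN = "unknown"

@dataclass
class Type1Request:
    search_string: str
    search_type: SearchType = SearchType.UNKNOWN
    higher_taxon: Optional[str] = None  # part of SearchLimit
    max_names: int = 50                 # part of SearchLimit; default is an assumption

@dataclass
class Type1Response:
    species_names: List[str] = field(default_factory=list)

    @property
    def number(self) -> int:
        # Assumed meaning of "Number": how many names are in this response.
        return len(self.species_names)

def search(request, names):
    """Toy search: case-insensitive prefix match, capped at max_names."""
    prefix = request.search_string.lower()
    hits = [n for n in names if n.lower().startswith(prefix)]
    return Type1Response(hits[: request.max_names])
```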
Type 2 Request
Request:
– Identifier, GSDIdentifier
Response:
– StandardData (approximately the same as the Standard Data defined by Species 2000 and seen by the user)
Type 4 Request
Request:
– Identifier, GSDIdentifier
Response:
– HigherTaxon[0:N]
Type 5 Request
Request:
– Identifier, SearchLimit
Response:
– Taxon[0:N]
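Types 4 and 5 amount to walking a tree upwards or downwards. The sketch below assumes the hierarchy is stored as child-to-parent links and that Type 4 returns the whole chain of ancestors; whether the real response holds one level or several is not stated here, and the data and function names are invented.

```python
# Hypothetical parent links: child identifier -> parent identifier.
PARENT = {
    "Cytisus scoparius": "Cytisus",
    "Cytisus striatus": "Cytisus",
    "Cytisus": "Leguminosae",
}

def higher_taxa(identifier):
    """Type 4 style: walk upwards, nearest ancestor first."""
    chain = []
    while identifier in PARENT:
        identifier = PARENT[identifier]
        chain.append(identifier)
    return chain

def child_taxa(identifier, limit=None):
    """Type 5 style: list immediate children, optionally capped by a search limit."""
    children = sorted(c for c, p in PARENT.items() if p == identifier)
    return children if limit is None else children[:limit]
```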
The “standard data”
This comprises the information about a species which Species 2000 wishes to provide:
– AVCNameWithRefs
– SynonymWithRefs
– CommonNameWithRefs
– Family (or other agreed higher taxon)
– Comment
– Scrutiny
– DataLink (links to the GSD’s or other web pages)
– Geography (list of places)
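As a rough sketch, the fields above could be held in a record like this. The attribute names, the types and the choice of which fields are optional are assumptions, not taken from the CDM document or its DTD.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NameWithRefs:
    """A name together with its literature references (structure assumed)."""
    name: str
    references: List[str] = field(default_factory=list)

@dataclass
class StandardData:
    avc_name: NameWithRefs                                         # AVCNameWithRefs
    synonyms: List[NameWithRefs] = field(default_factory=list)     # SynonymWithRefs
    common_names: List[NameWithRefs] = field(default_factory=list) # CommonNameWithRefs
    family: Optional[str] = None          # or other agreed higher taxon
    comment: Optional[str] = None
    scrutiny: Optional[str] = None
    data_links: List[str] = field(default_factory=list)            # DataLink
    geography: List[str] = field(default_factory=list)             # list of places

# Illustrative record only; the values are not taken from any GSD.
record = StandardData(
    avc_name=NameWithRefs("Cytisus scoparius"),
    synonyms=[NameWithRefs("Sarothamnus scoparius")],
)
```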
Where are we now?
Is the Spice Project finished?
– We have a fairly stable CDM (version 1.20 is about to be replaced with version 1.21)
– XML DTD exists
– Several CGI/XML implementations in Java and PHP, and a Web Service
– We have a working Spice system
– A few changes are anticipated:
  • geographical information
  • linking to further information sources
  • infraspecific taxa
“Intelligent” linking
Species 2000 is
– not just a catalogue (which lists things)
– but an index (which points to things)
It plans to provide links to take a user
– from a species entry (from a GSD)
– to further sources of information about that particular species (Species Information Sources or SISs)
“Intelligent” linking
There are experimental “unintelligent” links already (as in the ILDIS GSD), which rely on exact name matching.
But there are issues in making links more intelligent.
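A small sketch of why exact name matching is limiting. The SIS index, its URL and both functions are hypothetical; the second function is only one possible first step towards smarter links and is not how Species 2000 implements them.

```python
# Hypothetical SIS index: the name used by the SIS -> URL of its page.
sis_pages = {"Sarothamnus scoparius": "http://sis.example.org/broom"}

def exact_link(gsd_name):
    """Return a link only if the SIS uses exactly the same name string."""
    return sis_pages.get(gsd_name)

def synonym_aware_link(gsd_name, synonyms):
    """Try the accepted name first, then each synonym the GSD records for it."""
    for candidate in [gsd_name, *synonyms]:
        if candidate in sis_pages:
            return sis_pages[candidate]
    return None
```

Here `exact_link("Cytisus scoparius")` finds nothing, because the SIS files the species under Sarothamnus, while the synonym-aware lookup succeeds. Neither approach can tell whether the two resources mean the same species concept by that name.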
Data quality (again!)
How do we know the information is reliable?
One problem is the differing interpretation of species names (species concepts) in different resources
LITCHI Project
A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases
Summary of Litchi project
We modelled the knowledge integrity rules in a taxonomic treatment.
The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon.
Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases.
Version 2, now implemented, focuses on the creation of “cross-maps”.
Example 1
Checklist A
Caesalpinia crista L. [accepted name]
Checklist B
Caesalpinia crista L. [accepted name]
Caesalpinia bonduc (L.) Roxb. [accepted name]
Caesalpinia crista L., p.p. [synonym]
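The kind of check involved can be sketched as follows. This is not LITCHI's rule set, only an illustration of one possible rule: flag any name that points at more than one accepted taxon within a checklist. It assumes that the pro parte synonym in Checklist B belongs to Caesalpinia bonduc, and the tuple layout and function names are invented.

```python
# Each entry: (name as listed, status, accepted name it belongs to).
checklist_a = [("Caesalpinia crista L.", "accepted", "Caesalpinia crista L.")]
checklist_b = [
    ("Caesalpinia crista L.", "accepted", "Caesalpinia crista L."),
    ("Caesalpinia bonduc (L.) Roxb.", "accepted", "Caesalpinia bonduc (L.) Roxb."),
    # Assumption: the p.p. synonym is attached to C. bonduc.
    ("Caesalpinia crista L., p.p.", "synonym", "Caesalpinia bonduc (L.) Roxb."),
]

def base_name(name):
    """Drop a trailing pro parte marker so both uses of a name compare equal."""
    return name.replace(", p.p.", "").strip()

def split_concepts(checklist):
    """Return names that point at more than one accepted taxon in a checklist."""
    targets = {}
    for name, _status, accepted in checklist:
        targets.setdefault(base_name(name), set()).add(accepted)
    return {n: sorted(t) for n, t in targets.items() if len(t) > 1}
```

Checklist A gives no such names, while Checklist B reports Caesalpinia crista L. as shared between two accepted taxa, so the same name cannot safely be treated as the same concept in both checklists.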
Example 2
In the case of the species Cytisus scoparius:
Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)
Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)
Treatment A recognises one genus, Cytisus.
Treatment B recognises two genera, Cytisus and Sarothamnus.

Treatment A            Treatment B
Genus Cytisus          Genus Sarothamnus
Cytisus scoparius      Sarothamnus scoparius
Cytisus striatus       Sarothamnus striatus
                       Genus Cytisus
Cytisus multiflorus    Cytisus multiflorus
Cytisus praecox        Cytisus praecox
Cross-mapping
So how can we make intelligent links work?
One way to make links appear more intelligent is to create and maintain “cross-maps” which describe how one or more taxa in one resource (such as the Species 2000 index) relate to one or more taxa in another resource.
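A cross-map could be held as a list of relationships between names in two resources. The sketch below uses the two treatments from Example 2; the class, the relationship labels (`same_as`, `includes`) and the lookup function are invented for illustration and are not LITCHI's vocabulary or format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrossMapEntry:
    taxon_a: str       # name as used in resource A
    relationship: str  # e.g. "same_as", "includes" (labels are assumptions)
    taxon_b: str       # name as used in resource B

cross_map = [
    CrossMapEntry("Cytisus scoparius", "same_as", "Sarothamnus scoparius"),
    CrossMapEntry("Cytisus striatus", "same_as", "Sarothamnus striatus"),
    CrossMapEntry("Cytisus multiflorus", "same_as", "Cytisus multiflorus"),
    CrossMapEntry("Cytisus praecox", "same_as", "Cytisus praecox"),
    # One genus in A corresponds to two genera in B.
    CrossMapEntry("Cytisus", "includes", "Sarothamnus"),
    CrossMapEntry("Cytisus", "includes", "Cytisus"),
]

def targets_in_b(name_a):
    """Return every (relationship, name in B) recorded for a name in A."""
    return [(e.relationship, e.taxon_b) for e in cross_map if e.taxon_a == name_a]
```

A link from Treatment A's Cytisus scoparius can then be routed to Sarothamnus scoparius in Treatment B even though the name strings differ.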
Litchi 2.2 in use
[Diagram labels: Checklist A and Checklist B (read into system); Rules; Heuristics; Taxonomic intelligence (conflict detection, inference of concept relationships); Concept relationships; Cross-map (write).]
More about cross-maps
They may be created and maintained
– manually by experts
– automatically or semi-automatically by LITCHI (as above)
– by monitoring the behaviour of users following species links
– by analysing data sets describing the taxa, when sufficient such data is available, using the usual species taxonomy tools (phenetic and cladistic analyses)
More about cross-maps
They may be held
– by individual GSDs, describing how to link their species to selected related resources, as ILDIS has done for linking to the Northern Eurasia (aka USSR) database
– by Species 2000 as a repository and service to facilitate intelligent species links
– by an “intelligent linking engine”, as planned for Species 2000 Europa to link its two hubs
A dream
A system for managing intelligent species links using taxonomic concept relationships would maximise the potential of the plethora of species-based catalogues, indexes and rich species resources currently being assembled all over the world
Perhaps on the Web, as with the current Spice/Species 2000 prototype
Or ...