http://knb.ecoinformatics.org http://seek.ecoinformatics.org Science Environment for Ecological Knowledge: EcoGrid Matthew B. Jones National Center for Ecological Analysis and Synthesis University of California Santa Barbara

Post on 31-Dec-2015

Science Environment for Ecological Knowledge

Research Objectives

Access to ecological, environmental, and biodiversity data
  Enable data sharing & re-use
  Enhance data discovery at global scales

Scalable analysis and synthesis
  Taxonomic, spatial, temporal, and conceptual integration of data
  Address data heterogeneity issues
  Enable communication and collaboration for analysis
  Enable re-use of analytical components

Collaborators
  NCEAS, UNM, SDSC, U Kansas
  Vermont, Napier, ASU, UNC

SEEK Components

Science Environment for Ecological Knowledge

Kepler
  Modeling scientific workflows

EcoGrid
  Making diverse environmental data systems interoperate

Semantic Mediation System
  “Smart” data discovery and integration

Knowledge Representation WG
Taxon WG
BEAM WG
Education, Outreach, Training

Scientific Workflows

Model the way scientists work with their data now
  Mentally coordinate export and import of data among software systems

Workflows emphasize data flow

Output generation includes creating appropriate metadata
  The analysis workflow itself becomes metadata
  The workflow describes the data lineage as it has been transformed
  Derived data sets can be stored in EcoGrid with provenance

Query EcoGrid to find data

Archive output to EcoGrid with workflow metadata

Kepler: scientific workflows

• Collaborative effort of SEEK, SciDAC/SDM, GEON, Ptolemy Project

Kepler understands EML data

Kepler: molecular biology example

SEEK EcoGrid

Goal: allow diverse environmental data systems to interoperate

Hides complexity of underlying systems using lightweight interfaces
  We have standardized data via EML, need standard APIs
  Integrate diverse data networks from ecology, biodiversity, and environmental sciences

Data systems
  Any system can implement these interfaces
  Prototyping using: Metacat, SRB, DiGIR, Xanthoria, etc.

Supports multiple metadata standards
  EML, Darwin Core as foci

EcoGrid client interactions

Modes of interaction
  Client-server
  Fully distributed
  Peer-to-peer

EcoGrid Registry
  Node discovery
  Service discovery

Aggregation services
  Centralized access
  Reliability
  Data preservation

EcoGrid Query Interfaces

Provides a mechanism for search and retrieval of metadata and federated data

Supports third party interaction with search results – forwarding of result set identifiers to another service instance for retrieval

Different levels of compliance
  Low barrier for participation
  Bulk of data will be accessible through Type I

Query Interfaces Implemented

Initial prototype to support query and retrieval from:
  Storage Resource Broker (SRB)
  Metacat
  Distributed Generic Information Retrieval (DiGIR)
  Xanthoria

Encourage additional experimentation with and feedback based on other system implementations

EcoGrid Query Level I

Basic, entry level exposure of data and metadata for EcoGrid and SEEK

Response contains data – intended for direct communications rather than 3rd party indirection

ResultsetType query(SessionID,QueryType)

byte[] get(SessionID,objectID)
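
The two Level I operations can be sketched with a toy in-memory node. This is only an illustration under stated assumptions: the class name `InMemoryLevelOne`, the dictionary form of the query, the `ResultSet` container, and the sample records are all hypothetical. The slides give only the two signatures.

```python
from dataclasses import dataclass, field


@dataclass
class ResultSet:
    """Hypothetical stand-in for ResultsetType: records are returned inline."""
    records: list = field(default_factory=list)


class InMemoryLevelOne:
    """Toy Level I node: query() returns data directly, get() returns bytes."""

    def __init__(self, records, objects):
        self._records = records   # list of dicts, one per metadata record
        self._objects = objects   # objectID -> raw bytes

    def query(self, session_id, query):
        # Assumption: a query is a dict of concept -> required value.
        matches = [r for r in self._records
                   if all(r.get(k) == v for k, v in query.items())]
        return ResultSet(matches)

    def get(self, session_id, object_id):
        return self._objects[object_id]


# Invented sample data, for illustration only.
node = InMemoryLevelOne(
    records=[{"id": "mvz1", "Genus": "Peromyscus"},
             {"id": "mvz2", "Genus": "Microtus"}],
    objects={"mvz1": b"raw bytes for mvz1"},
)
hits = node.query("session-1", {"Genus": "Peromyscus"})
payload = node.get("session-1", "mvz1")
```

The response carries the matching records themselves, which is what makes Level I suited to direct client-to-node use rather than hand-off to a third party.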


Query Conditions

Language independent representation of a query structure

Transformed into the appropriate native language of the data store

Example:
<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>
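
One way the condition tree could be transformed into a native query language is sketched below for a SQL back end. The operator mapping, the handling of the literal `NULL`, and the function name `to_sql` are assumptions made for illustration; the slides do not specify how each data store performs the translation.

```python
import xml.etree.ElementTree as ET

CONDITIONS = """
<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>
"""

# Assumed mapping from query operators to SQL operators.
SQL_OPERATORS = {"LIKE": "LIKE", "EQUALS": "=", "NOT EQUALS": "<>"}


def to_sql(node, params):
    """Recursively render a condition tree as a parameterized WHERE clause."""
    if node.tag in ("AND", "OR"):
        parts = [to_sql(child, params) for child in node]
        return "(" + f" {node.tag} ".join(parts) + ")"
    concept = node.get("concept")
    if not concept.isidentifier():
        raise ValueError(f"unsafe concept name: {concept!r}")
    operator = node.get("operator")
    value = (node.text or "").strip()
    if value == "NULL":
        # Assumption: the text NULL means SQL NULL, which needs IS / IS NOT.
        if operator == "NOT EQUALS":
            return f"{concept} IS NOT NULL"
        return f"{concept} IS NULL"
    params.append(value)
    return f"{concept} {SQL_OPERATORS[operator]} ?"


params = []
where = to_sql(ET.fromstring(CONDITIONS), params)
```

Values are collected as bind parameters rather than spliced into the string, so the same tree could be handed to any DB-API driver that uses `?` placeholders.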


Specifying the Resultset

Specify the list of concepts (fields) to be returned in the resultset

Simple paths used to identify elements or document subtrees

Effectively flattens the structure of the records, but allows generic representation

Example:
<returnfield>/ScientificName</returnfield>
<returnfield>/Longitude</returnfield>
<returnfield>/Latitude</returnfield>
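
The flattening effect of return fields can be illustrated with a small sketch. The nested record, the `/Location/...` paths, and the `flatten` helper are hypothetical and chosen only to show a nested structure being reduced to a flat row; for simplicity the sketch returns leaf text only, not whole subtrees.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested source record, for illustration only.
RECORD = """
<record>
  <ScientificName>Peromyscus leucopus</ScientificName>
  <Location>
    <Longitude>100</Longitude>
    <Latitude>200</Latitude>
  </Location>
</record>
"""


def flatten(record, return_fields):
    """Resolve each simple path against the record and return one flat row."""
    row = {}
    for path in return_fields:
        element = record.find(path.lstrip("/"))   # path relative to the record
        name = path.rsplit("/", 1)[-1]            # last step becomes the field name
        row[name] = element.text if element is not None else None
    return row


row = flatten(
    ET.fromstring(RECORD),
    ["/ScientificName", "/Location/Longitude", "/Location/Latitude"],
)
```

The nesting under `Location` disappears in `row`, which is the trade-off the slide describes: structure is lost, but every source can be represented in the same generic form.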


Full Query Example

<egq:query queryId="query-digir.1.1" system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
  <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
  <returnfield>/ScientificName</returnfield>
  <returnfield>/Longitude</returnfield>
  <returnfield>/Latitude</returnfield>
  <title>Peromyscus genus query</title>
  <condition operator="LIKE" concept="Genus">Peromyscus</condition>
</egq:query>
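
A receiving node would first need to pull the parts of such a query document apart. The sketch below does this with Python's standard `xml.etree.ElementTree`; the `parsed` dictionary layout is an assumption for illustration, and the `xsi` schema attributes are left out of the embedded copy to keep it short.

```python
import xml.etree.ElementTree as ET

QUERY = """<egq:query queryId="query-digir.1.1" system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1">
  <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
  <returnfield>/ScientificName</returnfield>
  <returnfield>/Longitude</returnfield>
  <returnfield>/Latitude</returnfield>
  <title>Peromyscus genus query</title>
  <condition operator="LIKE" concept="Genus">Peromyscus</condition>
</egq:query>"""

root = ET.fromstring(QUERY)

# Only the root element is namespace-qualified; its children are unprefixed.
parsed = {
    "query_id": root.get("queryId"),
    "title": root.findtext("title"),
    "namespaces": {ns.get("prefix"): ns.text for ns in root.findall("namespace")},
    "return_fields": [f.text for f in root.findall("returnfield")],
    "conditions": [(c.get("concept"), c.get("operator"), c.text)
                   for c in root.findall("condition")],
}
```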


Query Result Set Structure

<rs:resultset resultsetId="foo.1.1" system="urn:not://sure/what/to/put/here"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
  <resultsetMetadata>
    <sendTime>2003-05-02T16:45:50-09:00</sendTime>
    <startRecord>1</startRecord>
    <endRecord>2</endRecord>
    <recordCount>2</recordCount>
    <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
    <system id="1">http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2</system>
  </resultsetMetadata>
  <record number="1" system="1" identifier="mvz1">
    <returnField name="ScientificName">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnField>
    <returnField name="Longitude">100</returnField>
    <returnField name="Latitude">200</returnField>
  </record>
  …
</rs:resultset>
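
A client could turn this structure into flat rows as sketched below. The helper `parse_resultset` and the `identifier` and `source` keys are hypothetical; the embedded XML is a trimmed copy of the example containing only the one record shown on the slide.

```python
import xml.etree.ElementTree as ET

RESULTSET = """<rs:resultset resultsetId="foo.1.1"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
  <resultsetMetadata>
    <recordCount>2</recordCount>
    <system id="1">http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2</system>
  </resultsetMetadata>
  <record number="1" system="1" identifier="mvz1">
    <returnField name="ScientificName">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnField>
    <returnField name="Longitude">100</returnField>
    <returnField name="Latitude">200</returnField>
  </record>
</rs:resultset>"""


def parse_resultset(xml_text):
    """Return one dict per <record>, with its source system URL resolved."""
    root = ET.fromstring(xml_text)
    # Map each system id declared in the metadata to its URL.
    systems = {s.get("id"): s.text
               for s in root.findall("resultsetMetadata/system")}
    rows = []
    for record in root.findall("record"):
        row = {f.get("name"): f.text for f in record.findall("returnField")}
        row["identifier"] = record.get("identifier")
        row["source"] = systems.get(record.get("system"))
        rows.append(row)
    return rows


rows = parse_resultset(RESULTSET)
```

Resolving the `system` attribute against the metadata block keeps track of which federated source each record came from.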


EcoGrid Query Level II

More detailed handling of results
  Uses RSIDs to identify resultsets: handles that can be passed to a third party

RSID search(SessionID,query)

Resultset retrieve(SessionID,RSID,start,numrecs)

query decodeResultsetIdentifier(SessionID,RSID)

statusinfo getResultStatus(SessionID)

int transfer(SessionID,sourceURL,destURL,ObjectID)
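
The RSID hand-off can be sketched as follows. This toy covers only `search`, `retrieve`, and `decodeResultsetIdentifier`; the class name, the dictionary query form, the use of `uuid` for RSIDs, and 1-based `start` indexing are assumptions. `getResultStatus` and `transfer` are omitted.

```python
import uuid


class InMemoryLevelTwo:
    """Toy Level II node: search() returns a handle, not the data itself."""

    def __init__(self, records):
        self._records = records
        self._resultsets = {}   # RSID -> (query, matching records)

    def search(self, session_id, query):
        # Assumption: a query is a dict of concept -> required value.
        matches = [r for r in self._records
                   if all(r.get(k) == v for k, v in query.items())]
        rsid = uuid.uuid4().hex
        self._resultsets[rsid] = (query, matches)
        return rsid

    def retrieve(self, session_id, rsid, start, numrecs):
        # Assumption: start is 1-based, like startRecord in the result set.
        _, matches = self._resultsets[rsid]
        return matches[start - 1:start - 1 + numrecs]

    def decode_resultset_identifier(self, session_id, rsid):
        query, _ = self._resultsets[rsid]
        return query


# Invented sample data: five records of the same genus.
node = InMemoryLevelTwo(
    [{"id": f"rec{i}", "Genus": "Peromyscus"} for i in range(1, 6)]
)
rsid = node.search("session-a", {"Genus": "Peromyscus"})
# A different session (a third party) pages through the same result set.
page = node.retrieve("session-b", rsid, start=3, numrecs=2)
```

Because the RSID is just a string, the searching client can forward it to another service, which then retrieves only the slice it needs.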

EcoGrid Write

Used to push data back to sources (e.g. publishing EML documents)

Depends on the availability of an authentication and access control system

put(sessionID, objectID, object, type)

delete(sessionID,objectID)
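
The dependence on access control can be made concrete with a small sketch. The permission table keyed by session ID, the class name `WritableStore`, and the sample EML payload are hypothetical; the slides say only that an authentication and access control system must be available.

```python
class WritableStore:
    """Toy write interface that checks a permission table before each change."""

    def __init__(self, permissions):
        self._permissions = permissions   # session_id -> set of allowed actions
        self._objects = {}                # objectID -> (type, object)

    def _authorize(self, session_id, action):
        if action not in self._permissions.get(session_id, set()):
            raise PermissionError(f"{session_id} may not {action}")

    def put(self, session_id, object_id, obj, type_):
        self._authorize(session_id, "put")
        self._objects[object_id] = (type_, obj)

    def delete(self, session_id, object_id):
        self._authorize(session_id, "delete")
        del self._objects[object_id]

    def __contains__(self, object_id):
        return object_id in self._objects


store = WritableStore({"editor": {"put", "delete"}, "reader": set()})
store.put("editor", "doc.1.1", b"<eml>...</eml>", "EML")

try:
    store.delete("reader", "doc.1.1")
    reader_blocked = False
except PermissionError:
    reader_blocked = True
```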

Data Instance Query

New requirement to support direct query and retrieval with arbitrary data sets

Generally no common schemas between different instances

Could either
  Push data instance to service that can query object (e.g. the SRB)
  Implement interface at the data instance location

Simple JDBC / SQL interface?

dbSchema getDataSchema(sessionID,objectID)

dbResultset search(sessionID,objectID,SQL)
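
The "simple SQL interface" idea can be tried out with Python's built-in `sqlite3` module standing in for JDBC. Everything here is an assumption made for illustration: the class `DataInstanceService`, the `load` helper, the table name, and the sample rows. The slide itself poses the interface as an open question.

```python
import sqlite3


class DataInstanceService:
    """Toy service: each data instance is loaded into its own SQLite database."""

    def __init__(self):
        self._instances = {}   # objectID -> sqlite3 connection

    def load(self, object_id, table, columns, rows):
        # Hypothetical helper: table and column names are trusted here.
        conn = sqlite3.connect(":memory:")
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        self._instances[object_id] = conn

    def get_data_schema(self, session_id, object_id):
        conn = self._instances[object_id]
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        # PRAGMA table_info yields (cid, name, type, ...); keep the name.
        return {t: [col[1] for col in conn.execute(f"PRAGMA table_info({t})")]
                for t in tables}

    def search(self, session_id, object_id, sql):
        return self._instances[object_id].execute(sql).fetchall()


service = DataInstanceService()
# Invented sample data, for illustration only.
service.load("obj-1", "obs", ["species", "abundance"],
             [("Peromyscus leucopus", 3), ("Peromyscus maniculatus", 5)])
schema = service.get_data_schema("s1", "obj-1")
result = service.search("s1", "obj-1",
                        "SELECT species FROM obs WHERE abundance > 4")
```

Since instances share no common schema, the client calls `get_data_schema` first and only then writes SQL against the names it finds. Accepting raw SQL from clients would need sandboxing in any real deployment.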

Building the EcoGrid

Figure labels: AND, LUQ, NTL, VCR, HBR

Node types: Metacat node, SRB node, DiGIR node, VegBank node, Xanthoria node, Legacy system

LTER Network (24)
Natural History Collections (>> 100)
Organization of Biological Field Stations (180)
UC Natural Reserve System (36)
Partnership for Interdisciplinary Studies of Coastal Oceans (4)
Multi-agency Rocky Intertidal Network (60)

Metadata-driven analysis cycle

Acknowledgements

This material is based upon work supported by:

The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.

The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

The Andrew W. Mellon Foundation.

PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)