Toward a distributed information system for marine biology and limnology
(aka PAKT project)
Presenting: Karen Stocks, Amarnath Gupta, Chris Condit
Peter Arzberger (PI), Paul Brewin, Li Chen, Heasoo Hwang, Yannis Papakonstantinou, Xufei Qian, Simone Santini, Reza Wahadj, Ilya Zaslavsky
+ Rutgers University, University of Auckland, U. Wisconsin
Funding from the Gordon and Betty Moore Foundation
The Big Challenge: Integrating distributed and heterogeneous data resources to advance marine ecology and limnology
Opening the “Data Closet”
Lakes Testbed Marine Testbed
Information Technology Development
Seamounts
OBIS
CalCOFI
Seamounts (undersea mountains)
Seamounts are
- biologically unique
- heavily fished habitats
SeamountsOnline: Centralized relational database
Seamount Science Example
Can seamount diversity be predicted from seamount depth, distance from continental margin, geological age, surface productivity, etc.? Does endemism follow the predictions of Island Biogeography Theory?
Seamount Challenges
Combine multiple, distributed datatypes:
• relational species distributions data in SeamountsOnline (seamounts.sdsc.edu)
• bathymetry data and seamount morphology data in the Seamount Catalog (earthref.org)
• raster physical data from World Ocean Atlas, satellite imagery, etc.
Users
Research: CenSeam
– Data Analysis Working Group
– Expedition Planning
Management
– United Nations: IUCN-sponsored workshop on deepwater corals on seamounts
– International Seabed Authority workshop
Seamount Research Coordination Network, NSF
OBIS: Ocean Biogeographic Information System (www.iobis.org)
OBIS
• The Ocean Biogeographic Information System is an international federation of 50+ distributed data providers (7 million data records) sharing species distribution data
• OBIS has a well-established community (secretariat funding, 10 regional node centers, etc.) but limited resources to build infrastructure
• The current DiGIR client-server system allows ~70 fields of data to be transferred (an extended Darwin Core) (www.iobis.org)
OBIS Science Examples
• Evaluating biogeographic provinces with real data
• Predicting the spread of invasive species
• Identifying diversity hotspots/siting marine protected areas
• Evaluating our state of knowledge
OBIS Challenges
• integrate OBIS biological data with emerging physical data resources
• hierarchical data
• allow habitat-specific data exploration
• extend query functionality (e.g. to complex spatial queries)
• capture more data when registering new data providers/serve specific communities better
Integrate OBIS biological data with emerging physical data resources
CalCOFI
- CalCOFI (the California Cooperative Ocean Fisheries Investigations) is a 50+ year monitoring study off Southern California
- Four times per year, a regular grid of stations is sampled for larval fish, zooplankton, and physical ocean parameters
CalCOFI Science Examples
• Determining scales of variability in biological components in space and time
• Correlating fluctuations in larval fish abundance with physical parameters over time.
• Developing ecosystem models for habitat-based management
Technical Challenges
• Multiple data types: relational, hierarchical, raster, point, voxel, etc.
• Geospatial data operations
• Ontologies
• Higher knowledge sources
Integrating Physical and Biological Oceanographic Data
The Information Systems Viewpoint
What are we integrating and why?
• The Science Goals
  – Explain biodiversity
    • Of a species
    • Of any taxonomic grouping of species
    • Around a habitat
    • By correlating distribution of a taxonomic group with the spatial (temporal) distribution of physical phenomena
    • By creating groupings of physical and biological parameters that correlate with the distribution and abundance of species
      – Perhaps for specific habitats
  – Create predictive models
    • Given physical parameters or habitat characteristics, predict species distribution and abundance
    • Given species distribution, predict physical parameters
    • …
A Conceptual Framework for a Global Biodiversity Schema
[Schema diagram. Entity labels: Observations; Organisms; Location; Environmental Parameters; Samples; Collection Method; Collection System; Collection Target; Studies; Contributions; Organism Classes; Location Classes; Individual Organism; Organism Properties; Environmental Region Properties; Time/Frequency; Referred Object; Generic Locational Reference of Organisms; Generic Environmental Reference of Organisms; Environmental Ontology 1 through k; Organism-Class Existence; Organism-Class Abundance; Organism-Class Relative Abundance; Point-in-space; Surface-in-space; Spatial-Volume (solid, annular). Relationship labels: taken-from; collected-for; collected-from; occur-at; observed-at; associated-with; partial-mapping; spatial relationships; enviro-location relationships; intra-class relationships (parameterized).]
A Conceptual Framework for a Global Physical Oceanography Schema
[Schema diagram. Labels: Measurement (data/function); parameters; spatial collection pattern; dense; sparse; point; surface; volume; coverage; time/frequency; collection metadata; value; prob.; scalar; vector; resolution; Point-in-space; Surface-in-space; Spatial-Volume (solid, annular); Referred Object; Phenomena; name; properties; view-definition.]
What are we integrating and why?
• Data elements
  – The central elements
    • Distribution of biological and physical variables
      – Point distributions
      – Field distributions
      – Object-bound distributions
    • Grouping of biological and physical variables
      – Hierarchical groupings
      – Hypergraph groupings
  – Additional elements
    • Geographic boundaries
    • Details of observations
    • Details of habitats and objects therein
    • …
Point, Field & Object-bound Distributions
• Distributions
  – Point distributions are sparse
    • Continuous distributions
  – Field distributions are dense
    • Often discrete
  – Object-bound distributions are sparse
    • Around objects
    • Associated with other object-related properties
• Modeling field distributions as arrays
  – Can be modeled using nested-relational calculus (algebra) + indices + counting (Libkin 95)
    • Special access functions can be useful (Marathe 98)
  – Non-uniform field (NUF) distributions: aligned-arrays with nulls
    • NRC + indices + counting + list operations
    • Dimension transformation + interpolation
  – Containment vs. overlap semantics
We have yet to show the relationship between Map Algebra and Array Algebra
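One possible reading of "aligned arrays with nulls" and of "containment vs. overlap semantics" can be sketched in a few lines of Python. This is an illustration only, not the project's implementation: the grid layout, the use of `None` for null cells, and the names `cell_bounds` and `select_cells` are assumptions introduced here, and the grid values are made up.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[float]]]          # None marks a null (unsampled) cell
Box = Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max)


def cell_bounds(row: int, col: int, origin: Tuple[float, float], size: float) -> Box:
    """Return the bounding box of the grid cell at (row, col)."""
    x0, y0 = origin
    return (x0 + col * size, y0 + row * size,
            x0 + (col + 1) * size, y0 + (row + 1) * size)


def select_cells(grid: Grid, origin: Tuple[float, float], size: float,
                 box: Box, semantics: str) -> List[Tuple[int, int]]:
    """Select non-null cells against a query box.

    "containment": the cell must lie entirely inside the box.
    "overlap": the cell must share some area with the box.
    """
    bx0, by0, bx1, by1 = box
    hits = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is None:               # nulls never match
                continue
            x0, y0, x1, y1 = cell_bounds(r, c, origin, size)
            if semantics == "containment":
                ok = bx0 <= x0 and x1 <= bx1 and by0 <= y0 and y1 <= by1
            elif semantics == "overlap":
                ok = x0 < bx1 and bx0 < x1 and y0 < by1 and by0 < y1
            else:
                raise ValueError("semantics must be 'containment' or 'overlap'")
            if ok:
                hits.append((r, c))
    return hits


# Made-up 3x3 field with one null cell in the middle.
grid = [[1.0, 2.0, 3.0],
        [4.0, None, 6.0],
        [7.0, 8.0, 9.0]]
query = (0.5, 0.5, 3.0, 3.0)
contained = select_cells(grid, (0.0, 0.0), 1.0, query, "containment")
overlapping = select_cells(grid, (0.0, 0.0), 1.0, query, "overlap")
```

With this query box the two semantics return different cell sets, which is why a mediator would need to know which one a source applies.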
Integration of Point with NUF Distribution Data Sources
• Some issues
  – Value AT POINT queries
  – Neighborhood queries
• Two possible “join” semantics
  – “snap” points to array-cells
  – “regrid” arrays to point resolution with interpolation
• Planning the joins in a mediator
  – Scenario
    • A prior subquery selects a set of points P
    • Another prior subquery selects a set of array cells by condition C
    • Find value of function F for the points at the corresponding cells
  – Solutions
    • Get P and C-result at the mediator and compute F at the mediator
    • Collect the set P at the mediator, call function F on array with condition C for each element of P
    • Send an array indexing function to point source and return indexes, and perform an indexed selection from array source
      – Not implemented yet
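The two "join" semantics can be contrasted with a small Python sketch of a value-at-point query. It assumes a regular 2D grid whose values sit at cell centers and uses bilinear interpolation as one possible form of regridding. The function names (`snap`, `value_snap`, `value_regrid`) and the sample values are hypothetical and not taken from the project.

```python
import math
from typing import List, Optional, Tuple

Grid = List[List[Optional[float]]]  # None marks a null cell


def snap(point: Tuple[float, float], origin: Tuple[float, float],
         size: float) -> Tuple[int, int]:
    """Return the (row, col) index of the cell that contains the point."""
    x, y = point
    x0, y0 = origin
    return (math.floor((y - y0) / size), math.floor((x - x0) / size))


def value_snap(grid: Grid, point, origin, size) -> Optional[float]:
    """'Snap' semantics: the point takes the value of its containing cell."""
    r, c = snap(point, origin, size)
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
        return grid[r][c]
    return None


def value_regrid(grid: Grid, point, origin, size) -> Optional[float]:
    """'Regrid' semantics: bilinear interpolation between cell centers.

    Returns None if a needed neighbor is null or outside the grid.
    """
    x, y = point
    x0, y0 = origin
    fx = (x - x0) / size - 0.5      # position in units of cells,
    fy = (y - y0) / size - 0.5      # measured from the first cell center
    c0, r0 = math.floor(fx), math.floor(fy)
    tx, ty = fx - c0, fy - r0
    total = 0.0
    for dr, wr in ((0, 1.0 - ty), (1, ty)):
        for dc, wc in ((0, 1.0 - tx), (1, tx)):
            weight = wr * wc
            if weight == 0.0:       # neighbor does not contribute
                continue
            r, c = r0 + dr, c0 + dc
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            if not inside or grid[r][c] is None:
                return None
            total += weight * grid[r][c]
    return total


# Made-up 2x2 field; the point lies on the corner shared by all four cells.
field = [[10.0, 20.0],
         [30.0, 40.0]]
p = (1.0, 1.0)
snapped = value_snap(field, p, (0.0, 0.0), 1.0)      # value of cell (1, 1)
regridded = value_regrid(field, p, (0.0, 0.0), 1.0)  # mean of the four centers
```

The same point yields different answers under the two semantics, so the choice has to be part of what a source registers with the mediator.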
The General Integration Problem
• Sources need to export different data models
  – Different algebras
  – Semantics of structures
  – Semantics of values
  – Constraints among values and domains
• How do we register this information?
• What combined algebra does the mediator support?
• How do we control addition of newer sources?
• How does this work in the GAV or GLAV integration framework?
• How do we include type and structure transformations, and domain-specific value-association as part of the mediation process?
The Current Integration Framework
• Some Decisions
  – All data are “relationalized”
  – Algebraic operations are implemented on top of relational sources as functions
  – Functions are modeled in the BIRN mediator as relations with binding patterns
  – Popular native formats like OpenDAP are semantically too heterogeneous and have poor query capabilities
    • Value-based queries are disallowed
    • We need to augment the registration mechanism to (semi-automatically) ingest all metadata
    • We will ingest the data and store it relationally in a network-accessible relational system
  – Will consider the problems of adding vector-data and unaligned array data as a next step
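To make "functions modeled as relations with binding patterns" concrete, here is a minimal Python sketch of the general idea. It is not the BIRN mediator's API: the class `FunctionRelation`, the pattern string of `b` (bound, must be supplied) and `f` (free, returned) flags, and the toy lookup table are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Sequence


class FunctionRelation:
    """A function exposed as a relation with a binding pattern.

    The pattern has one flag per attribute: 'b' means the attribute must
    be bound by the caller (an input), 'f' means it is free (an output).
    """

    def __init__(self, name: str, attributes: Sequence[str], pattern: str,
                 func: Callable[..., Iterable[tuple]]):
        if len(attributes) != len(pattern) or set(pattern) - {"b", "f"}:
            raise ValueError("pattern needs one 'b' or 'f' per attribute")
        self.name = name
        self.bound = tuple(a for a, p in zip(attributes, pattern) if p == "b")
        self.free = tuple(a for a, p in zip(attributes, pattern) if p == "f")
        self.func = func

    def lookup(self, **bindings) -> List[Dict[str, object]]:
        """Call the function if every bound attribute has a value."""
        missing = [a for a in self.bound if a not in bindings]
        if missing:
            raise ValueError(f"{self.name}: unbound inputs {missing}")
        inputs = [bindings[a] for a in self.bound]
        rows = []
        for outputs in self.func(*inputs):
            row = dict(zip(self.bound, inputs))
            row.update(zip(self.free, outputs))
            rows.append(row)
        return rows


# Toy data, invented for illustration only.
TOY_VALUES = {((0, 1), "temperature"): 12.5}


def value_at(cell, parameter):
    """Return the stored value for (cell, parameter) as a list of 1-tuples."""
    key = (cell, parameter)
    return [(TOY_VALUES[key],)] if key in TOY_VALUES else []


relation = FunctionRelation("value_at", ("cell", "parameter", "value"),
                            "bbf", value_at)
rows = relation.lookup(cell=(0, 1), parameter="temperature")
```

Under the pattern `bbf` the relation can answer "what is the value at this cell?" but not "which cells have this value?", since `cell` must always be supplied.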
The Demonstration
• The global schema
The marked tables are augmented with physical parameters from the World Ocean Atlas, over two different grids
Technology Overview
• Microsoft ASP.NET
• Asynchronous Javascript and XML (AJAX)
• Google Maps
Google Maps
• Pros
  – Intuitive U.I.
  – Bathymetry
  – Simple Javascript API
  – Speed
  – Cost
• Cons
  – Google dependent
  – Data volume limitation
• Alternatives Under Consideration
  – ESRI ArcGIS Server
  – 3D Client (ArcGlobe, GoogleEarth, WorldWind)
  – Some combination
Data Sources
• SeamountsOnline
  – Biological Oceanography Information
• World Ocean Atlas
  – Physical Oceanography Information
• Biological and Physical Combination
Next Steps
• Interface Refinement
• Apply learning to OBIS
• Questions?
Contact Information
• Amarnath Gupta
• Karen Stocks
• Chris Condit