Semantic Metadata for Scientific Data Access and Management
Richard M. Keller, Ph.D.
Group Lead for Information Sharing & Integration
Intelligent Systems Division
NASA Ames Research Center
http://sciencedesk.arc.nasa.gov/scidesk/
February 17, 2005
ROSES Workshop
Focus of Work
• Scientific data management, not data analysis
• Computational infrastructure related to:
• storing
• locating
• searching
• integrating
• sharing
scientific data
Specific Problems
• Integrating Heterogeneous Scientific Data from Multiple Sources
• Searching/finding Relevant Scientific Data
• Organizing/indexing Data for Rapid, Intuitive Access
Culprit: Inadequate Metadata
• Metadata is typically limited to essentials only (e.g. data format, instrument, date)
– inadequate for extensive indexing, precise searching
• Each data repository defines its own metadata, using its own terminology and data dictionary
– difficult to search across repositories
– difficult to integrate and combine datasets
• No common frame of reference for cross-repository comparison
Common Approach
To facilitate storage, retrieval, integration, and comprehension of scientific data:
capture the
semantic metadata
that provides a rich context for each data product
What is “semantic metadata”?
Semantic Metadata:
information relating to the context in which the scientific data are generated and used
– how?
– when?
– where?
– why?
– who?
Collection of microbial mats in the field
Early Microbial Ecosystems Investigation
Trace gas production and consumption under
“Early Earth” conditions
Greenhouse Incubator
Microbial mat (algae)
Detailed studies of mat biogeochemistry
• monitoring
• analysis
• experimentation
Geographically dispersed team of collaborators
B. Bebout
D. Des Marais
T. Hoehler, et al.
Code SSX
Semantic Context Surrounding Mat “4b” (“Semantic Network”)
collected-at
Spring Beach
collected-by
Brad Bebout
stored-in
Greenhouse
has-measurement
measured-with
O2 Microsensor
O2 Concentration
HBC-2 Microbial culture
Culture prep B notes for Lee
has-culture
cultivated-by
Culture recipe
Mary Hogan
has-recipe
imaged-with
Electron Microscope
has-image
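The labeled links on this slide can be read as subject-link-object triples. The following is a minimal sketch in Python, not the system's actual representation. The flattened diagram does not show which resources each link connects, so the endpoints below are assumptions (for example, that measured-with connects the O2 concentration measurement to the O2 microsensor), and the function name `context_of` is hypothetical.

```python
# Hypothetical triples (subject, link, object) for mat sample "4b".
# Link endpoints are assumed from the slide labels, not taken from the system.
TRIPLES = [
    ("Mat 4b", "collected-at", "Spring Beach"),
    ("Mat 4b", "collected-by", "Brad Bebout"),
    ("Mat 4b", "stored-in", "Greenhouse"),
    ("Mat 4b", "has-measurement", "O2 Concentration"),
    ("O2 Concentration", "measured-with", "O2 Microsensor"),
    ("Mat 4b", "has-culture", "HBC-2 Microbial culture"),
    ("HBC-2 Microbial culture", "cultivated-by", "Mary Hogan"),
    ("HBC-2 Microbial culture", "has-recipe", "Culture recipe"),
]


def context_of(resource, triples=TRIPLES):
    """Return {link: [objects]} for every link leaving `resource`."""
    context = {}
    for subject, link, obj in triples:
        if subject == resource:
            context.setdefault(link, []).append(obj)
    return context


# The immediate semantic context of the sample: where, who, how.
mat_context = context_of("Mat 4b")
```

Storing the context as triples keeps the how/when/where/why/who questions uniform: each one is a lookup on a link name.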
Semantic Network Structure
culture
photo
measurement
site
instrument
sample
hypothesis
• Links: relationships among resources (e.g., “measured by”, “supports hypothesis”)
• Attached files: electronic products associated with resources (e.g., datasets, images, documents)
• Attributes: properties of resources (metadata)
• Nodes: key info resources or organizational structures (describes people, places, measurements, hypotheses)
• date
• size
• format
Ontology: Specifies the types of nodes, attributes, and links defined for scientific investigation
Rules: Add/modify nodes, links, and attributes in the network
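The pieces named on this slide (nodes, attributes, attached files, links, ontology, rules) can be sketched as a small data structure. This is an illustrative Python sketch under stated assumptions, not the system's implementation: the class names, the `ONTOLOGY_LINKS` table, the derived link name `used-instrument`, and the rule itself are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    node_type: str                                      # e.g. "sample", "measurement"
    attributes: dict = field(default_factory=dict)      # metadata: date, size, format
    attached_files: list = field(default_factory=list)  # datasets, images, documents


# Hypothetical ontology fragment: link name -> (source type, target type).
ONTOLOGY_LINKS = {
    "has-measurement": ("sample", "measurement"),
    "measured-with": ("measurement", "instrument"),
}


class SemanticNetwork:
    def __init__(self):
        self.nodes = {}
        self.links = set()  # (source name, link name, target name)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_link(self, source, link, target):
        """Add a link only if the ontology allows it between these node types."""
        allowed = ONTOLOGY_LINKS[link]
        actual = (self.nodes[source].node_type, self.nodes[target].node_type)
        if actual != allowed:
            raise ValueError(f"{link} not allowed from {source} to {target}")
        self.links.add((source, link, target))


def apply_instrument_rule(net):
    """Hypothetical rule: if a sample has a measurement made with an instrument,
    add a derived 'used-instrument' link from the sample to that instrument."""
    derived = {
        (s, "used-instrument", i)
        for (s, l1, m) in net.links if l1 == "has-measurement"
        for (m2, l2, i) in net.links if l2 == "measured-with" and m2 == m
    }
    net.links |= derived
    return derived


net = SemanticNetwork()
net.add_node(Node("Mat 4b", "sample"))
net.add_node(Node("O2 Concentration", "measurement"))
net.add_node(Node("O2 Microsensor", "instrument"))
net.add_link("Mat 4b", "has-measurement", "O2 Concentration")
net.add_link("O2 Concentration", "measured-with", "O2 Microsensor")
derived = apply_instrument_rule(net)
```

In this sketch the ontology constrains what may be asserted, while the rule adds links that follow from what has already been asserted.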
DNA sequence
image
document
culture
person
sample
photographic image
SEM image
Scientific Data Collection Ontology (partial)
other
experiment
Scientific Information Nodes
project
measurement
site
equipment
camera
gas chromatograph
stub
O2 microsensor
N2 microsensor
SEM
O2 concentration
N2 concentration
spectrometer
spectrograph
chromatogram
other
other
micrograph
cultivated-from
cultivated-by
has-genetic-sequence
pictured-in
researcher
lab tech
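An ontology of node types like the partial one on this slide can be sketched as a parent map from each type to a more general type. The slide text does not preserve the hierarchy's edges, so every parent assignment below is an assumption made for illustration; the names `PARENT`, `ancestors`, and `is_a` are hypothetical.

```python
ROOT = "scientific information node"

# Assumed parent assignments; the slide lists the types but not the edges.
PARENT = {
    "image": ROOT,
    "person": ROOT,
    "equipment": ROOT,
    "measurement": ROOT,
    "photographic image": "image",
    "SEM image": "image",
    "researcher": "person",
    "lab tech": "person",
    "camera": "equipment",
    "gas chromatograph": "equipment",
    "O2 microsensor": "equipment",
    "N2 microsensor": "equipment",
    "O2 concentration": "measurement",
    "N2 concentration": "measurement",
}


def ancestors(node_type):
    """Return the chain of more general types, nearest first."""
    chain = []
    while node_type in PARENT:
        node_type = PARENT[node_type]
        chain.append(node_type)
    return chain


def is_a(node_type, general_type):
    """True if node_type equals general_type or specializes it."""
    return node_type == general_type or general_type in ancestors(node_type)


sem_is_image = is_a("SEM image", "image")
camera_is_measurement = is_a("camera", "measurement")
```

A type check of this kind lets a query for "image" also return SEM images and photographic images without listing each subtype.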
Benefits of Semantic Metadata Approach
• Semantic context provides a unifying framework for integrating data across data collections
• Sophisticated “semantic search” methods allow retrieval based on semantic relationships among data
• Intuitive data indexing, access, and organization schemes derive from semantic data models
• Formal semantic representation enables automated inference about the data
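Retrieval based on semantic relationships can be sketched as following a chain of links through the network. The Python example below is an assumption-laden illustration, not the search method the system uses: it answers "which instruments measured samples collected at a given site?" The second sample ("Mat X") and its site and instrument are hypothetical placeholders added only to show filtering, and the function names are invented for this sketch.

```python
# (subject, link, object) triples; "Mat X" and its links are hypothetical.
TRIPLES = [
    ("Mat 4b", "collected-at", "Spring Beach"),
    ("Mat 4b", "has-measurement", "O2 Concentration"),
    ("O2 Concentration", "measured-with", "O2 Microsensor"),
    ("Mat X", "collected-at", "Other Site"),
    ("Mat X", "has-measurement", "N2 Concentration"),
    ("N2 Concentration", "measured-with", "N2 Microsensor"),
]


def sources_of(triples, link, targets):
    """Resources that point to any of `targets` over `link`."""
    return {s for (s, l, o) in triples if l == link and o in targets}


def follow(triples, starts, link):
    """Resources reachable from any of `starts` over one `link`."""
    return {o for (s, l, o) in triples if l == link and s in starts}


def instruments_used_at(triples, site):
    """Samples collected at `site` -> their measurements -> the instruments."""
    samples = sources_of(triples, "collected-at", {site})
    measurements = follow(triples, samples, "has-measurement")
    return follow(triples, measurements, "measured-with")


found = instruments_used_at(TRIPLES, "Spring Beach")
```

A keyword search on "Spring Beach" would not find the microsensor, because the instrument record never mentions the site; the connection exists only through the links.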
Challenge
• Semantic metadata approach has been applied to small, PI-maintained data repositories
• Tremendous volume of earth and space science data is stored in huge, curated data repositories maintained by NASA, USGS, ESA, universities, and others.
• How to translate semantic metadata ideas to operate on the scale of large data repositories?
Seeking Collaborators!
SemanticOrganizer System (Mat Sample: Spring-M4-b)
Photo: SprM4b excised
What is ScienceOrganizer?
• A Web-based collaborative knowledge management tool for distributed teams of scientific investigators
• Facilitates information sharing, integration, correlation
• A project information repository / digital library: users upload/download heterogeneous project information products -- images, datasets, documents, and various types of scientific records (describing samples, field sites, measurements, instruments, etc.)
• Features cross-linkage: enables rapid access to interrelated information; permits linking data and observations to scientific hypotheses
• Supports inference capabilities: permits formal reasoning about the repository contents
• A “project archive” system: tracks history of project team’s fieldwork, labwork, and associated data collection activities
ScienceOrganizer Users
• ARC Microbial Ecosystems Group: field & lab science, experiments, data analysis.
• NAI Ecogenomics Focus Group: cross-discipline collaboration, data analysis.
• ARC Electron Microscopy Lab: electron microscopy image archiving, sample cataloging.
• MARTE Mission: analog Mars drilling mission, support for remote science data acquisition, storage, and access
• JSC Astrobiology Institute for the Study of Biomarkers: electron microscopy image archive, sample collection, cataloging, and storage; support for education & outreach.
• NIH/NASA Malaria Control Study: African malaria study - data collection and archiving.
• ASU/NSF Desert Microbial Survey: microbial survey; provides publicly accessible repository.
• Mobile Agents Demonstration Project: analog Mars surface exploration, support for remote science data acquisition, storage, and access
• Astrobionics Technology Integration: technology infusion program