

Post on 23-Feb-2016


Abstract

Integration of hydrologic parameter ontology in CUAHSI HydroCatalog

Ilya Zaslavsky (1), David Valentine (1), Thomas Whitenack (1), Michael Piasecki (2), Richard Hooper (3), Yoori Choi (3), David Maidment (4)

(1) San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, United States; (2) Drexel University, Philadelphia, PA, United States; (3) CUAHSI Central Office, Boston, MA, United States; (4) University of Texas, Austin, TX, United States

Nomenclatures of hydrologic parameters are large and highly fragmented. One of the key goals of the CUAHSI Hydrologic Information System (HIS) project (http://his.cuahsi.org) is to unify semantically diverse hydrologic observations and organize them so that the data can be easily discovered, accessed and analyzed in different research scenarios by different types of users. The core of the system is a hydrologic metadata catalog, which describes observational data available from multiple repositories via a standard set of CUAHSI water data web services. To address the needs of different types of users, the HydroCatalog is being designed as a multi-level information system. At the lower level, a CUAHSI HIS time series catalog contains metadata about 23.3 million time series from government and academic data sources (hiscentral.cuahsi.org). The time series representation, organized by primary data source, suits hydrologists and data managers who need to discover and access hydrologic observations in the format in which they were published, without additional interpretation or data conversion. However, such a representation does not fully address the data discovery and access needs of hydrologic analysts and modelers, who prefer to work with curated and interpreted hydrologic data collections organized by thematic categories. Therefore, an additional layer of commonly requested hydrologic data products (“hydrologic themes”) is being constructed, where a theme represents a derived spatio-temporal aggregation of observational data.

Information supporting semantics-based discovery is needed at both levels of the HydroCatalog. At the time series catalog level, the focus is on discovering observations based on a community-curated hierarchy of hydrologic concepts, on associating variables with these concepts, and on translating concept-based queries into queries specific to individual sources of primary data. At the theme catalog level, the variable-concept associations are used to group time series into “data carts”, which are the basis for generating hydrologic themes; the main issue there is recording the semantic provenance of themes and supporting reconciliation of units, time support and other characteristics that prepare a theme for visualization or modeling use. We describe the organization of semantic information in the CUAHSI HydroCatalog, introduce software tools for managing the hydrologic parameter ontology, and present initial results of concept-variable tagging. In particular, we discuss the results of using a hydrologic concept hierarchy based on the USGS and EPA Substance Registry System (SRS) for tagging hydrologic parameters in the metadata catalog. Currently, over 2000 catalog variables are available for concept-based search, primarily from observations made in water or suspended sediment. Additional work is needed to tag variables in other media and to manage the concept-variable mapping as the concept hierarchy evolves.

Conclusion

The CUAHSI HIS HydroCatalog has been efficient in supporting semantics-based search over 23.3 million time series representing over 60 observation networks with different variable semantics. Concept-based queries are supported by the CUAHSI concept hierarchy, which is composed of several vocabularies, and by mappings between source-specific vocabularies and leaf concepts in the hierarchy. The project is setting up a community-focused ontology management system, based on a semantic wiki, to enable crowd-sourcing of further ontology enhancement.

This HISCentral web application is used to associate variables in submitted datasets with terms in a hydrologic concept hierarchy, to support concept-based search.

Semantic annotation and search

What is HydroCatalog

At the general level, CUAHSI HIS includes three key components: a data publication platform (HydroServer), a data discovery and integration platform (HydroCatalog), and a data synthesis and research platform (represented by HydroDesktop).

Current catalog content:
• 60+ public services
• 18,000+ variables
• 1.96 million sites
• 23.3 million series
• 5.1 billion data values

Internally, HydroCatalog consists of services that are responsible for harvesting hydrologic metadata from registered services, managing ontology and variable-ontology mappings, monitoring, logging and validation of services, and supporting a query API.

The number of data requests brokered by the catalog has been growing.

Federal agency data services at HISCentral

Hydrologic Ontology

Syntactic heterogeneity is managed by describing water data using ODM and WaterML and by accessing them via uniform web services. Semantic differences across observation networks require a different approach.

About CUAHSI

The Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) is an organization representing 120+ universities in the US and 11 international affiliates. As part of its mission, CUAHSI supports the development of cyberinfrastructure for the hydrologic sciences. The CUAHSI HIS (Hydrologic Information System) project is a multi-year multi-institution effort focused on consistent management and integration of observational data available from several federal agencies (USGS, EPA, USDA, NOAA, etc.) as well as published by academic investigators.

Towards Distributed HydroCatalogs

HydroSphere

Organization of the CUAHSI HIS concept hierarchy (the case of Nitrogen)

[Figure labels: Permissible Search Keywords; Tagging Targets]

[Figure: NWIS unique parameter codes with associated period of record, matched against the Substance Registry System]
• Mapped variables: 3567 (labels: Don’t have data; Have data)
• Variables found in the catalog dump, tagged, and have data: 2103
• Not tagged yet: mostly variables in media other than water or suspended sediment
• Catalog variables not in SRS: taxonomic IDs, set number and similar metadata, context observations, surrogate measures, some organics…
• Unmapped variables
• Other counts shown in the figure: 4339, 2236, 9178, 7218

[Figure: distributed HydroCatalogs]
• University of Texas: UTexas Catalog, UTexas Services
• US Geological Survey: NWIS Catalog, NWIS Services
• San Diego Supercomputer Center: HISCentral Catalog, HISCentral
• Flow labels at each site: Series Metadata; Data
• Also shown: CSW MetaCatalog; HydroDesktop

The CUAHSI concept hierarchy is stored in SQL Server databases as a set of four primary tables:
• Concepts: contains the entire list of concepts
• Synonyms: concepts with equivalent definitions to terms that exist in the Concepts table
• Hierarchy: maintains the parent/child relationships between the concepts
• ConceptPaths: derived from the Concepts and Hierarchy tables to create a “conceptPath” attribute for each concept, to simplify determining the upstream/downstream lineage for each concept
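
The derivation of ConceptPaths from Concepts and Hierarchy can be illustrated with a minimal sketch. It uses an in-memory SQLite database rather than SQL Server, and every column name (ConceptID, ConceptName, ParentID, ChildID, ConceptPath), the "/" separator and the sample concepts are assumptions made for illustration; the actual CUAHSI schema is not given in the poster.

```python
import sqlite3

def build_concept_paths(conn):
    """Derive a root-to-concept path string for every concept.

    Assumes each concept has at most one parent and the hierarchy is acyclic.
    """
    names = dict(conn.execute("SELECT ConceptID, ConceptName FROM Concepts"))
    parent = dict(conn.execute("SELECT ChildID, ParentID FROM Hierarchy"))
    paths = {}
    for concept_id in names:
        chain, node = [], concept_id
        while node is not None:          # walk upstream until the root
            chain.append(names[node])
            node = parent.get(node)
        paths[concept_id] = "/".join(reversed(chain))
    conn.execute(
        "CREATE TABLE ConceptPaths (ConceptID INTEGER PRIMARY KEY, ConceptPath TEXT)"
    )
    conn.executemany("INSERT INTO ConceptPaths VALUES (?, ?)", paths.items())
    return paths

def descendants(conn, concept_id):
    """Find all downstream concepts with a single prefix match on the path."""
    (prefix,) = conn.execute(
        "SELECT ConceptPath FROM ConceptPaths WHERE ConceptID = ?", (concept_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT ConceptID FROM ConceptPaths WHERE ConceptPath LIKE ? ORDER BY ConceptID",
        (prefix + "/%",),
    )
    return [row[0] for row in rows]

# Hypothetical sample data, not taken from the real hierarchy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Concepts (ConceptID INTEGER PRIMARY KEY, ConceptName TEXT);
CREATE TABLE Hierarchy (ParentID INTEGER, ChildID INTEGER);
INSERT INTO Concepts VALUES (1, 'Chemical'), (2, 'Nutrient'), (3, 'Nitrogen'), (4, 'Nitrate');
INSERT INTO Hierarchy VALUES (1, 2), (2, 3), (3, 4);
""")
paths = build_concept_paths(conn)
```

Storing the full path trades some redundancy for simple lineage lookups: finding everything below a concept becomes one prefix match instead of a recursive walk over Hierarchy.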

CUAHSI HIS is an online distributed system that supports the sharing of hydrologic data from multiple repositories and databases via standard water data service protocols. It includes software for data publication, discovery, access and integration.

HISCentral Web Service
• GetWordList
• GetOntologyConceptCode
• GetOntologyKeyword
• GetOntologyTree
• GetMappedVariables
• GetSearchableConcepts
• SetSeriesCatalogForBox
• GetServicesInBox
• GetSitesInBox

Water data web services are registered at the Central HIS service registry. The HISCentral application harvests observation metadata from each service (sites, variables, and periods of record that are accessed via the service) at regular intervals and appends it to the central metadata catalog. In addition, HISCentral supports semantic tagging of the registered data by associating the harvested variables with concepts from the hydrologic concept hierarchy. The HISCentral web service enables data discovery by HIS client applications.
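
As an illustration of the tagging step, the sketch below pre-matches harvested variable names against concept names and synonyms, and leaves the remainder for manual tagging. This is an assumed helper, not the HISCentral implementation: the poster describes tagging through a web application, and the function names, identifiers and sample data here are all hypothetical.

```python
def normalize(term):
    """Lower-case and collapse whitespace so trivial spelling variants match."""
    return " ".join(term.lower().split())

def tag_variables(variables, concepts, synonyms):
    """Split harvested variables into mapped and unmapped sets.

    variables: {variable_id: variable_name} as harvested from a source
    concepts:  {concept_id: concept_name} from the concept hierarchy
    synonyms:  {alternative_name: concept_id}
    """
    lookup = {normalize(name): cid for cid, name in concepts.items()}
    for alias, cid in synonyms.items():
        lookup[normalize(alias)] = cid

    mapped, unmapped = {}, []
    for variable_id, variable_name in variables.items():
        concept_id = lookup.get(normalize(variable_name))
        if concept_id is None:
            unmapped.append(variable_id)   # left for a human curator
        else:
            mapped[variable_id] = concept_id
    return mapped, unmapped

# Hypothetical harvested variables from three sources.
variables = {"srcA:101": "Nitrate", "srcB:n_no3": "NO3  as N", "srcC:7": "Set number"}
concepts = {10: "Nitrate", 11: "Ammonia"}
synonyms = {"NO3 as N": 10}
mapped, unmapped = tag_variables(variables, concepts, synonyms)
```

Exact matching after normalization is deliberately conservative: a variable such as "Set number" that has no counterpart in the hierarchy stays unmapped rather than being forced onto a concept.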

The current version of the CUAHSI HIS concept hierarchy includes 4095 concepts (3999 leaf concepts), which are organized into three major groups: physical, chemical and biological parameters. It incorporates concepts from several sources, including the EPA/USGS Substance Registry System (SRS) and biological nomenclatures. The hierarchy is visualized as an Inxight Startree.

Matching the content of the USGS National Water Information System catalog with SRS concepts in the CUAHSI HydroCatalog

The content of HydroCatalog, including the concept hierarchy and semantic mappings, is exposed via the HISCentral Web Service. It can be used by applications such as HydroDesktop to discover time series based on concepts.

Vocabularies used by each data source are matched with a common controlled vocabulary. In the process of water data service registration, variable names in each source are associated with concepts in the concept hierarchy. This provides for semantics-aware data discovery and integration regardless of naming conventions or synonyms used by individual sources.
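
Query expansion over such mappings can be sketched as follows: a concept is expanded to itself plus all of its descendants, and the variables mapped to those concepts are grouped by source so that each source can be queried in its own vocabulary. The data structures, source names and variable codes are hypothetical; the poster does not specify how HISCentral stores or traverses the mappings.

```python
from collections import defaultdict

def expand_concept(concept_id, children):
    """Return the concept and every concept below it (iterative traversal)."""
    found, seen, stack = [], set(), [concept_id]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        found.append(node)
        stack.extend(children.get(node, ()))
    return found

def expand_query(concept_id, children, concept_to_variables):
    """Translate one concept into per-source lists of variable codes."""
    per_source = defaultdict(list)
    for cid in expand_concept(concept_id, children):
        for source, variable_code in concept_to_variables.get(cid, ()):
            per_source[source].append(variable_code)
    return {source: sorted(codes) for source, codes in per_source.items()}

# Hypothetical hierarchy fragment and concept-to-variable mappings.
children = {"Nitrogen": ["Nitrate", "Ammonia"]}
concept_to_variables = {
    "Nitrate": [("SourceA", "A-101"), ("SourceB", "no3")],
    "Ammonia": [("SourceA", "A-205")],
}
expanded = expand_query("Nitrogen", children, concept_to_variables)
```

A search for the parent concept therefore reaches variables tagged only at the leaf level, which is why the mappings target leaf concepts.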

Query expansion based on conceptID-variableID mappings

CUAHSI HydroCatalog is evolving towards compliance with OGC Catalog Services for the Web (CSW) specifications. Water data services are exposed as Web Feature Services (WFS), which contain time series information. They are registered in ESRI’s GeoPortal, which supports browsing and querying the catalog via CSW methods. In turn, this allows us to develop the HydroCatalog into a distributed system of HydroCatalogs, which can harvest service information from each other.
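
For orientation, a keyword search against a CSW catalog might be expressed as a GetRecords request in key-value form. The sketch below only builds the request URL and sends nothing. It assumes CSW version 2.0.2 and a server that accepts CQL text constraints on the AnyText queryable; the endpoint URL is a placeholder, and the poster does not say which CSW version or query style the GeoPortal deployment used.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def csw_getrecords_url(endpoint, keyword, max_records=10):
    """Build a CSW 2.0.2 GetRecords URL (KVP encoding) for a free-text search.

    The keyword is inserted into a CQL string as-is, so it must not
    contain single quotes in this simplified sketch.
    """
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "summary",
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        "constraint": f"AnyText LIKE '%{keyword}%'",
        "maxRecords": str(max_records),
    }
    return f"{endpoint}?{urlencode(params)}"

# Placeholder endpoint; substitute the address of an actual CSW server.
url = csw_getrecords_url("https://example.org/geoportal/csw", "nitrate")
query = parse_qs(urlsplit(url).query)
```

Because every catalog in a distributed system would answer the same request, a client or a peer catalog could harvest records from each one without source-specific code.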

An experimental HydroPortal registering WFS services from HISCentral

The planned system of distributed HydroCatalogs

Links:
Project web site: http://his.cuahsi.org
HISCentral: http://hiscentral.cuahsi.org
Visualization of the current concept hierarchy: http://hiscentral.cuahsi.org/startree.aspx

IN41C-1367