Large-Scale Data Management Challenges: Federating Climate, Water, and Weather Data

Repository/Workspace Workshop, 20-21 September 2010

Kenneth Galluppi, Director, Disaster and Environmental Programs

Renaissance Computing Institute, University of North Carolina at Chapel Hill



Outline

• Environmental Problem
• Use Case
– Climate and Weather
– Hydrology
• Data Grid/Workspace Use Cases
• Answer Peter’s Questions


Environmental Science Problems

• Enable cutting edge, Grand Challenge multidisciplinary science through the federation of data-grids of climate, water, and weather data, with other geospatially and socially relevant datasets.

– Understanding regional impacts of climate change on water availability and societal trends

– Understanding and predicting catastrophic weather-driven events under climate change

– Communicating risk/crisis knowledge to non-specialists


Challenges of Data

• Integration of Large, Multidisciplinary Datasets
– NCDC and NOAA Centers, SDSC, and others
– Discover, access, integration, utility [not store/retrieve]
• Linkage of Datasets to Computational Models
– Input/outputs for real-time model forecasting
– Model-to-observation comparison
– Climatic models for reanalysis and prediction
• Access to Large Reference Data
– Climate Reanalysis Datasets, 1 PetaByte
– NWS DataCube for aviation and emergencies


Collaboration and Datagrids

• National Climatic Data Center
• Emergency Management Research Program
• Federal Agencies
• Academic Research
• 140 universities

Data Supports NOAA/NCDC Mission

National Climatic Data Center

NOAA Mission: To understand and predict changes in Earth’s environment and conserve and manage coastal and marine resources to meet our nation’s economic, social, and environmental needs

NOAA Goals:

• Weather & Water: Serve Society’s Needs for Weather and Water Information
• Ecosystems: Protect, Restore, and Manage the Use of Coastal and Ocean Resources through an Ecosystem Approach to Management
• Climate: Understand Climate Variability and Change to Enhance Society’s Ability to Plan and Respond
• Commerce & Transportation: Support the Nation’s Commerce with Information for Safe, Efficient, and Environmentally Sound Transportation
• Mission Support: Provide Critical Support for NOAA’s Mission

The National Environmental Data Archive

• Climate Analysis
• RADAR
• Satellite
• Other
• Comprehensive Large Array-data Stewardship System (CLASS) Storage (reanalysis)

NOAA CLASS

• Large structured data
• Proprietary
• Doesn’t interface with HPSS
• Climate support of products and services
• Does well what it does

NOAA’s Data Centers Will Function in a Wider Information Landscape

• NCDC, NGDC, NODC, NSOF
• ORNL, ESG
• NSF DataNet
• DAPs Data Mgmt
• IPCC
• International Sources
• NEAAT

Climate Services using Federated DB’s

• NOAA’s Data Centers will need to provide access to petabytes of data that are distributed across multiple NOAA facilities
• Be able to integrate these data with data from other disciplines (environmental, biological, social, etc.) that are distributed on other databases in both the public and private sector domains
• Export data to common data formats: Shapefile, Well-Known Text, Arc/Info ASCII GRID, Gridded and Raw NetCDF, GeoTIFF and KMZ (Google Earth)
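As a minimal sketch of one export target named above, Well-Known Text (WKT) can be written with plain string formatting. The function names `point_to_wkt` and `bbox_to_wkt` and the sample coordinates are illustrative assumptions, not part of any NOAA service.

```python
def point_to_wkt(lon: float, lat: float) -> str:
    """Format a longitude/latitude pair as a WKT POINT (x = lon, y = lat)."""
    return f"POINT ({lon:g} {lat:g})"


def bbox_to_wkt(west: float, south: float, east: float, north: float) -> str:
    """Format a bounding box as a WKT POLYGON with one closed ring."""
    ring = [
        (west, south),
        (east, south),
        (east, north),
        (west, north),
        (west, south),  # WKT rings repeat the first vertex to close the ring
    ]
    coords = ", ".join(f"{x:g} {y:g}" for x, y in ring)
    return f"POLYGON (({coords}))"
```

The other listed formats (Shapefile, NetCDF, GeoTIFF, KMZ) are binary or archive formats and would normally be written through a dedicated library rather than by hand.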

Coordinated, efficient, integrated, interoperable Data Systems

Discipline-Specific View

• Space Observations
• Ocean Observations
• Land Surface Observation
• Atmospheric Observations

Whole-System View

Current systems are program-specific, focused, individually efficient. But incompatible, not integrated, isolated from one another and from wider environmental community

Support: Disaster Reduction, Human Health, Climate, Water Resources, Weather, Ocean Resources, Agriculture & Land-Use, Ecosystems

NOAA/NCDC Climate Services

Data supports NOAA/NCDC Mission

• NCDC will need to function in a wider information landscape with a NOAA Federated Archive (6 data centers)
– Support distributed data management and services
• Interoperable with DataNet, Earth System Grid, GEO-IDE, EOSDIS, etc.
– netCDF, LDM, CF conventions, ISO 19115-2
• Move out of the Box and into the Cloud (networked)
– Utilize highly distributed storage and computing (RENCI, Oak Ridge National Lab)
• Implement supporting technologies to enable interoperability with Designated Communities (OGC, WMS/WFS)
• Institute rules-based data management to enable true federation of NOAA Centers of Data
– iRODS


NCDC-RENCI Potential Use Cases

• Catastrophic Event Modeling and Observations
• Climate Reanalysis Datasets
– Climate records everywhere, for 30 years
– 1 petabyte
– Regional and local sub-setting
– Tens of thousands of users
• Multi-sensed Gridded Precipitation Climatology
• Extreme Event Climatology
• Green Energy, physical-social science integration
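The "regional and local sub-setting" item in the list above can be sketched as a bounding-box cut on a latitude/longitude grid. This is a toy illustration in pure Python with in-memory lists; the function name `subset_grid` and its argument layout are assumptions, and a petabyte-scale reanalysis archive would do the same selection server-side on chunked files.

```python
def subset_grid(lats, lons, values, south, north, west, east):
    """Return (lats, lons, values) restricted to a bounding box.

    values[i][j] is the grid value at lats[i], lons[j]. Cells whose
    center lies on the box edge are kept.
    """
    rows = [i for i, lat in enumerate(lats) if south <= lat <= north]
    cols = [j for j, lon in enumerate(lons) if west <= lon <= east]
    return (
        [lats[i] for i in rows],
        [lons[j] for j in cols],
        [[values[i][j] for j in cols] for i in rows],
    )
```

Cutting the grid before transfer means a user pays only for the region requested, which matters when tens of thousands of users share one large dataset.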


High Level View of HIS Service Oriented Architecture

As of October 2009, 1,867,108 sites and 4,336,790,286 data values were available through the HIS from federal, state, and academic data providers.

There were 543,144 “GetValues” data requests from February 2008 to October 2009.

http://his.cuahsi.org

• Discovery: Hydroseek
• Access: HydroDesktop
• Analysis: MATLAB, Excel, GIS, R, …
• Modeling: HydroModeler
• HIS Central: catalog hydrologic data and metadata
• HIS Server: store and share hydrologic data (ODM)
• Observatories: publication and archival of field data
• 3rd-Party Servers: include data from others
• Web services and WaterML to transmit hydrologic data in a standard way: GetSites, GetSiteInfo, GetVariableInfo, GetValues
• Hydrology Community

CUAHSI HIS

The CUAHSI Hydrologic Information System (HIS) is an internet-based system to support the sharing of hydrologic data. It comprises hydrologic databases and servers connected through web services, as well as software for data publication, discovery, and access.

• HIS Central: Data Discovery and Integration platform (like search portals: Google, Yahoo, Bing)
• HydroServer: Data Publication platform (like web servers)
• HydroDesktop: Data Synthesis and Research platform (like browsers)
• Water Data Services, Spatial Data Services (like HTML)
• Data Services, Metadata Services, Metadata Search
• Service registration, Catalog harvesting, Service and data theme metadata, Data carts


HIS Service Oriented Architecture

Publication of Point Observations

• Observations Data Model (ODM)
– ODM Tools
– ODM Data Loader
– ODM Streaming Data Loader
– ODM Controlled Vocabularies
• WaterOneFlow web services
– Data are transmitted in WaterML format
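To make the idea of transmitting observations as XML concrete, here is a hedged sketch that reads time-series values from a small XML document. The `SAMPLE` fragment is a simplified, hypothetical stand-in loosely modelled on a WaterML time-series response; it omits namespaces and most metadata, so it should not be read as the actual WaterML schema. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified response; real WaterML documents carry
# XML namespaces, site and variable metadata, units, and quality codes.
SAMPLE = """<timeSeriesResponse>
  <timeSeries>
    <variable><variableCode>Discharge</variableCode></variable>
    <values>
      <value dateTime="2009-10-01T00:00:00">12.5</value>
      <value dateTime="2009-10-01T01:00:00">13.1</value>
    </values>
  </timeSeries>
</timeSeriesResponse>"""


def parse_values(xml_text):
    """Return a list of (timestamp string, float value) pairs."""
    root = ET.fromstring(xml_text)
    return [(v.get("dateTime"), float(v.text)) for v in root.iter("value")]
```

Because every publisher emits the same structure, one parser like this can serve data from any ODM instance, which is the point of a shared exchange format.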

Dynamic Controlled Vocabulary Moderation System

• Local ODM Database
• Master ODM Controlled Vocabulary
• HIS CV Website
• ODM Controlled Vocabulary Moderator
• ODM Data Manager
• ODM Controlled Vocabulary Web Services
• ODM Tools
• Local Server
• XML

http://his.cuahsi.org/mastercvreg.html

Horsburgh, J. S., D. G. Tarboton, D. R. Maidment and I. Zaslavsky, (2008), A Relational Model for Environmental and Water Resources Data, Water Resour. Res., 44: W05406, doi:10.1029/2007WR006392.

CUAHSI Observations Data Model http://his.cuahsi.org/odmdatabases.html

Maximize Data Access and Utility


Data and Model Integration Needed to Support Hydrologic Science

• Observations
• Hydrologic Models
• Weather and Climate Models
• Physical Data
• Socioeconomic Data
• CUAHSI HIS: multiple ODM databases, each served through WaterOneFlow
• DFC

HydroServer Capabilities Database

HydroServer: Distributed Water Data Services

• ODM Databases and Web Services (each ODM database exposed through WaterOneFlow, WOF)
• ArcGIS Server Spatial Data Services
• Capabilities Database
• Configuration Tool
• Spatial Services
• WaterOneFlow Services

National Dataset Water Data Services (each exposed through WOF)

• NWIS
• STORET
• DAYMET
• SNOTEL

HIS Central

• Metadata Catalog
• Ontology
• Ontology Services
• Metadata Services
• Metadata Harvesting
• Variable Mapping

HydroDesktop

• Plug-ins
• Desktop Data Repository
• Search, Download, and Manage Data
• Subscriptions
• Visualize and Summarize (TSA)
• Convert Units
• Convert Formats and Export
• Import (Files)
• Data Discovery Calls
• Web Service Calls (WaterML)

Desktop Analysis Software

• R
• MATLAB
• Excel
• Workflow

HydroServer Implementation in WATERS Network Information System

National Hydrologic Information Server, San Diego Supercomputer Center

• 11 WATERS Network test bed projects
• 16 ODM instances (some test beds have more than one ODM instance)
• Data from 1246 sites; of these, 167 sites are operated by WATERS investigators

Meteorology, Hydrology, Ecological Models

• RHESSys
• TOPS
• ADAS
• WRF
• HEC-RAS
• ADCIRC

Uses

• Scientific Research
• Historical Re-Analysis
• Disaster Planning
• Disaster Response
• Agricultural Forecasts
• Ag Decision Support
• Public Dissemination
• Economic Planning
• etc …

Sensor Data Bus

• TOPS
• State Climate Office
• CHPS

Sensor Cloud

• National Weather Service
• Department of Transportation / FAA
• USGS NWIS, USFS
• Buoys, Stream Gauges, Soil Moisture
• People with mobile devices
• etc …

Enablement


Use Case: National Water Model

Hydrologic scientists have expressed a “grand research challenge” of building a National Water Model for flood and drought applications.

Achieving this goal will require a system like DFC to handle the massive data requirements.

Figure: Terrain in the Neuse River Basin, NC, constructed from 390 million LiDAR measurements. Source: terrain.cs.duke.edu

Figure: Flooding in the Mississippi River Basin, August 1993, observed from satellite imagery. Source: nasa.gov


CUAHSI Case Study

• Hydrology Grand Challenge Problem: National Water Model
– How much water is available in the Nation’s water resources?
– Currently, hydrologic models are implemented at the watershed-scale (county)
– Hydrologists plan to scale physically-based models to national level
• Provide CI, Policies & Sustainability for Water Model Data
– Gathering, analysis, dissemination and preservation
– Policies for quality control, metadata harvesting, versioning and usage
– Enables the data required for real-time analysis for flood and drought modeling
– Enables integrating data from “new sources”
– Enables new science, outreach, decision making and disaster recovery
– Integration of Predictive Models, Real-time Data and Historic Data

• Technical Solutions
– Too many systems/solutions, home grown to programs (CUAHSI)
– Standards (ODM, OGC, Virtual USA, etc.)
– Federal enterprises
– NOAA, CLASS general, heavy system
– Oracle front end to large tape system
• Unique
• Handling large sets with limited skills
• Multidisciplinary, formats are not enough, but knowledge
• Federal
– Has to work, has to preserve
– Observation systems are getting more complex
– Users are more sophisticated and demanding more

Data Management

• Large Storage Systems
• Compute and Servers
• Firewall Security
• HPCC Compute
• iRODS
• Workflow
• Data Manage
• DataNet Data Management, Data Grid Testbed

Diversity in the Landscape

• Data grids to include generic data management infrastructure
– Data sharing
– Digital libraries, publish and discovery
– Persistent archives for preservation
– Data processing pipelines
– Virtualize data collections
• File systems
• Tape archives
• Cloud storage
• Institutional repositories
• Digital repositories

Diversity in the Landscape

• Policy-based Data Management
– Each center has the same management needs but implements different policies and procedures
– Implement their own policies but leverage standard data management
– Interoperate with other repositories through specific drivers that implement the protocol
• Integrated Rule Oriented Data System (iRODS)

How to Federate?

Users, services and local storage

• Clients – present information in context
– User level file systems
– Web browsers
– Web services
• Workflow – manage processing steps
• Data grid – access to the repositories
– Uniform name space
– Properties (meta) and access (time stamp, version)
– Policies – retention, disposition, authenticity, QA
• Storage Systems – tapes, file system, cloud
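The "uniform name space" item in the list above can be illustrated with a toy catalog that maps one logical path to physical replicas on different storage systems. This is a sketch under stated assumptions: the class names `Replica` and `Catalog`, the storage labels, and the sample paths are hypothetical and do not reproduce the iRODS catalog or its API.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    storage: str        # e.g. "tape", "file_system", "cloud" (illustrative labels)
    physical_path: str  # where the bytes actually live on that system


@dataclass
class Catalog:
    entries: dict = field(default_factory=dict)

    def register(self, logical_path, replica):
        """Record one more physical copy under a logical name."""
        self.entries.setdefault(logical_path, []).append(replica)

    def resolve(self, logical_path, prefer=("file_system", "cloud", "tape")):
        """Return the replica on the most preferred storage, or None."""
        replicas = self.entries.get(logical_path, [])
        if not replicas:
            return None

        def rank(replica):
            # Unknown storage types sort after every preferred one.
            if replica.storage in prefer:
                return prefer.index(replica.storage)
            return len(prefer)

        return min(replicas, key=rank)
```

Clients and workflows only ever see the logical path, so data can move between tape, disk, and cloud without breaking anything that refers to it.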

Safe Replication

• Repositories must be replicated
• Data grids are good at this
– Making copies
– Keeping track of copies
– Integrity of copies
– Disposition of copies (rules for retention and checking)
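The first three replication tasks listed above (making copies, keeping track of them, and checking their integrity) can be sketched with the Python standard library. The function names and the dictionary used as a tracking record are illustrative assumptions; a data grid would keep this record in its metadata catalog rather than in memory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, target_dirs):
    """Copy source into each target directory; return {copy path: checksum}."""
    record = {}
    for target_dir in target_dirs:
        copy_path = Path(target_dir) / Path(source).name
        shutil.copy2(source, copy_path)
        record[str(copy_path)] = sha256_of(copy_path)
    return record


def verify(record):
    """Return the paths whose current checksum no longer matches the record."""
    return [path for path, checksum in record.items()
            if sha256_of(path) != checksum]
```

Running `verify` on a schedule is one simple way to express a "rule for checking"; a copy that fails can then be replaced from a copy that still passes.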

Policy Rules for Control

• Actions that simplify use of data
– Data sharing: access control, distribution, organizing
– Publishing: descriptive metadata, integrity, replication
– Data preservation: retention, disposition, trust, ownership
• Data ingestion, storage, and access control
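One way to picture policy rules for retention, disposition, and replication is a table of per-collection policies and a function that turns an object's state into required actions. This is a hedged sketch in plain Python, not the iRODS rule language; the collection names, thresholds, and action labels are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical policies: each center could supply its own table
# while sharing the same evaluation code.
POLICIES = {
    "interim": {"retention_days": 30, "min_replicas": 1},
    "archive": {"retention_days": None, "min_replicas": 2},  # keep forever
}


def evaluate(collection, created, replicas, today):
    """Return the actions a data object currently requires under its policy."""
    policy = POLICIES[collection]
    actions = []
    if replicas < policy["min_replicas"]:
        actions.append("replicate")
    limit = policy["retention_days"]
    if limit is not None and today - created > timedelta(days=limit):
        actions.append("dispose")
    return actions
```

Separating the policy table from the evaluation code mirrors the point made on the earlier slide: centers share the management machinery but set their own policies.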

User Workspaces

• Needed for interim data products
• Track operations performed on the data
– Same needs as repositories, only shorter timeframe
– Individual, organization, operation processing

Processing and workspaces

• Processing of petabyte collections and distributed processing
• Process at local storage if processing is simple
• Move files if processing is complex or demanding
• Data management views processing transparently and facilitates:
– Moving files
– Managing processing and workspace
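The choice between processing at the storage site and moving files can be sketched as a comparison of two estimated run times. The cost model below is an assumption introduced for illustration (the slides give no formula): local time is the work divided by the storage node's processing rate, and remote time is the transfer time plus the work divided by the compute cluster's rate. The function and parameter names are hypothetical.

```python
def choose_processing_site(size_bytes, work_ops,
                           storage_ops_per_s, compute_ops_per_s,
                           bandwidth_bytes_per_s):
    """Pick the cheaper option under a simple time-estimate model."""
    local_time = work_ops / storage_ops_per_s
    remote_time = (size_bytes / bandwidth_bytes_per_s
                   + work_ops / compute_ops_per_s)
    if local_time <= remote_time:
        return "process_at_storage"
    return "move_to_compute"
```

For a 1 GB file over a 100 MB/s link, the transfer alone costs about 10 seconds, so light operations stay local and only heavy computations justify the move.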

Frameworks for distributed processing

• iRODS – integrated Rule Oriented Data System
– Internal workflows (rules of microservices)
– External workflows (Taverna, Kepler, Pegasus)
– Data management decoupled from workflows and both can be distributed
• Data interchange with workflow
– Parameter passing (microservice)
– In-memory structures (workflow and microservice)
– In-memory, but distributed
– Shared metadata, retrieved out of catalog
– Shared files
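The in-memory style of data interchange listed above can be sketched as a chain of small functions that pass one shared structure from step to step. This is an illustration in plain Python, not iRODS microservices or any of the named workflow engines; the step names, the `ctx` dictionary layout, and the station fields are assumptions.

```python
def select_region(ctx):
    """Step 1: keep only stations inside the requested bounding box."""
    west, south, east, north = ctx["bbox"]
    ctx["stations"] = [
        s for s in ctx["stations"]
        if west <= s["lon"] <= east and south <= s["lat"] <= north
    ]
    return ctx


def mean_precip(ctx):
    """Step 2: summarize precipitation over the stations that remain."""
    values = [s["precip_mm"] for s in ctx["stations"]]
    ctx["mean_precip_mm"] = sum(values) / len(values) if values else None
    return ctx


def run_workflow(steps, ctx):
    """Run each step in order, handing the in-memory structure along."""
    for step in steps:
        ctx = step(ctx)
    return ctx
```

When steps run on different machines, the same structure would have to be serialized or replaced by shared metadata or shared files, which are the remaining options in the list.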