Post on 21-Dec-2015
Large-Scale Data Management Challenges
Federating Climate, Water, and Weather Data
Repository/Workspace Workshop, 20-21 September 2010
Kenneth Galluppi
Director, Disaster and Environmental Programs
Renaissance Computing Institute
University of North Carolina at Chapel Hill
Outline
• Environmental Problem
• Use Case
– Climate and Weather
– Hydrology
• Data Grid/Workspace Use Cases
• Answer Peter’s Questions
Environmental Science Problems
• Enable cutting edge, Grand Challenge multidisciplinary science through the federation of data-grids of climate, water, and weather data, with other geospatially and socially relevant datasets.
– Understanding of regional impacts of climate change on water availability and society trends
– Understanding and prediction of catastrophic weather-driven events under climate change
– Communicate risk/crisis knowledge to non-specialists
Challenges of Data
• Integration of Large, Multidisciplinary Datasets
– NCDC and NOAA Centers, SDSC, and others
– Discover, access, integration, utility [not store/retrieve]
• Linkage of Datasets to Computational Models
– Input/outputs for real-time model forecasting
– Model-to-observation comparison
– Climatic models for reanalysis and prediction
• Access to Large Reference Data
– Climate Reanalysis Datasets, 1 PetaByte
– NWS DataCube for aviation and emergencies
Collaboration and Datagrids
National Climatic Data Center
Emergency Management Research Program
Federal Agencies
Academic Research
140 universities
National Climatic Data Center
NOAA Mission:
To understand and predict changes in Earth’s environment and conserve and manage coastal and marine resources to meet our nation’s economic, social, and environmental needs
NOAA Goals:
• Weather & Water: Serve Society’s Needs for Weather and Water Information
• Ecosystems: Protect, Restore, and Manage the Use of Coastal and Ocean Resources through an Ecosystem Approach to Management
• Climate: Understand Climate Variability and Change to Enhance Society’s Ability to Plan and Respond
• Commerce & Transportation: Support the Nation’s Commerce with Information for Safe, Efficient, and Environmentally Sound Transportation
• Mission Support: Provide Critical Support for NOAA’s Mission
Data Supports NOAA/NCDC Mission
The National Environmental Data Archive
ClimateAnalysis
RADAR
Satellite
Other
Comprehensive Large Array-data Stewardship System (CLASS) Storage
(reanalysis)
The National Environmental Data Archive
NOAA CLASS
• Large structured data
• Proprietary
• Doesn’t interface with HPSS
• Climate support of products and services
• Does well what it does
NOAA’s Data Centers Will Function in a Wider Information Landscape
NCDC NGDC
NODC NSOF
ORNL,ESG
NSF DataNet
DAPs Data Mgmt
IPCC International Sources
International Sources
NEAAT
Climate Services using Federated DBs
NOAA’s Data Centers will need to provide access to petabytes of data that are distributed across multiple NOAA facilities
Be able to integrate these data with data from other disciplines (environmental, biological, social, etc.) that are distributed across other databases in both the public and private sectors
Export data to common data formats: Shapefile, Well-Known Text, Arc/Info ASCII GRID, Gridded and Raw NetCDF, GeoTIFF and KMZ (Google Earth)
Coordinated, efficient, integrated, interoperable
Data Systems
Space Observations
Ocean Observations
Land Surface Observation
Atmospheric Observations
Discipline-Specific View Whole-System View
Current systems are program-specific, focused, and individually efficient, but incompatible, not integrated, and isolated from one another and from the wider environmental community.
Support: Disaster reduction, Human Health, Climate, Water Resources, Weather, Ocean Resources, Agriculture & Land-Use, Ecosystems
Data supports NOAA/NCDC Mission
• NCDC will need to function in a wider information landscape with a NOAA Federated Archive (6 data centers)
– Support distributed data management and services
• Interoperable with DataNet, Earth System Grid, GEO-IDE, EOSDIS, etc.
– netCDF, LDM, CF conventions, ISO 19115-2
• Move out of the Box and into the Cloud (networked)
– Utilize highly distributed storage and computing (RENCI, Oak Ridge National Lab)
• Implement supporting technologies to enable interoperability with Designated Communities (OGC, WMS/WFS)
• Institute rules-based data management to enable true federation of NOAA Centers of Data
– iRODS
NCDC-RENCI Potential Use Cases
• Catastrophic Event Modeling and Observations
• Climate Reanalysis Datasets
– Climate records everywhere, for 30 years
– 1 PetaByte
– Regional and local sub-setting
– Tens of thousands of users
• Multi-sensed Gridded Precipitation Climatology
• Extreme Event Climatology
• Green Energy, physical-social science Integration
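The "regional and local sub-setting" item above can be illustrated with a minimal sketch: cut a bounding box out of a regular latitude/longitude grid so that a user downloads only the region of interest rather than the full reanalysis dataset. The function name `subset_grid`, the grid layout, and the sample values are assumptions made for illustration; they do not describe any NCDC or RENCI system.

```python
from typing import List, Tuple


def subset_grid(
    lats: List[float],
    lons: List[float],
    values: List[List[float]],
    lat_range: Tuple[float, float],
    lon_range: Tuple[float, float],
) -> Tuple[List[float], List[float], List[List[float]]]:
    """Return the part of a regular lat/lon grid inside a bounding box.

    values[i][j] is the value at (lats[i], lons[j]); bounds are inclusive.
    """
    lat_idx = [i for i, lat in enumerate(lats) if lat_range[0] <= lat <= lat_range[1]]
    lon_idx = [j for j, lon in enumerate(lons) if lon_range[0] <= lon <= lon_range[1]]
    sub_values = [[values[i][j] for j in lon_idx] for i in lat_idx]
    return [lats[i] for i in lat_idx], [lons[j] for j in lon_idx], sub_values


# Hypothetical 3 x 4 grid with made-up values.
lats = [34.0, 35.0, 36.0]
lons = [-84.0, -82.0, -80.0, -78.0]
values = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]

sub_lats, sub_lons, sub_values = subset_grid(
    lats, lons, values, lat_range=(35.0, 36.0), lon_range=(-80.0, -78.0)
)
```

At petabyte scale the same selection would have to run next to the storage, so that only the subset crosses the network.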
High Level View of HIS Service Oriented Architecture
As of October 2009, 1,867,108 sites and 4,336,790,286 data values were available through the HIS from federal, state, and academic data providers.
There were 543,144 “GetValues” data requests from Feb 2008 to Oct 2009.
http://his.cuahsi.org
Discovery: Hydroseek
Access: HydroDesktop
Analysis: MATLAB, Excel, GIS, R, …
Modeling: HydroModeler
Web services and WaterML to transmit hydrologic data in a standard way: GetSites, GetSiteInfo, GetVariableInfo, GetValues
HIS Central: catalog hydrologic data and metadata
HIS Server (ODM): store and share hydrologic data
Observatories: publication and archival of field data
3rd-Party Servers: include data from others
Hydrology Community
CUAHSI HIS
The CUAHSI Hydrologic Information System (HIS) is an Internet-based system to support the sharing of hydrologic data. It comprises hydrologic databases and servers connected through web services, as well as software for data publication, discovery and access.
Data Discovery and Integration platform
Data Publication platform
Data Synthesis and Research platform
Data Services
Metadata Services
Metadata Search
HIS Central
HydroDesktop
HydroServer
Service registration
Catalog harvesting
Service and data theme metadata
Data carts
Water Data Services
Spatial Data Services
Like search portals (Google, Yahoo, Bing)
Like browsers
Like web servers
Like HTML
Publication of Point Observations
• Observations Data Model (ODM)
– ODM Tools
– ODM Data Loader
– ODM Streaming Data Loader
– ODM Controlled Vocabularies
• WaterOneFlow web services
– Data are transmitted in WaterML format
Dynamic Controlled Vocabulary Moderation System
Local ODM Database
Master ODM Controlled Vocabulary
HIS CV Website
ODM Controlled Vocabulary Moderator
ODM Data Manager
ODM Controlled Vocabulary Web Services
ODM Tools
Local Server
XML
http://his.cuahsi.org/mastercvreg.html
Horsburgh, J. S., D. G. Tarboton, D. R. Maidment and I. Zaslavsky, (2008), A Relational Model for Environmental and Water Resources Data, Water Resour. Res., 44: W05406, doi:10.1029/2007WR006392.
CUAHSI Observations Data Model http://his.cuahsi.org/odmdatabases.html
Data and Model Integration Needed to Support Hydrologic Science
Observations
Hydrologic Models
Weather and Climate Models
Physical Data
Socioeconomic Data
CUAHSI HIS
DFC
ODM ODM ODM
WaterOneFlow WaterOneFlow WaterOneFlow
HydroServer Capabilities Database
ODM Databases and Web Services
ArcGIS Server Spatial Data Services
Capabilities Database
Configuration Tool
Spatial Services
WaterOneFlow Services
ODM
WOF
National Dataset Water Data Services
NWIS
WOF
ODM
WOF
ODM
WOF
ODM
WOF
HydroServer Distributed Water Data Services
STORET
WOF
DAYMET
WOF
Metadata Catalog
Ontology
HydroDesktop
HIS Central
Ontology Services
Metadata Services
HydroDesktop Plug-ins
Desktop Data Repository
Search, Download, and Manage Data
Subscriptions
Visualize and Summarize (TSA)
Convert Units
Convert Formats and Export
Import
Files
Files
Data Discovery Calls
Web Service Calls
WaterML
SNOTEL
WOF
…
WOF
Metadata Harvesting
Variable Mapping
ODM
WOF
R
MATLAB
Excel
Desktop Analysis Software
Workflow
• 11 WATERS Network test bed projects
• 16 ODM instances (some test beds have more than one ODM instance)
• Data from 1246 sites; of these, 167 sites are operated by WATERS investigators
National Hydrologic Information Server
San Diego Supercomputer Center
HydroServer Implementation in WATERS Network Information System
RHESSys
TOPS
ADAS
Meteorology, Hydrology, Ecological Models
WRF RHESSys HEC-RAS
ADCIRC
Scientific Research
Historical Re-Analysis
Disaster Planning
Disaster Response
Agricultural Forecasts
Ag Decision Support
Public Dissemination
Economic Planning
etc …
Sensor Data Bus
TOPS
State Climate Office
Sensor Cloud
• National Weather Service
• Department of Transportation / FAA
• USGS NWIS, USFS
• Buoys, Stream Gauges, Soil Moisture
• People with mobile devices
• etc …
CHPS
Enablement
Use Case: National Water Model
Terrain in the Neuse River Basin, NC, constructed from 390 million LiDAR measurements
Flooding in the Mississippi River Basin, August 1993, observed from satellite imagery
Hydrologic scientists have expressed a “grand research challenge” of building a National Water Model for flood and drought applications.
Achieving this goal will require a system like DFC to handle the massive data requirements.
Source: nasa.gov
Source: terrain.cs.duke.edu
CUAHSI Case Study
• Hydrology Grand Challenge Problem: National Water Model
– How much water is available in the Nation’s water resources?
– Currently, hydrologic models are implemented at the watershed scale (county)
– Hydrologists plan to scale physically-based models to the national level
• Provide CI, Policies & Sustainability for Water Model Data
– Gathering, analysis, dissemination and preservation
– Policies for quality control, metadata harvesting, versioning and usage
– Enables the data required for real-time analysis for flood and drought modeling
– Enables integrating data from “new sources”
– Enables new science, outreach, decision making and disaster recovery
– Integration of Predictive Models, Real-time Data and Historic Data
• Technical Solutions
– Too many systems/solutions, home-grown within programs (CUAHSI)
– Standards (ODM, OGC, Virtual USA, etc.)
– Federal enterprises
– NOAA, CLASS: general, heavy system
– Oracle front end to large tape system
• Unique
• Handling large sets with limited skills
• Multidisciplinary: formats are not enough, knowledge is needed too
• Federal
– Has to work, has to preserve
– Observation systems are getting more complex
– Users are more sophisticated and demanding more
Data Management
Large Storage Systems
Compute and Servers
Firewall Security
HPCC Compute
iRODS
Workflow
Data Manage
DataNet
Data Management, Data Grid Testbed
Diversity in the Landscape
• Data grids to include generic data management infrastructure
– Data sharing
– Digital libraries, publish and discovery
– Persistent archives for preservation
– Data processing pipelines
– Virtualize data collections
• File systems
• Tape archives
• Cloud storage
• Institutional repositories
• Digital repositories
Diversity in the Landscape
• Policy-based Data Management
– Each center has the same management needs but implements different policies and procedures
– Centers implement their own policies but leverage standard data management
– Interoperate with other repositories through specific drivers that implement the protocol
• Integrated Rule Oriented Data System (iRODS)
How to Federate?
Users, services and local storage
• Clients – present information in context
– User level file systems
– Web browsers
– Web services
• Workflow – manage processing steps
• Data grid – access to the repositories
– Uniform name space
– Properties (meta) and access (time stamp, version)
– Policies – retention, disposition, authenticity, QA
• Storage Systems – tapes, file system, cloud
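The "uniform name space" layer above can be pictured as a catalog that maps one logical name to any number of physical copies and carries properties alongside. The following is a minimal sketch under assumptions: the class names `Replica`, `LogicalObject`, and `NameSpace`, the storage labels, and the paths are all hypothetical, and this is plain Python rather than the iRODS catalog or its API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Replica:
    storage: str        # illustrative labels: "disk", "cloud", "tape"
    physical_path: str  # where this copy actually lives


@dataclass
class LogicalObject:
    replicas: List[Replica] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


class NameSpace:
    """Maps one logical name to any number of physical copies."""

    def __init__(self) -> None:
        self._objects: Dict[str, LogicalObject] = {}

    def register(self, logical_name: str, replica: Replica, **metadata: str) -> None:
        obj = self._objects.setdefault(logical_name, LogicalObject())
        obj.replicas.append(replica)
        obj.metadata.update(metadata)

    def metadata(self, logical_name: str) -> Dict[str, str]:
        return dict(self._objects[logical_name].metadata)

    def resolve(
        self, logical_name: str, prefer: Sequence[str] = ("disk", "cloud", "tape")
    ) -> Replica:
        """Pick a copy by storage preference; the caller never sees the layout."""
        replicas = self._objects[logical_name].replicas
        for storage in prefer:
            for replica in replicas:
                if replica.storage == storage:
                    return replica
        return replicas[0]


# Hypothetical logical name and physical locations.
grid = NameSpace()
grid.register("/federation/precip/2009.nc", Replica("tape", "/hpss/vol7/a91f"), version="1")
grid.register("/federation/precip/2009.nc", Replica("disk", "/data/cache/2009.nc"))
chosen = grid.resolve("/federation/precip/2009.nc")
```

Clients and workflows work only with the logical name, so a copy can move between tape, file system, and cloud without breaking references.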
Safe Replication
• Repositories must be replicated
• Data grids are good at this
– Making copies
– Keeping track of copies
– Integrity of copies
– Disposition of copies (rules for retention and checking)
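The replication duties listed above (making copies, keeping track of them, checking their integrity) can be sketched with checksums. This is a minimal illustration using only the Python standard library; the function names `replicate` and `damaged_copies` and the dictionary-based catalog are assumptions, not how any particular data grid stores its replica records.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List


def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, target_dirs: List[Path]) -> Dict[str, str]:
    """Copy `source` into each target directory and record every copy.

    Returns a catalog mapping each copy's path to the expected checksum.
    """
    expected = checksum(source)
    catalog = {str(source): expected}
    for directory in target_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        copy_path = directory / source.name
        shutil.copy2(source, copy_path)
        catalog[str(copy_path)] = expected
    return catalog


def damaged_copies(catalog: Dict[str, str]) -> List[str]:
    """List copies that are missing or whose checksum no longer matches."""
    return [
        path
        for path, expected in catalog.items()
        if not Path(path).exists() or checksum(Path(path)) != expected
    ]
```

A periodic rule would call `damaged_copies` and restore any listed path from a copy that still verifies.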
Policy Rules for Control
• Actions that simplify use of data
– Data sharing: access control, distribution, organizing
– Publishing: descriptive metadata, integrity, replication
– Data preservation: retention, disposition, trust, ownership
• Data ingestion, storage, and access control
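Retention and disposition policies of the kind listed above can be written as data rather than as procedures, so that each center keeps its own rules while sharing one evaluation mechanism. The sketch below is plain Python, not the iRODS rule language; the names `Rule`, `DataObject`, and `evaluate`, the collection names, and the retention periods are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, List


@dataclass
class DataObject:
    name: str
    collection: str
    created: date


@dataclass
class Rule:
    description: str
    applies: Callable[[DataObject], bool]  # which objects the rule covers
    retention: timedelta                   # how long to keep them
    disposition: str                       # action once retention has expired


def evaluate(obj: DataObject, rules: List[Rule], today: date) -> str:
    """Return the action for `obj` under the first matching rule."""
    for rule in rules:
        if rule.applies(obj):
            if today - obj.created > rule.retention:
                return rule.disposition
            return "retain"
    return "retain"  # no matching rule: keep the object


# Made-up policies for two collections.
rules = [
    Rule("interim workspace products", lambda o: o.collection == "workspace",
         timedelta(days=30), "delete"),
    Rule("archived climate records", lambda o: o.collection == "archive",
         timedelta(days=36500), "review"),
]

today = date(2010, 9, 21)
scratch = DataObject("run42.tmp", "workspace", date(2010, 8, 1))
record = DataObject("precip_2000.nc", "archive", date(2000, 1, 1))
scratch_action = evaluate(scratch, rules, today)
record_action = evaluate(record, rules, today)
```

Changing a center's policy then means editing its rule list, not its software.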
User Workspaces
• Needed for interim data products
• Track operations performed on the data
– Same needs as repositories, only shorter timeframe
– Individual, organization, operation processing
Processing and workspaces
• Processing of petabyte collections and distributed processing
• Process at local storage if processing is simple
• Move the file if processing is complex or demanding
• Data management views processing transparently and facilitates:
– Moving files
– Managing processing and workspace
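The placement choice above (process in place when the work is simple, move the file when it is demanding) can be written as a small decision function. This is a sketch under assumptions: the `Job` fields, the `choose_site` name, and the one-CPU-hour threshold are invented for illustration, and a real scheduler would also weigh transfer time and queue load.

```python
from dataclasses import dataclass


@dataclass
class Job:
    file_size_gb: float  # size of the input file
    cpu_hours: float     # estimated processing cost


def choose_site(job: Job, storage_cpu_hours_limit: float = 1.0) -> str:
    """Return "storage" to process in place, or "compute" to move the file.

    The limit is a made-up threshold for what the storage host can absorb.
    """
    if job.cpu_hours <= storage_cpu_hours_limit:
        return "storage"
    return "compute"


# A large file with cheap processing stays put; a small file with an
# expensive model run is shipped to the compute cluster.
subset_job = Job(file_size_gb=500.0, cpu_hours=0.2)
model_job = Job(file_size_gb=2.0, cpu_hours=40.0)
subset_site = choose_site(subset_job)
model_site = choose_site(model_job)
```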
Frameworks for distributed processing
• iRODS – integrated Rule Oriented Data System
– Internal workflows (rules of microservices)
– External workflows (Taverna, Kepler, Pegasus)
– Data management decoupled from workflows and both can be distributed
• Data interchange with workflow
– Parameter passing (microservice)
– In-memory structures (workflow and microservice)
– In-memory, but distributed
– Shared metadata, retrieved out of catalog
– Shared files
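The first two interchange styles above, parameter passing and in-memory structures, can be shown with a rule modelled as an ordered chain of small functions that hand a shared dictionary from one step to the next. This is a plain-Python analogy only: the names `measure_size`, `classify`, and `run_rule` are hypothetical and are not iRODS microservices or its rule syntax.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]                     # in-memory structure shared by steps
Microservice = Callable[[Context], Context]  # one small processing step


def measure_size(ctx: Context) -> Context:
    """Hypothetical step: record the payload size in bytes."""
    return {**ctx, "size_bytes": len(ctx["payload"])}


def classify(ctx: Context) -> Context:
    """Hypothetical step: label the object using the previous step's result."""
    label = "large" if ctx["size_bytes"] > ctx["large_threshold"] else "small"
    return {**ctx, "size_class": label}


def run_rule(steps: List[Microservice], ctx: Context) -> Context:
    """Run the steps in order, handing each one the previous step's output."""
    for step in steps:
        ctx = step(ctx)
    return ctx


result = run_rule(
    [measure_size, classify],
    {"payload": b"streamflow,12.5\n", "large_threshold": 1024},
)
```

When the steps run on different hosts, the same context has to travel as shared metadata or shared files instead, which corresponds to the remaining items in the list.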