Post on 08-Jun-2020


Scientific Workflows and the Dissemination of Computations and Data

Goals

• Generalize the generic functionality into reusable frameworks

• Create a simple language to share, discuss and re-execute flows of computations

• Enable dissemination of computations and results both on the desktop and on the web

Another view of Sp2Learn/GeoLearn

• Sp2Learn/GeoLearn is a sequence of steps
• Each step leads to the next step in the sequence
• The sequence of steps can be re-executed
• The sequence of steps is a scientific workflow

LoadRaster(s)

CombineRasters

ExtractArea

SelectInputs and Outputs

CreatePrediction Model

ComputeAccuracy
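The idea of a workflow as a re-executable sequence of steps can be sketched in a few lines of Python. This is only an illustration, not Sp2Learn/GeoLearn code: the step names are borrowed from the boxes above, while `run_workflow` and the placeholder step bodies are assumptions made for the example.

```python
from typing import Any, Callable, List, Tuple

# A step is a (name, function) pair; each function consumes the previous output.
Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Run each step on the output of the previous one and log the step names."""
    log: List[str] = []
    for name, func in steps:
        data = func(data)
        log.append(name)
    return data, log


# Placeholder steps (hypothetical): rasters are plain lists of numbers here.
steps: List[Step] = [
    ("LoadRasters", lambda paths: [[1.0, 2.0], [3.0, 4.0]]),
    ("CombineRasters", lambda rasters: [sum(cells) for cells in zip(*rasters)]),
    ("ExtractArea", lambda raster: raster[:1]),
]

result, log = run_workflow(steps, ["a.tif", "b.tif"])
# Because the workflow is just data plus functions, it can be re-executed as is.
rerun, _ = run_workflow(steps, ["a.tif", "b.tif"])
```

Keeping the step list separate from the runner is what makes the same sequence easy to share and re-run.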

Scientific Workflow

Wikipedia defines a scientific workflow as: “A scientific workflow is the process of combining data and processes into a configurable, structured set of steps that implement semi-automated computational solutions of a scientific problem. Scientific workflow systems often provide graphical user interfaces to combine different technologies along with efficient methods for using them, and thus increase the efficiency of the scientists.”

Requirements (1/3)

• Allow for reuse of existing tools
  • Don’t force the use of our favorite programming language
  • Don’t make people re-implement their tools
• Allow for sharing of knowledge
  • Share the data/tools/workflows
• Standards
  • Use standard technologies where possible
• Provenance
  • Who did what, when, and how?
• Security
  • Limit access to data
  • Limit access to compute resources

Requirements (2/3)

• Create an editor for scientific workflows
  • Don’t hide the workflow
• Easy to extend
  • Support Java, compiled code, Matlab, …
• Easy to use
  • Playing is more fun (try things out, don’t solve everything first)
• Remote execution
  • Long running jobs
  • Compute intensive jobs
  • Limited resources

Requirements (3/3)

• Allow for web access
  • Allows for easy sharing
  • Allows for easy re-execution
  • Allows for visualizations of results
  • No required downloads

Provenance

Wikipedia defines provenance as: “Provenance, from the French provenir, "to come from", means the origin, or the source, of something, or the history of the ownership or location of an object. The term was originally mostly used of works of art, but is now used in similar senses in a wide range of fields, including science and computing.”

Cyberintegrator & DSE

• Two proposed frameworks we are currently building
• Playgrounds for us to
  • try new features in
  • solve problems presented to us by different communities
• Cyberintegrator is currently in beta
• DSE is going into alpha release
• All code developed is open source

Birthday Weather Demo (Input Form)

Birthday Weather Demo (Result)

Birthday Weather Demo (Execution)

Birthday Weather Demo

• Data for the 48 contiguous United States
• Basic meteorological variables
• Over 1000 observing stations
• Data from 1871 to 2005 available

• 1,250,055 data points computed per execution

High Level System Architecture

Information Sharing

• Virtual Communities
  • Multiple people want access to the same data/tools etc.
  • Sharing of knowledge
• Multiple applications want shared access
  • Cyberintegrator
  • DSE
  • Cyberintegrator execution service
  • …

• Data is stored as blobs and metadata about data as RDF

Resource Description Framework (RDF)

• RDF data model is based upon the idea of making statements about resources, in the form of subject-predicate-object expressions

• For example, "The sky has the color blue" in RDF:
  • a subject denoting “the sky”
  • a predicate denoting “has the color”
  • an object denoting “blue”

• A collection of RDF statements intrinsically represents a labeled, directed multi-graph.
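To make the subject-predicate-object idea concrete, the sketch below stores statements as plain Python tuples and indexes them as a labeled, directed multigraph. It deliberately avoids any RDF library; the helper names `build_graph` and `objects`, and the two statements beyond the sky example, are assumptions added for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

triples: List[Triple] = [
    ("the sky", "has the color", "blue"),   # the example from the slide
    ("the sky", "is above", "the ground"),  # illustrative extra statement
    ("the sea", "has the color", "blue"),   # illustrative extra statement
]


def build_graph(statements: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index statements by subject; each edge keeps its predicate as a label."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph: Dict[str, List[Tuple[str, str]]], subject: str, predicate: str) -> List[str]:
    """Return every object reachable from subject along edges labeled predicate."""
    return [obj for pred, obj in graph.get(subject, []) if pred == predicate]


graph = build_graph(triples)
sky_colors = objects(graph, "the sky", "has the color")
```

Several edges may leave the same subject, and two nodes may be linked by more than one predicate, which is why the structure is a multigraph rather than a simple graph.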

Shared Content Repository

• MySQL
• Derby
• Sesame
• File System
• WebDAV

Content Repository

Cyberintegrator

Cyberintegrator Editor

Cyberintegrator

• Focus attention on exploration
  • Support discovery in workflow creation via ‘Macro-recording’ style interface
  • Separate science from ‘logistics’
• Workflows as a communication mechanism
  • Make workflows (templates and provenance of runs) documented and sharable
• Enable integration of independent tools
  • Keep models, algorithms, data in open formats accessible from outside the scientific workflow system

Workflow View

• Show tools executed and information
  • Parameters used for execution
  • Who executed the tool
  • When was it executed (and when did it finish)
  • Data sets used for input
  • Data sets generated as output
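The items listed in the workflow view map naturally onto a small provenance record. The following is a sketch under assumptions, not Cyberintegrator's actual data model: the class `ExecutionRecord`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class ExecutionRecord:
    """Hypothetical record of one tool execution: who, what, when, and with which data."""
    tool: str
    executed_by: str
    parameters: Dict[str, str]
    inputs: List[str]
    outputs: List[str] = field(default_factory=list)
    started: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    finished: Optional[datetime] = None

    def finish(self, outputs: List[str]) -> None:
        """Store the generated data sets and stamp the completion time."""
        self.outputs = list(outputs)
        self.finished = datetime.now(timezone.utc)


# Illustrative values only.
record = ExecutionRecord(
    tool="ExtractArea",
    executed_by="alice",
    parameters={"region": "county"},
    inputs=["combined.tif"],
)
record.finish(["area.tif"])
```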

Cyberintegrator Architecture

• RCP application
• Plugin based
• Engine plugin
  • Threaded Engine
  • Remote Engine
• Executor plugin
  • Java
  • Matlab
  • Command Line

Cyberintegrator

Engine

Executor Executor Executor

Tool Creation

• Use wizards (defined by executor)
• All resources are stored with the tool definition
  • All files are zipped and uploaded to the repository
  • Can export the tool and look at the included files
• When executing a tool, all files are downloaded
  • Files are stored in a temp folder
  • Inputs are stored in the temp folder as well
  • No need to have the tool installed on the local machine!
• Can edit tool
  • We all make mistakes
  • Trail of edits is stored (and documented)
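The zip-then-unpack cycle described above can be sketched with the Python standard library. This is an illustration only: the functions `package_tool` and `unpack_tool`, the file names, and the use of a local folder in place of the shared repository are assumptions, not Cyberintegrator code.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Dict


def package_tool(files: Dict[str, str], archive: Path) -> Path:
    """Zip the script and its resources into one archive (the 'upload' step)."""
    with zipfile.ZipFile(archive, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return archive


def unpack_tool(archive: Path) -> Path:
    """Extract the archive into a fresh temp folder (the 'download before run' step)."""
    workdir = Path(tempfile.mkdtemp(prefix="tool-"))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir)
    return workdir


# A local temp folder stands in for the shared content repository.
repo = Path(tempfile.mkdtemp(prefix="repo-"))
archive = package_tool(
    {"run.m": "disp('hello')\n", "data/lookup.csv": "a,b\n1,2\n"},
    repo / "tool.zip",
)
workdir = unpack_tool(archive)
```

Because everything the tool needs travels inside the archive, the executing machine only needs the executor, not a local installation of the tool.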

Matlab Tool Creation

Script to execute

Resources required by script

Digital System Explorer

Digital System Explorer (DSE)

• A simple and accessible web interface for the end user to browse and share scientific workflows and results on the web

• A set of libraries to build rich internet applications
  • Intuitive interfaces for
    • Scientific data
    • Aggregate scenarios around workflows and executions of workflows
    • Provenance trail
• Requirements
  • Easy access: web browser
  • Open system: ease of integration with existing applications. RESTful web services, RSS feeds, web gadgets, portlets

Computation: Input Page

Text Input Widget

Map Input Widget

Computation: Executing

Execution Status

Workflow Steps

Computation: Results Page

Datasets produced by execution

Simple Visualization

New Computation

• Publishing a workflow by
  • Providing high level descriptions
  • Selecting the workflow to publish
  • Selecting what parameters to make available and how
  • Selecting what datasets to publish as results of an execution and how to visualize them

New Computation: General Information

New Computation: Select Workflow

New Computation: Select Parameters

Configure Input Widgets

Configure Output Visualizations

Publishing Workflows

• Publishing workflows to the web
  • Is wizard driven
  • Requires no writing of code
• As with other Web 2.0 sites, we are counting on the wisdom of the crowds to self-organize to solve scientific problems
• One possible caveat
  • Privacy and willingness to share scientific results

Possible Future Work

• Migrate SP2Learn and GeoLearn functionality to Cyberintegrator framework

• Add more input widget types and visualizations types to the DSE framework

• Add collaborative features such as social tagging and discussion thread to the DSE (this is already available in Cyberintegrator)

Scenarios

• Two ongoing efforts to highlight some of the features discussed so far:
  • Plant Growth 4-H
  • Virtual Sensors

Plant Growth 4-H Scenario

Goals

• Incorporates a state-of-the-art generic crop growth simulation model and historic weather data for the purposes of designing educational activities for young learners.

• Enable informal learners to operate a high fidelity plant growth model through a web interface.

Team

• Luigi Marini, Andrew Wadsworth, Terry McLaren, Jim Myers, Raouf Berrabah, Joe Mansour, CET, NCSA;
• Anand Padmanabhan, NCSA & Dept. of Geography;
• Lisa Bouillion-Diaz, Extension Specialist, 4-H Youth;
• Xinguang Zhu, NCSA & Dept. of Plant Biology;
• Wen Wu Tang, Dept. of Geography;
• Dennis Bowman, Extension Educator, Crop Sciences;
• Bill Million, Extension Specialist, 4-H Youth, UIUC

The model

• “WIMOVAC (Windows Intuitive Model of Vegetation response to Atmosphere and Climate Change) is designed to facilitate the modelling of various aspects of plant photosynthesis with particular emphasis on the effects of global climate change.”

• S.W. Humphries and S.P. Long. WIMOVAC: a software package for modelling the dynamics of plant leaf and canopy photosynthesis. Comput. Appl. Biosci. 11: 361-371.

Exposed Parameters

• Wrapped the model as a Cyberintegrator tool

• Exposed a few relevant parameters as tool parameters

• This allowed us to easily create several workflows to be used in the different activities

Activities

• Activity 1 - Compare historical crop yield results in different regions.

• Activity 2 - Learn about the effects of CO2 levels in the atmosphere on plant growth.

• Activity 3 - Determine amount of seed to plant for optimal crop yield.

Plant Growth 4-H: Activity 1

• Compare historical crop yield results in different regions.
  • User input: latitude, longitude
  • Output: county, state, soil type and yield (bushels/acre) of corn or soybeans, visualized as table and plant graph

Plant Growth 4-H: Activity 1

Plant Growth 4-H: Activity 1

Plant Growth 4-H: Activity 1

Behind the Scenes

• Input map widget (similar to the one in the weather scenario) passes on the latitude and longitude of the selected point.

• Simple web application retrieves information about that particular point using PostGIS* queries.

• PostGIS supports spatial comparison functions such as ST_Contains (whether one geometry is completely contained by another geometry)

*Adds support for geographic objects to the PostgreSQL object-relational database
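For a single map click, the containment question reduces to a point-in-polygon test. The sketch below is not PostGIS and does not reproduce ST_Contains exactly (points on the boundary are not treated specially); it is a plain-Python ray-casting stand-in, with a hypothetical rectangular "county" given as (longitude, latitude) pairs.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def polygon_contains(polygon: List[Point], point: Point) -> bool:
    """Ray casting: count how many polygon edges a ray going in +x from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical county outline, for illustration only.
county = [(-89.0, 40.0), (-88.0, 40.0), (-88.0, 41.0), (-89.0, 41.0)]
clicked_inside = polygon_contains(county, (-88.5, 40.5))
clicked_outside = polygon_contains(county, (-87.5, 40.5))
```

In the deployed application this check would be delegated to the database, which can also use spatial indexes to find the matching county or soil polygon quickly.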

Behind the Scenes

• Data was collected by Dennis Bowman and Wenwu Tang from:
  • USDA - Natural Resources Conservation Service
  • USDA - National Agricultural Statistics Service

Plant Growth 4-H: Activity 2

• Effects of CO2 levels in the atmosphere on plant growth.
  • User input: 5 CO2 levels (one per model execution) and latitude of location
  • Output: yield (bushels/acre) of corn or soybeans, visualized in table, plant graph and growth line graph

Plant Growth 4-H: Activity 2

Plant Growth 4-H: Activity 2

Plant Growth 4-H: Activity 2 (graph; axis label: Yield)

Behind the Scenes

• Each text box maps to the Carbon Dioxide Concentration parameter of model execution

• The location on the map defines the latitude of all five executions

• When the user clicks “Run All” a new instance of the workflow is instantiated

• The second page polls the server waiting for all executions to complete
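The submit-then-poll pattern described above might look roughly like the following. Everything here is a stand-in: `FakeServer`, `wait_for_all`, the status strings, the run identifiers, and the five CO2 values are assumptions for the example, not the DSE API.

```python
import time
from typing import Callable, Dict, List


def wait_for_all(run_ids: List[str], get_status: Callable[[str], str],
                 interval: float = 0.0, max_polls: int = 100) -> Dict[str, str]:
    """Poll every execution until all report DONE, or give up after max_polls rounds."""
    for _ in range(max_polls):
        statuses = {run_id: get_status(run_id) for run_id in run_ids}
        if all(status == "DONE" for status in statuses.values()):
            return statuses
        time.sleep(interval)
    raise TimeoutError("executions did not finish in time")


class FakeServer:
    """Stand-in for the execution service: each run finishes after a fixed number of polls."""

    def __init__(self, polls_needed: int) -> None:
        self.polls_needed = polls_needed
        self.calls: Dict[str, int] = {}

    def status(self, run_id: str) -> str:
        self.calls[run_id] = self.calls.get(run_id, 0) + 1
        return "DONE" if self.calls[run_id] >= self.polls_needed else "RUNNING"


# One execution per text box; the CO2 values are hypothetical.
co2_levels = [280, 380, 480, 580, 680]
run_ids = [f"run-co2-{level}" for level in co2_levels]
server = FakeServer(polls_needed=3)
final = wait_for_all(run_ids, server.status)
```

A real page would use a non-zero polling interval and surface per-execution status to the user while waiting.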

Behind the Scenes

• Once the executions are done running the results page parses the output files to visualize a specific subset of results

• The model itself outputs the results in a CSV format with more than 100 variables

• In this particular case we are only interested in the total yield
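Pulling one variable out of a wide CSV file is a short job with the standard library. The sketch below assumes a header row, a hypothetical column named `total_yield`, and that the last row carries the end-of-season value; the real WIMOVAC output layout is not given in these slides.

```python
import csv
import io


def total_yield(csv_text: str, column: str = "total_yield") -> float:
    """Return the value of one column from the last data row of a CSV document."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV output contains no data rows")
    return float(rows[-1][column])


# Tiny made-up sample; the real output has more than 100 columns.
sample = (
    "day,leaf_area_index,total_yield\n"
    "1,0.1,0.0\n"
    "2,0.2,1.5\n"
    "3,0.4,3.25\n"
)
final_yield = total_yield(sample)
```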

Plant Growth 4-H: Activity 3

• Determine amount of seed to plant for optimal crop yield.
  • User input: 5 seed/m2 values (one per model execution) and latitude of location
  • Output: yield (bushels/acre) of corn or soybeans, visualized in table, plant graph and growth line graph

Plant Growth 4-H: Activity 3

Plant Growth 4-H: Activity 3

Plant Growth 4-H: Activity 3 (graph; axis label: Yield)

Future Work

• Reuse workflows and tools to create a more advanced scenario for researchers and/or scientists
  • Ability to tweak more parameters
  • Visualize more of the output data

• Add collaborative features such as the ability to discuss results in threaded discussions

Virtual Sensor Scenario

Goals

• Being developed to support a Chicago watershed research project to provide a real-time decision support system for optimal control of the Combined Sewage Overflow system

• Virtual rain gages can be defined on the Google map by clicking the "Add VS" button and clicking on a location in the Google Map

The Concept of Virtual Sensors

• Our definition of a virtual sensor is
  • the product of thematic, spatial, and/or temporal transformation and aggregation of one or multiple raw sensor measurement(s)
• An example of a virtual sensor
  • From WATERS Network SEDS Draft (Chapter 5, p108)
  • Signals from arrays of individual sensors and clusters of such arrays would be combined to provide higher-level information. For example, an array of soil moisture and temperature sensors might be coupled to a microclimate array to provide a virtual soil moisture flux sensor.

Slide courtesy of Yong Liu

Team

• Yong Liu, David Hill, Alejandro Rodriguez, Luigi Marini, Rob Kooper, James Myers, Terry McLaren, Nick Michal, Xiaowen Wu, NCSA;

• Barbara Minsker, NCSA and Civil and Environmental Engineering at UIUC;

User-Created Virtual Sensors in the Upper Illinois Watershed

User-created Virtual Sensor Locations

Virtual Sensor Time-series Plot (derived from KLOT NEXRAD)

USGS Gages (green bubbles)

Slide courtesy of Yong Liu

NEXRAD

• The Next Generation Radar (NEXRAD) system is a network of approximately 160 high-resolution Doppler weather radars operated by the National Weather Service.

• The NEXRAD system measures reflectivity, radial velocity and spectrum width of the radar echoes returned from volumes within the atmosphere at intervals of roughly 5, 6, or 10 minutes (never exact), depending on the operation mode of the radar.

• The reflectivity is correlated with the precipitation rate.

Virtual Sensor based on NEXRAD

• Workflows
  • Provide a NEXRAD Level II-based virtual sensor data stream in near-real-time

0. NEXRAD data stream
1. Spatial transformation to points
2. Thematic transformation to rainfall rates
3. Publish one data stream per point of interest
4. Temporal aggregation to produce n-minute rainfall accumulation at one point
5. Publish one virtual sensor data stream with n-minute rainfall accumulation

• Workflow 1: steps 0, 1, 2, 3
  • Run periodically at the arrival rate of the NEXRAD Level II data stream
• Workflow 2: steps 4, 5
  • Run at the user-specified rain accumulation interval
  • E.g., every 20 minutes

Slide courtesy of Yong Liu
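Steps 2 and 4 can be illustrated with a small numerical sketch. The slides do not say which reflectivity-to-rainfall relation the project uses; the code assumes the widely cited Marshall-Palmer form Z = a·R^b with a = 200 and b = 1.6, and assumes each rate holds until the next sample. The function names and sample values are hypothetical.

```python
import math
from typing import List, Tuple


def rainfall_rate_mm_per_h(dbz: float, a: float = 200.0, b: float = 1.6) -> float:
    """Thematic transformation: reflectivity in dBZ to rain rate via Z = a * R**b (assumed)."""
    z = 10.0 ** (dbz / 10.0)  # convert dBZ to linear reflectivity
    return (z / a) ** (1.0 / b)


def accumulation_mm(samples: List[Tuple[float, float]], window_minutes: float) -> float:
    """Temporal aggregation: integrate (minute, rate in mm/h) samples over the window.

    Samples must be sorted by time; each rate is held until the next sample
    or until the end of the window, whichever comes first.
    """
    total = 0.0
    for i, (start, rate) in enumerate(samples):
        end = samples[i + 1][0] if i + 1 < len(samples) else window_minutes
        end = min(end, window_minutes)
        if end > start:
            total += rate * (end - start) / 60.0  # mm/h times hours
    return total


# Irregular arrival times, echoing the "never exact" scan interval.
rates = [(0.0, 6.0), (5.0, 12.0), (15.0, 0.0)]
rain_20_min = accumulation_mm(rates, window_minutes=20.0)
unit_rate = rainfall_rate_mm_per_h(10.0 * math.log10(200.0))
```

Splitting the two functions mirrors the two workflows: the rate conversion runs whenever new radar data arrives, while the accumulation runs on the user-chosen interval.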

Workflows

• One master workflow handling the spatial and thematic transformation
  • Triggered when new data is available
• One workflow per Virtual Sensor handling the temporal transformation
• Tools in workflows wrap C++ model
• Streams are the links between the pieces

A Virtual Sensor Data Model

Diagram of the data model:

• Entities: Virtual Sensor, SpatialThing, Point, Polygon, DataStream, ThematicInterest (e.g. rainfall rate, rainfall accumulation), TemporalFrequency, GIS Layer
• Relations: hasLocation, isA, hasDataStream, derivedFrom, hasThematicInterest, hasTemporalInterval, belongsToLayer

A Virtual Sensor is more than just a new time-series data stream.

Slide courtesy of Yong Liu
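One way to read the data model is as a handful of linked record types. The sketch below is an interpretation, not the project's schema: which entity owns each relation is an assumption, the class and field names are hypothetical, and the coordinates are illustrative. The stream URI and source description are taken from the SensorML excerpt in this deck.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SpatialThing:
    name: str


@dataclass
class Point(SpatialThing):  # Point isA SpatialThing
    lat: float
    lon: float


@dataclass
class Polygon(SpatialThing):  # Polygon isA SpatialThing
    vertices: List[Tuple[float, float]]


@dataclass
class DataStream:
    uri: str
    derived_from: Optional[str] = None  # derivedFrom


@dataclass
class VirtualSensor:
    location: SpatialThing              # hasLocation
    thematic_interest: str              # hasThematicInterest
    temporal_interval_minutes: int      # hasTemporalInterval
    data_streams: List[DataStream] = field(default_factory=list)  # hasDataStream


sensor = VirtualSensor(
    location=Point(name="Sears", lat=41.88, lon=-87.64),  # illustrative coordinates
    thematic_interest="rainfall rate",
    temporal_interval_minutes=20,
    data_streams=[
        DataStream(
            uri="tag:cet.ncsa.uiuc.edu,2008:/VirtualSensor/Sears/rainfall-rate",
            derived_from="NEXRAD Level II data from WSR-88D KLOT Doppler radar",
        )
    ],
)
```

Carrying location, theme, interval and derivation alongside the stream is what makes a virtual sensor more than a bare time series.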

Metadata and Standards

• Using existing namespaces and standards:
  • http://www.opengis.net/sensorML/1.0.1/uom
  • http://www.w3.org/2003/01/geo/wgs84_pos#Point
  • http://www.opengis.net/gml/location
• Describing metadata in a portable and expressive framework:
  • Resource Description Framework (RDF)
  • (Think XML squared)

Publish Virtual Sensor as OGC SWE-compliant SensorML format

<sml:SensorML version="1.0.1" xsi:schemaLocation="http://www.opengis.net/sensorML/1.0.1 http://schemas.opengis.net/sensorML/1.0.1/sensorML.xsd">

<sml:identifier name="URI"><sml:Term definition="urn:ogc:def:identifierType:OGC:uniqueID"><sml:value>tag:cet.ncsa.uiuc.edu,2008:/VirtualSensor/Sears/rainfall-rate</sml:value></sml:Term></sml:identifier>......

<sml:identifier name="derivedFrom"><sml:value>NEXRAD Level II data from WSR-88D KLOT Doppler radar</sml:value>............

<sml:method name="SpatialandThematicTransformation" xlink:href="http://sensorweb-dev.ncsa.uiuc.edu:8190/cyberintegrator/cron/jobs/4327600e-a8a5-4cec-9d00-f099081b764e"/>

What source data is used to derive the virtual sensor?

What workflow is used?

• Provenance information is available for verification

• Interoperability is maintained through SWE-compliant publishing

Slide courtesy of Yong Liu

Summary and Conclusions

Summary

• Common Requirements
• Reusable Solutions
• Conclusions

Common Requirements

• Data:
  • Access large amounts of hydrologic, geographic, meteorological, water quality, soil type, land-use and many other types of data
  • Ingest and integrate heterogeneous large size data and streaming data
• Computational Resources:
  • Perform complex CPU and memory intensive data-driven analyses
  • Utilize a spectrum of distributed computational resources
• Data-driven Analyses (Software):
  • Design data-driven (data mining, machine learning, pattern recognition, statistical) analyses
  • Visualize and interpret data-driven models
  • Integrate data-driven models with physics/chemistry/bio based models
• Data and Software Integration:
  • Exercise seamlessly functionality present in heterogeneous software packages using available computational resources
  • Provide an environment where heterogeneous visualization and mining tools could be integrated into workflows, and the analysis workflows could be re-used and modified.

Reusable Solutions

• Multiple layers of abstraction

Diagram layers: Data, Sensors, Algorithms, Computational Resources, Frameworks, End User Applications, Users

Conclusions

• Geospatial data increasing exponentially
  • 160 NEXRAD stations, physical sensors, satellites, historical maps, planmaps, virtual sensors, …
• More earth observatories being planned
  • Resulting in even more data

Conclusions

• Research is done in larger groups
  • Problems are getting bigger
  • More people are interested in results
• Need to create tools to share knowledge
  • Both the final results and how the results are obtained
  • Need to allow for re-use of workflows
• Need to create infrastructures for re-use
  • We should not create custom applications for each problem
• Users are what matters
  • Whether they are researchers, students, stakeholders, scientists, developers, etc.

Acknowledgement

• This research was partially supported by National Aeronautics and Space Administration (NASA), the Faculty Fellow Program at National Center for Supercomputing Applications (NCSA), the Illinois State Water Survey (ISWS), NCSA Industrial partners, ONR Technology Research Education and Commercialization Center (TRECC), State of Illinois, Costa Rica CENAT, NARA, UIUC Provost Office, Google Summer of Code

• Contributions by the members of the Image Spatial Data Analysis (ISDA) Group, CET Division at NCSA and our collaborators from ISWS, CEE UIUC, University of Illinois Extension 4-H and International Institutions

Imaginations unbound

Disclaimer

• The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsors.

Thank you! Questions?

• Slides will be posted by the end of the week at
  • http://isda.ncsa.uiuc.edu/ILGISA/
• More information on the projects is available at
  • http://isda.ncsa.uiuc.edu
  • http://cet.ncsa.uiuc.edu
• Contact Information
  • Peter Bajcsy - pbajcsy@ncsa.uiuc.edu
  • Michal Ondrejcek - mondrejc@ncsa.uiuc.edu
  • Rob Kooper - kooper@ncsa.uiuc.edu
  • Luigi Marini - lmarini@ncsa.uiuc.edu
