Artemis: Integrating Scientific Data on the Grid

Post on 19-Feb-2016

Page 1: Artemis: Integrating Scientific Data on the Grid

Rattapoom Tuchinda, Snehal Thakkar, Yolanda Gil, Ewa Deelman

Page 2: Artemis: Integrating Scientific Data on the Grid

Outline
- Motivation
  - Data integration needs in scientific applications
  - Distributed computing in grids
- Problem statement
- Artemis architecture
- Evaluation
- Related work
- Conclusions and future work

Page 3: Artemis: Integrating Scientific Data on the Grid

Scientific Data Integration
- Large-scale, cross-disciplinary scientific data collection, storage, and analysis exacerbates heterogeneity and dynamics
  - National Virtual Observatory (NVO)
  - Earth System Grid (ESG)

Page 4: Artemis: Integrating Scientific Data on the Grid

Grid Computing [Foster & Kesselman 04]
Grids provide middleware services for distributed computing:
- Seamless integration and management of resources: OGSA
- Job submission and execution management: Condor
- Resource availability and performance: Monitoring and Directory Service (MDS)
- Data replication for robustness and efficiency: Replica Location Service (RLS)
- Descriptions of data sources: Metadata Catalog Service (MCS)

[Figure, from [Kesselman 04], with these annotations:]
- Discovery: many sources of data, services, and computation; registries (R) organize services of interest to a community
- Access: data integration activities may require access to, and exploration/analysis of, data at many locations; exploration and analysis may involve complex, multi-step workflows
- Resource management (RM) is needed to ensure progress and arbitrate competing demands
- Security and policy (security service, policy service) must underlie access and management decisions

Page 5: Artemis: Integrating Scientific Data on the Grid

Scientific Data Storage and Access
- Data sources are very heterogeneous
  - Data results from various instruments, disciplines, and types of analyses
  - Wide variety of data storage systems (files, DBs, servers, etc.)
- Data sources are highly distributed
  - Data is stored in different locations on the grid
  - Data is replicated in multiple locations
- Data sources are highly dynamic
  - Data grows continuously, new data models are routine
  - New data sources regularly appear
  - Data sources may become unavailable sporadically
- Data available at unprecedented scale
  - Very soon petabytes
- These challenges are in the way of scientific progress in many disciplines

Page 6: Artemis: Integrating Scientific Data on the Grid

Data Storage and Access in Grids
- Data described with metadata attributes
  - Attribute names may not be consistent across different sources
  - Metadata descriptions are often stored separately from the data itself
- Metadata Catalog Service (MCS) [Moore et al 01, Singh et al 03]
  - Stores descriptive metadata and allows users to query based on desired attributes
  - Addresses heterogeneity of data source implementations and access

Page 7: Artemis: Integrating Scientific Data on the Grid

Sample Query
search constraints:
    keywords = "atmospheric data" or "climate data" or "climate model"
    model type = "CCSM" or "PCM"
    period = 2001

search results:
    Files, collections, or views:
        /CCSM2/b20.007/atm
        /PCM/B06.62/atm
        /PCM/B06.20/atm
        /PCM/B06.21/atm
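A minimal sketch of attribute-based lookup in the spirit of the sample query, assuming an in-memory catalog. The `CatalogEntry` class, the `search` function, the entry paths, and the attribute values are all hypothetical illustrations; this is not the MCS API and not the real catalog contents.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One logical item (file, collection, or view) with its descriptive metadata."""
    path: str
    attributes: dict = field(default_factory=dict)


def matches(entry, constraints):
    """True if every constrained attribute has at least one allowed value."""
    for name, allowed in constraints.items():
        value = entry.attributes.get(name)
        # Treat single values and multi-valued attributes uniformly.
        values = value if isinstance(value, (list, set, tuple)) else [value]
        if not any(v in allowed for v in values):
            return False
    return True


def search(catalog, constraints):
    """Return the paths of all entries that satisfy the constraints."""
    return [entry.path for entry in catalog if matches(entry, constraints)]


# Hypothetical catalog contents, invented for this example.
catalog = [
    CatalogEntry("/demo/ccsm/run1/atm",
                 {"keywords": ["climate model"], "model type": "CCSM", "period": 2001}),
    CatalogEntry("/demo/pcm/run7/atm",
                 {"keywords": ["atmospheric data"], "model type": "PCM", "period": 2001}),
    CatalogEntry("/demo/pcm/run8/ocn",
                 {"keywords": ["ocean data"], "model type": "PCM", "period": 2001}),
    CatalogEntry("/demo/ccsm/run2/atm",
                 {"keywords": ["climate data"], "model type": "CCSM", "period": 1999}),
]

# Values within one attribute are alternatives ("or"); attributes are combined with "and".
constraints = {
    "keywords": {"atmospheric data", "climate data", "climate model"},
    "model type": {"CCSM", "PCM"},
    "period": {2001},
}

results = search(catalog, constraints)
print(results)
```

The sketch only works because every entry uses the same attribute names; the slides that follow deal with the case where catalogs do not agree on names.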

Page 8: Artemis: Integrating Scientific Data on the Grid

Problem Statement
- Users should have seamless single-point access
  - Should not have to formulate a different query for each source
  - Should not have to manage the unavailability of data sources
- Users need assistance formulating the queries
  - Data models may have different attribute names and representations (even from the same source)
  - New data models/metadata attributes are created all the time

[Figure: separate queries q1, q2, q3 go to catalogs MCS1, MCS2, MCS3, which sit in front of DB1, DB2, DB3. The catalogs use different attribute names (stime/etime, starttime/endtime, descr, sub), and one source is marked "currently unavailable".]

Page 9: Artemis: Integrating Scientific Data on the Grid

Artemis
- A mixed-initiative data integration system that:
  - Abstracts users from diversity in attribute representations
  - Assists users in formulating queries step by step
  - Manages the access and availability of dynamic collections of data sources
- Integrates and extends various AI techniques:
  - Data integration
  - Ontology
  - Dialogue wizards

Page 10: Artemis: Integrating Scientific Data on the Grid

Approach

[Figure: an ontology relates catalog-specific attribute names to shared terms. Under "Time", stime and starttime correspond to "Start time", and etime and endtime correspond to "End time"; description and subject are further catalog attributes. A request such as "Find files with Start time > 500000 ^ End time < 600000" passes through the Query Formulation Wizard and the Query Mediator to Metadata Catalogs 1, 2, and 3, each in front of a data source.]
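A minimal sketch of the attribute-name translation that such an ontology enables. The `MAPPINGS` table and the `translate` function are hypothetical, and which catalog uses which spelling (stime vs. starttime) is an assumption made for illustration; the slides do not say.

```python
# Ontology term -> attribute name used by each catalog.
# Hypothetical mapping tables; the assignment of spellings to catalogs is assumed.
MAPPINGS = {
    "MetadataCatalog1": {"Start time": "stime", "End time": "etime"},
    "MetadataCatalog2": {"Start time": "starttime", "End time": "endtime"},
    "MetadataCatalog3": {"Description": "description", "Subject": "subject"},
}


def translate(query, catalog):
    """Rewrite (ontology term, operator, value) filters into a catalog's own names.

    Returns None when the catalog has no attribute for one of the terms,
    which means the catalog cannot answer the query.
    """
    mapping = MAPPINGS[catalog]
    translated = []
    for term, op, value in query:
        if term not in mapping:
            return None
        translated.append((mapping[term], op, value))
    return translated


# The request from the figure, phrased in ontology terms.
query = [("Start time", ">", 500000), ("End time", "<", 600000)]

for name in MAPPINGS:
    print(name, translate(query, name))
```

The user writes the query once in ontology terms; each catalog receives it in its own vocabulary, and catalogs without the needed attributes drop out.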

Page 11: Artemis: Integrating Scientific Data on the Grid

Artemis Architecture

[Figure: the MCS Wizard (entity selection, filters), the Dynamic Model Generator, and the Prometheus Query Mediator sit above three Metadata Catalog Services, each in front of a data source. Supporting artifacts shown: ontology, model mappings, models.]

Page 12: Artemis: Integrating Scientific Data on the Grid

MCS Wizard
- Based on the Agent Wizard [Tuchinda 2003]
- Domain experts create mappings between ontologies and metadata attributes
- Users can then pick the ontology and the mappings relevant to their domain
- Guides the user through available operations and filters consistent with the models of the data

Page 13: Artemis: Integrating Scientific Data on the Grid

Prometheus Query Mediator
- Data integration system from earlier research [Thakkar et al. 2004] [Knoblock et al 2003]
- Provides a unified query interface to a wide variety of data sources
- Relational model
- Requires a pre-defined domain model relating sources to domain relations
- Extended in Artemis to support:
  - Source relations: various MCSs
  - Domain relations: File, View, Collection
  - Dynamic domain model based on availability of data sources

Page 14: Artemis: Integrating Scientific Data on the Grid

Dynamic Model Generation
- Generate the mediator model dynamically by querying MCSs
  - Convert the object-oriented model of MCSs to the relational model of the mediator
  - Handles the dynamic nature of data by generating new domain models at query time
- Intuitive idea
  - Query MCSs one at a time for all possible attributes of different objects
  - Create a domain relation for each object type with all possible attributes
  - Create rules defining each MCS as a data source
  - Relate the various data sources to domain relations
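A minimal sketch of the first step (asking each MCS in turn for its attributes at query time), assuming each catalog is represented by a callable that either returns its attributes or raises `ConnectionError`. The `collect_attributes` function, the `unreachable` helper, and the catalog contents are hypothetical; the slides do not describe how Artemis detects an unavailable catalog.

```python
def collect_attributes(catalogs):
    """Query catalogs one at a time and skip the ones that do not respond.

    catalogs maps a catalog name to a zero-argument callable that returns
    {object type: [attribute names]} or raises ConnectionError.
    """
    available, skipped = {}, []
    for name, fetch in catalogs.items():
        try:
            available[name] = fetch()
        except ConnectionError:
            # An unavailable catalog is left out of this query's model.
            skipped.append(name)
    return available, skipped


def unreachable():
    """Stand-in for a catalog that is currently down (hypothetical)."""
    raise ConnectionError("catalog did not respond")


# Hypothetical catalogs: MCS2 is unavailable at query time.
catalogs = {
    "MCS1": lambda: {"File": ["starttime", "endtime", "frequency", "amplitude"]},
    "MCS2": unreachable,
    "MCS3": lambda: {"File": ["starttime", "endtime", "lat", "lon"]},
}

available, skipped = collect_attributes(catalogs)
print(sorted(available), skipped)
```

Because the attributes are collected again for each query, a catalog that comes back later is picked up without any change on the user's side.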

Page 15: Artemis: Integrating Scientific Data on the Grid

Dynamic Model Generator (Cont'd)
Example
- MCS 1:
    File1(starttime, endtime, frequency), File2(starttime, endtime, frequency, amplitude)
- MCS 2:
    File3(starttime, endtime, lat, lon, temp), File4(starttime, endtime, lat, lon, windspeed)
- Domain relation
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name)
- Source relations
    MCS1File(starttime, endtime, frequency, amplitude, name)
    MCS2File(starttime, endtime, lat, lon, temp, windspeed, name)
- Domain rules
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS1File(starttime, endtime, frequency, amplitude, name) ^
        (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
        (frequency = '') ^ (amplitude = '')
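A minimal sketch that reproduces the example above: it unions the attributes seen in each MCS into a source relation, unions those into the domain relation, and emits one rule per MCS that sets the attributes the source lacks to the empty string. The function names are hypothetical, and this is an illustration of the idea rather than the Artemis implementation. Each source relation carries `name`, as in the source relations listed in the example.

```python
def ordered_union(attribute_lists):
    """Union of attribute names, keeping first-seen order."""
    seen = []
    for attrs in attribute_lists:
        for attr in attrs:
            if attr not in seen:
                seen.append(attr)
    return seen


def build_domain_model(catalogs, object_type="File"):
    """Build the domain relation, source relations, and rules for one object type.

    catalogs maps an MCS name to a list of attribute lists, one per object.
    """
    # One source relation per MCS: every attribute seen on its objects, plus name.
    sources = {mcs: ordered_union(objects) + ["name"]
               for mcs, objects in catalogs.items()}
    # One domain relation: every attribute seen in any MCS, with name last.
    domain = ordered_union(attrs[:-1] for attrs in sources.values()) + ["name"]

    head = f"{object_type}({', '.join(domain)})"
    rules = []
    for mcs, attrs in sources.items():
        body = [f"{mcs}{object_type}({', '.join(attrs)})"]
        # Attributes this MCS does not provide are bound to the empty string.
        body += [f"({attr} = '')" for attr in domain if attr not in attrs]
        rules.append(f"{head} :- " + " ^ ".join(body))
    return domain, sources, rules


# The two catalogs from the example.
catalogs = {
    "MCS1": [["starttime", "endtime", "frequency"],
             ["starttime", "endtime", "frequency", "amplitude"]],
    "MCS2": [["starttime", "endtime", "lat", "lon", "temp"],
             ["starttime", "endtime", "lat", "lon", "windspeed"]],
}

domain, sources, rules = build_domain_model(catalogs)
for rule in rules:
    print(rule)
```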

Page 16: Artemis: Integrating Scientific Data on the Grid

Query Processing
- When Prometheus receives a query, it determines which MCSs are relevant
  - Relevant MCSs are determined by comparing the constraints of the query with the constraints of the MCSs
  - MCSs that do not satisfy the constraints of the query are not used in the query
    - For example, if the query asks for files that contain data for some lat, lon, then MCS1 is not queried

Page 18: Artemis: Integrating Scientific Data on the Grid

Query Processing: Example
- Suppose the user uses the MCS Wizard to form the following query:
    Q(name) :-
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^
        (lat > 33) ^ (lat < 34) ^ (lon < -118) ^ (lon > -119) ^
        (starttime > 50000) ^ (endtime < 60000)
- The Prometheus mediator generates a Datalog program with the query and the domain rules:
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS1File(starttime, endtime, frequency, amplitude, name) ^
        (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
        (frequency = '') ^ (amplitude = '')
- The mediator determines that the constraints on the lat and lon attributes in the first rule (lat = '', lon = '') are not compatible with the order constraints on lat and lon in the query, so only MCS2 is queried
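A minimal sketch of the relevance check described above, assuming each domain rule is summarized by the attributes it binds to the empty string. The names `satisfies`, `rule_is_relevant`, and `RULE_BINDINGS` are hypothetical; the actual Prometheus mediator reasons over the Datalog program and is not shown in the slides.

```python
import operator

EMPTY = ""  # marker the domain rules use for attributes a source lacks
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def satisfies(bound, op, value):
    """Can a rule that fixes an attribute to `bound` meet the constraint?"""
    if bound == EMPTY:
        # An empty attribute only matches an explicit request for the empty value.
        return op == "=" and value == EMPTY
    return OPS[op](bound, value)


def rule_is_relevant(bindings, query_constraints):
    """A rule is relevant if none of its fixed values contradicts the query."""
    return all(
        satisfies(bindings[attr], op, value)
        for attr, op, value in query_constraints
        if attr in bindings
    )


# Attributes each rule binds to the empty string, taken from the two domain rules.
RULE_BINDINGS = {
    "MCS1": {"lat": EMPTY, "lon": EMPTY, "temp": EMPTY, "windspeed": EMPTY},
    "MCS2": {"frequency": EMPTY, "amplitude": EMPTY},
}

# The constraints of the example query.
query = [
    ("lat", ">", 33), ("lat", "<", 34),
    ("lon", "<", -118), ("lon", ">", -119),
    ("starttime", ">", 50000), ("endtime", "<", 60000),
]

relevant = [mcs for mcs, bindings in RULE_BINDINGS.items()
            if rule_is_relevant(bindings, query)]
print(relevant)
```

The MCS1 rule fixes lat and lon to the empty string, which cannot satisfy a range constraint, so MCS1 is pruned before any catalog is contacted.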

Page 19: Artemis: Integrating Scientific Data on the Grid

Artemis: Top-Level Selection

Page 20: Artemis: Integrating Scientific Data on the Grid

Artemis: Filtering

Page 21: Artemis: Integrating Scientific Data on the Grid

Evaluation
- Enabled users to query 12 different MCSs
  - Covering information from three different applications: LIGO, ESG, and a geospatial data warehouse
  - Covering 17,000 different files
  - Metadata consisted of about 300 different attributes
- Simulated addition of metadata to MCSs and failure of several MCSs while the system was running

Page 22: Artemis: Integrating Scientific Data on the Grid

Related Work
- MCS [Singh et al 03]
  - Organizes metadata about objects on the data grid
  - Object-oriented schema to support user-defined metadata attributes
  - Difficult for users to keep track of diverse attribute names
  - No semantic information is attached to the attributes
- Agent Wizard [Tuchinda et al. 2003]
  - Interactive application that guides the user by dividing complex tasks into a series of simpler question-answering tasks
  - Challenge is to model a complex task as a set of simpler subtasks
- Prometheus Mediator [Thakkar et al. 2004]
  - Data integration system that can efficiently integrate data from a wide variety of data sources
  - Key restriction is that the relational schema for data sources and domain must be known in advance

Page 23: Artemis: Integrating Scientific Data on the Grid

Related Work (Cont'd)
- myGrid [Wroe 2003]
  - Models data sources as semantic web services
  - Integration of data sources is represented as a workflow
  - Requires that data sources have a fixed schema and associated semantics
- Model-based mediator system for scientific data management [Ludascher 2003]
  - Data sources provide semantic information regarding their data
  - The provided information is used to generate a domain model for a mediator system
  - Assumption is that semantic information is provided by the different data sources of interest

Page 24: Artemis: Integrating Scientific Data on the Grid

Conclusions
- Contributions:
  - Mixed-initiative approach to help scientists query objects on the data grid
  - Isolate users from heterogeneity of data sources
  - Manage distributed dynamic data
- Future work:
  - Algorithm to determine when to dynamically generate the domain model
  - Better support for specifying model mappings
  - Artemis available as a grid service
  - More extensive testing and usability studies

Page 25: Artemis: Integrating Scientific Data on the Grid

?