session 43 :: accessing data using a common interface: ogsa-dai as an example
DESCRIPTION
Speaker: Elias Theocharopoulos
TRANSCRIPT
web: www.omii.ac.uk email: [email protected]
Sessions 43 & 44
Accessing data using a common interface: OGSA-DAI as an example
Elias Theocharopoulos and Tilaye Alemu
ISSGC ‘09 – Sophia Antipolis – Tuesday, 14th July 2009
Overview
• The problem: Sharing data in a grid
• What is OGSA-DAI?
• Data-centric workflows
• Key OGSA-DAI terms
• The OGSA-DAI client toolkit
• Use cases and extensibility points
• Pros and cons
Distributed data resources
Central server pros and cons
• Access to up-to-date data
• Single point of access
• Data in common format
• Database can handle joins
• Initial overhead in terms of time, effort and cost
• Keeping data up to date
• Loss of control by data providers
  o Assuming they even let go
• Security and trust
How about providing direct access?
[Diagram: the client sends separate UK, ES and IA queries directly to each data resource, receives the UK, ES and IA data, and must translate and join the results itself.]
Direct access pros and cons
• Access to up-to-date data
• Fast access
• Data providers retain control
• Fat clients
• Heterogeneity and inconsistency
  o Data
  o Databases
  o Connection
  o Security
• Security overheads for data providers
  o Manage firewalls and usernames/passwords for multiple clients
• Hard to use in grid/web service workflows
How about providing a ZIP on the web?
[Diagram: each provider publishes a ZIP of its data (UK, ES, IA); the client fetches each ZIP with HTTP GET, then unzips, translates and joins the data itself.]
ZIP on the web pros and cons
• Fast access
• Data providers retain control
• Very large downloads even if client only needs subset
• Providers have to select and ZIP their data
• Client has to install data into a local database
• Static snapshot
Sharing distributed heterogeneous resources with OGSA-DAI
[Diagram: the client talks only to OGSA-DAI, which sends the UK, ES, IA and FR queries to the respective data resources, receives their data, and translates and joins the results.]
Motivation
• Grid is about sharing resources
• Need to share structured data resources
  o Relational Database
  o XML Database
  o Indexed File
What is OGSA-DAI?
• Open Grid Services Architecture Data Access and Integration
• A framework that executes workflows
• Workflows are data-centric
• Workflow components are designed for data access, integration, transformation and delivery
• Can access heterogeneous data resources
• Web service interface
• Intended as a toolkit for building higher-level application-specific data services
OGSA-DAI’s vision
• Sharing data resources to enable collaboration
• Data access
  o Structured data in distributed heterogeneous data resources
• Data integration
  o e.g. expose multiple databases to users as a single virtual database
• Data transformation
  o e.g. expose data in schema X to users as data in schema Y
• Data delivery
  o To where it’s needed by the most appropriate means
  o e.g. web service, e-mail, HTTP, FTP, GridFTP
OGSA-DAI and data-centric workflows
OGSA-DAI workflow
• Executes workflows
• Workflows contain activities
  o Well-defined functional units
  o Data goes in, something is done, data comes out
  o Equivalent to programming language methods
• Workflows are submitted by clients
  o To an OGSA-DAI web service
An OGSA-DAI workflow - a simple analogy
[Diagram: a French-language query is answered from an English-language and a Spanish-language database.]
Client query (French): SELECT Pays,Capital FROM Pays
Branch 1:
  1. Convert query from French to English: SELECT Country, Capital FROM Countries
  2. Run SQL query
     Country  Capital
     UK       London
     France   Paris
  3. Convert data from English to French
     Pays             Capital
     Grande-Bretagne  Londres
     France           Paris
Branch 2:
  1. Convert query from French to Spanish: SELECT País, Capital FROM Países
  2. Run SQL query
     País    Capital
     España  Madrid
     Italia  Roma
  3. Convert data from Spanish to French
     Pays       Capital
     l'Espagne  Madrid
     l'Italie   Rome
Join the data:
  Pays             Capital
  Grande-Bretagne  Londres
  France           Paris
  l'Espagne        Madrid
  l'Italie         Rome
How it appears to the client
[Diagram: the client submits a workflow to OGSA-DAI and receives a single result.]
workflow(SELECT Pays,Capital FROM Pays)
Pays             Capital
Grande-Bretagne  Londres
France           Paris
l'Espagne        Madrid
l'Italie         Rome
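Data integration with OGSA-DAI workflows
• Across OGSA-DAI services
[Diagram: two OGSA-DAI services, one in front of DB1 and one in front of DB2. Workflow 1: SQLQuery (DB1), Deliver to OGSA-DAI. Workflow 2: SQLQuery (DB2), Receive from OGSA-DAI, JOIN, Deliver. Data passes from one OGSA-DAI service to the other.]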
Key OGSA-DAI terms: activities, resources, workflows
OGSA-DAI: Key Term Activity
• An activity is a named unit of functionality
  o A well defined workflow unit
  o Pluggable
  o Composable
• An activity can have
  o 0 or more named inputs
  o 0 or more named outputs
• Blocks of data flow from an activity’s output into another activity’s input
OGSA-DAI: Key Term Activity (cont.)
• Example activities include
  o Execute an SQL query
  o ZIP a batch of data
  o List the files in a directory
  o Execute an XSL transform on an XML document
  o Deliver data to an FTP server
OGSA-DAI: Key Term Activity (cont.)
• Activity Connections
  o All required inputs must be connected
  o All outputs must be connected
  o Optional inputs
• Inputs
  o Literal
  o Streamed
  o Types
Data grouping: Lists
• Special blocks are used to mark the beginning and the end of a list.
• A list groups related data as one unit.
• For example, ReadFromFileActivity can dynamically take any number of filenames as input.
  o Without a way to group the output byte arrays, there would be no way to tell the binary data of file f1 apart from that of file f2.
  o Streaming is preserved: each file still produces a number of byte arrays that are forwarded to downstream activities.
[Diagram: input f1, f2 into ReadFromFileActivity; output [byte[]…], [byte[]…]]
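The grouping idea can be sketched in plain Java. This is not OGSA-DAI code: `ListMarkerSketch`, `LIST_BEGIN` and `LIST_END` are hypothetical names, and strings stand in for the byte arrays. It only illustrates how begin and end markers let a consumer recover which blocks belong to which file from a single flat stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names are not part of the OGSA-DAI API.
class ListMarkerSketch {
    // Marker objects standing in for the special list-begin and list-end blocks.
    static final Object LIST_BEGIN = new Object() {
        @Override public String toString() { return "LIST_BEGIN"; }
    };
    static final Object LIST_END = new Object() {
        @Override public String toString() { return "LIST_END"; }
    };

    // Flatten per-file chunks into one stream, wrapping each file's chunks in markers.
    static List<Object> toStream(List<List<String>> chunksPerFile) {
        List<Object> stream = new ArrayList<>();
        for (List<String> chunks : chunksPerFile) {
            stream.add(LIST_BEGIN);
            stream.addAll(chunks);
            stream.add(LIST_END);
        }
        return stream;
    }

    // Recover the per-file grouping from the flat stream by watching for the markers.
    static List<List<String>> fromStream(List<Object> stream) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = null;
        for (Object block : stream) {
            if (block == LIST_BEGIN) {
                current = new ArrayList<>();
            } else if (block == LIST_END) {
                groups.add(current);
                current = null;
            } else {
                current.add((String) block);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<String>> files =
            List.of(List.of("f1-chunk1", "f1-chunk2"), List.of("f2-chunk1"));
        List<Object> stream = toStream(files);
        System.out.println(stream);
        System.out.println(fromStream(stream));
    }
}
```

Because the markers travel in the same stream as the data, a downstream consumer can start work on the first chunk of f1 before f2 has been read at all.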
Passing data internally: OGSA-DAI Tuple
• A special type of data passed between activities
• A Tuple is a data representation similar to a row of relational data. Each element of a Tuple represents a column.
• Tuples are normally grouped in lists and they are preceded by a metadata block.
[Diagram: SqlQuery runs SELECT city, temp FROM weather; and outputs the tuples (Athens, 20), (Madrid, 22), (Rome, 25)]
An interesting activity: Tee
• There are activities that operate at the level of blocks and are not concerned with the type or values of the data they handle, e.g. TeeActivity:
[Diagram: TeeActivity with number of outputs set to 2; input [A,B,C,D]; each of the two outputs carries [A,B,C,D]]
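A minimal sketch of the Tee behaviour in plain Java, assuming an in-memory list stands in for the block stream. `TeeSketch` and `tee` are hypothetical names, not the server-side TeeActivity implementation. The method is generic because, as the slide says, a tee never inspects the blocks it copies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the OGSA-DAI TeeActivity implementation.
class TeeSketch {
    // Copy every input block, unchanged and in order, to each of the outputs.
    static <T> List<List<T>> tee(List<T> input, int numberOfOutputs) {
        List<List<T>> outputs = new ArrayList<>();
        for (int i = 0; i < numberOfOutputs; i++) {
            outputs.add(new ArrayList<>());
        }
        for (T block : input) {
            for (List<T> output : outputs) {
                output.add(block);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Two outputs, as in the diagram.
        System.out.println(tee(List.of("A", "B", "C", "D"), 2));
    }
}
```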
OGSA-DAI: Key Term Resource
• Data request execution resource (DRER)
• Data resources
• Data sources
• Data sinks
• Sessions
  o A state container associated with a set of workflows
  o One workflow can lodge state
  o A subsequent workflow can retrieve it
• Requests
  o One per workflow submitted to a DRER
  o Access request status
OGSA-DAI: Key Term Workflow
• A workflow can contain:
  o Activities
    • Resource-based: SQLQuery
    • Non-Resource: Transformation and Delivery
  o Resources
    • Targeted by Activities
  o Other Workflows
    • Sub workflows
    • Other types of workflow
OGSA-DAI: Key Term Workflow (cont.)
• OGSA-DAI can be used as a workflow processing system that is designed to stream data through a set of activities in a pipelined manner.
• In a Query->Transform->Deliver workflow, if the activities are well defined, all three process different portions of the data stream concurrently.
OGSA-DAI: Key Term Workflow (cont.)
• Pipeline workflow: a set of chained activities that are executed in parallel, with data flowing between the activities.
• Sequence workflow: all the sub-workflows added to this workflow are executed in sequence. For example, the 1st sub-workflow in a sequence creates a table and the 2nd bulk loads transformed data into this table.
• Parallel workflow: all the sub-workflows added to this workflow are executed in parallel.
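The pipelined Query->Transform->Deliver behaviour can be imitated with three threads connected by queues. This is a sketch in plain Java under stated assumptions, not how the OGSA-DAI engine is implemented: `PipelineSketch`, `run` and the `END` marker are hypothetical names, the "query" just replays a list of rows, and the "transform" upper-cases each row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; not the OGSA-DAI workflow engine.
class PipelineSketch {
    // End-of-stream marker passed down the pipeline after the last row.
    static final String END = "\u0000END";

    static List<String> run(List<String> rows) {
        BlockingQueue<String> queried = new LinkedBlockingQueue<>();
        BlockingQueue<String> transformed = new LinkedBlockingQueue<>();
        List<String> delivered = new ArrayList<>();

        // Stage 1: "query" emits rows one at a time.
        Thread query = new Thread(() -> {
            for (String row : rows) {
                queried.add(row);
            }
            queried.add(END);
        });
        // Stage 2: "transform" handles each row as soon as it arrives.
        Thread transform = new Thread(() -> {
            try {
                for (String row = queried.take(); !row.equals(END); row = queried.take()) {
                    transformed.add(row.toUpperCase(Locale.ROOT));
                }
                transformed.add(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Stage 3: "deliver" collects transformed rows in arrival order.
        Thread deliver = new Thread(() -> {
            try {
                for (String row = transformed.take(); !row.equals(END); row = transformed.take()) {
                    delivered.add(row);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        query.start();
        transform.start();
        deliver.start();
        try {
            query.join();
            transform.join();
            deliver.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("athens", "madrid", "rome")));
    }
}
```

All three stages are alive at the same time, so the transform can be working on one row while the query stage is still producing later ones. A sequence workflow would instead wait for each stage to finish before starting the next.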
Getting to the first practical: The OGSA-DAI client toolkit.
OGSA-DAI client toolkit
• OGSA-DAI client toolkit
  o Construct and submit requests in Java not XML
    • Toolkit manages interaction with web services via SOAP over HTTP; it handles SOAP request construction and response parsing.
  o Provides Java abstractions of
    • Services
    • OGSA-DAI resources and properties
    • Requests
    • Activities
The client toolkit
• The workflow description is sent to the OGSA-DAI server as an XML document.
• The application developer does not need to worry about creating this document.
• The client toolkit provides ways of assembling activity workflows programmatically.
• We will see how to use the client toolkit during the hands-on session.
Service/resource model
[Diagram labels: Client; Data Request Execution Service; Data Request Execution Resource (MyDRER); Data Resources One, Two and Three, each with its data; Sessions; Requests (MyRequest123456); Request Management Service]
Client Toolkit Activities
• One client activity per server activity
• Same input and output names
• Plus some convenience methods
For example:
• Retrieve results as a JDBC ResultSet from a TupleToWebRowSet activity.
• Retrieve update count as an Integer from a SQLUpdate activity
Step by Step Guide for Writing Clients
• Create activities
  o There’s a corresponding client toolkit activity for each server-side activity
DeliverToFTP deliver = new DeliverToFTP();
ReadFromFile readFile = new ReadFromFile();
Connecting activities
• Set inputs for each activity (e.g. parameters)
• Every input parameter can either be literal input or streamed from another activity
  o Literal inputs, e.g. for constant parameters:
deliver.addFilename("results1.txt");
deliver.addHost("[email protected]:21");
  o Connect input to the output of another activity to stream data
deliver.connectDataInput(readFile.getDataOutput());
Gaining access to the results
• If the output of an activity can be provided in a user-friendly type, then there are methods to access the results:
  o Check whether there are more results to be retrieved
boolean hasNext = sqlUpdate.hasNextResult();
  o Get the next result in a convenient type
int count = sqlUpdate.getNextResult();
Build and execute the Workflow Request
• Create a workflow and add activities to it
• A data service executes the workflow and returns a response (or an error!)
• The response may contain data (depending on the activities)
• Each client toolkit activity provides utility methods for retrieving its response data
First hands-on session
Go to: http://homepages.nesc.ac.uk/~elias/issgc09/html/practical.html
Extending OGSA-DAI: What
• OGSA-DAI
  o A Framework
  o Extensible
• Out of the box you get the basics
  o Different applications have different needs
  o New Sources of Data
  o New Functionality
Extending OGSA-DAI: Overview
[Diagram: OGSA-DAI architecture and its extensibility points]
• Presentation Layer: OMII, GT, Axis, UNICORE, WS-DAI, gLite, Embedded, ?
• OGSA-DAI Core: Workflow Execution Engine, Activity Framework, Request, Sessions, Persistence and Configuration
• Activities: SQLQuery, XPathQuery, XSLTransform, DeliverToURL, MyOwnActivity
• Data Resources, Data Source, Data Sink
• Extensibility points: New Types of Data, New Functionality, New Message Frameworks
Extending OGSA-DAI: Activities
• Activities do some unit of work
• Specific transformation
  o Data Format: SWISS-PROT to format X
• Delivery
  o Deliver to a target service
• Data analysis and Integration
  o Combine data from different sources
Extending OGSA-DAI: Resources
• New resources – why?
  o New Products
  o New Applications
  o Specialised Access
• Required:
  o DataResource
  o DataResourceState
  o ResourceAccessor
Extending OGSA-DAI: Remote Resource
• Accessing Resources on Remote OGSA-DAI
• Avoid replication of resources
• Security Issues
  o Devolved to Local OGSA-DAI
  o Security between OGSA-DAI Deployments
SQL views
• Define a drPatient view
  o SELECT patient.id, patient.name, age, sex, doctor.name AS drName FROM patient, doctor WHERE patient.DrID = doctor.ID;
• Client runs SELECT * FROM drPatient;
• Shorthand for complex query results
• Data access control e.g. users of drPatient
  o Cannot access a patient’s ZIP
  o Are unaware of the doctor or patient tables
patient table:
ID  Name   Age  Sex  ZIP        DrID
1   Ken    42   M    IL1478305  456
2   Josie  25   F    BN1 7QP    789
doctor table:
ID   Name      DN
123  Greene    US-Chicago-G
456  Ross      US-Chicago-R
789  Fairhead  UK-Holby-F
OGSA-DAI SQL views
• OGSA-DAI SQL views data resource
  o Represents a view across a database exposed by an OGSA-DAI relational resource
• SQLQuery activity
  o Parses query
  o Splices in view definition
  o Submits transformed query to database
• Can define views for read-only databases
• Schema transformation
  o Map a logical schema to a physical schema
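The "splice in view definition" step can be pictured as rewriting the client's query so that the view name is replaced by the view's defining query as a derived table. The sketch below is an assumption-laden stand-in, not the OGSA-DAI implementation: `ViewSpliceSketch` and `splice` are hypothetical names, and a regular expression replaces the real query parsing that the slide mentions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the real activity parses the SQL rather than using a regex.
class ViewSpliceSketch {
    // Replace "FROM <viewName>" with "FROM (<viewDefinition>) AS <viewName>".
    static String splice(String query, String viewName, String viewDefinition) {
        Pattern from = Pattern.compile(
            "\\bFROM\\s+" + Pattern.quote(viewName) + "\\b", Pattern.CASE_INSENSITIVE);
        String replacement = "FROM (" + viewDefinition + ") AS " + viewName;
        return from.matcher(query).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String viewDefinition =
            "SELECT patient.id, patient.name, age, sex, doctor.name AS drName "
            + "FROM patient, doctor WHERE patient.DrID = doctor.ID";
        // The database only ever sees the spliced query, so it needs no view of its own.
        System.out.println(splice("SELECT * FROM drPatient", "drPatient", viewDefinition));
    }
}
```

Since the rewrite happens before the query reaches the database, nothing has to be created in the database itself, which is why views can be defined over read-only databases.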
Distributed query processing
• OGSA-DQP
  o Developed by Universities of Manchester and Newcastle
  o Refactored for OGSA-DAI 3.0 by EPCC as part of the NextGrid project
  o OGSA-DAI DQP package
• Multiple tables on multiple databases are exposed to clients as multiple tables in one “virtual database”
• Clients are unaware of the multiple databases
• Databases can be exposed
  o EITHER within one OGSA-DAI server
  o OR via multiple remote OGSA-DAI servers
OGSA-DAI DQP
[Diagram: a client sends a query to OGSA-DAI (core + DQP coordinator), which works with OGSA-DAI (DQP query evaluator) and two further OGSA-DAI servers in front of the databases]
1: Client submits SELECT Archeo_Finds.ID, Archeo_Finds.Provenance, Annotations_Ratings.Confidence FROM Annotations_Ratings, Archeo_Finds WHERE Annotations_Ratings.Confidence > 0.99 AND Annotations_Ratings.ID = Archeo_Finds.ID;
2: Parse query and form query plan
3: Execute sub-queries
  3a: SELECT Archeo_Finds.ID, Archeo_Finds.Provenance FROM Archeo_Finds;
  3b: SELECT Annotations_Ratings.ID, Annotations_Ratings.Confidence FROM Annotations_Ratings WHERE Annotations_Ratings.Confidence > 0.99
4: Push results
5: Combine and post-process (do the JOIN), then return the results to the client
OGSA-DAI workflows – a de-facto standard
• OGSA-DAI workflows are a de-facto standard
  o Of use to many projects as we’ll see
• For some applications workflows are too powerful
  o Too expressive
  o Infer semantics from names of activities available on server
    • Must interrogate the server
  o Problems using OGSA-DAI services in workflow engines e.g. Taverna
  o Not compatible with existing data analysis tools
Facades
• Define facades on top of OGSA-DAI
• Why?
  o Provide interfaces with more tightly-defined semantics
  o Comply with standards
  o Exploit existing data analysis tools
• Continue to exploit the power of workflows under-the-hood
  o “Canned workflows”
  o Templates selected and populated, executed and parsed
  o Map service operations to “template” OGSA-DAI workflows
Grid-enabling existing data-related products
[Diagram labels: Data analysis tool; OGSA-DAI mediator; OGSA-DAI]
OGSA-DAI in action
VOTES – data with different schema distributed across multiple databases within a group of strategic partners
• Virtual Organisations for Trials and Epidemiological Studies (VOTES)
  o http://labserv.nesc.gla.ac.uk/projects/votes/index.html
  o UK Medical Research Council project
• Data access and integration in the clinical domain
  o Relational databases – Microsoft SQL Server, Access, …
  o Distributed database joins
    • Patient information
    • Clinical trials records
  o Linking key is Scotland’s CHI number
VOTES – cross-database join activity
• This is equivalent to running:
SELECT patients.chi, sex, DOB, diagnosis FROM patients, trialX WHERE patients.chi = trialX.chi;
• patients and trialX are in two different databases
[Diagram: an OGSA-DAI workflow over DB1 and DB2]
• SQLQuery(DB1): SELECT CHI, Sex, DOB FROM Patients ORDER BY CHI, producing (CHI, Sex, DOB)
• SQLQuery(DB2): SELECT CHI, Diagnosis FROM TrialX ORDER BY CHI, producing (CHI, Diagnosis)
• MergeJoin: combines the two ordered data streams into (CHI, Sex, DOB, Diagnosis)
• Deliver
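The reason both sub-queries use ORDER BY CHI is that a merge join can then walk the two streams once, in step. The following is a plain Java sketch of that technique, not the VOTES activity itself: `MergeJoinSketch` and `mergeJoin` are hypothetical names, the rows are made-up values rather than real CHI numbers, and it assumes each key appears at most once in the patients stream while the trial stream may repeat a key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the VOTES MergeJoin activity.
class MergeJoinSketch {
    // patients rows: {chi, sex, dob}; trial rows: {chi, diagnosis}.
    // Both inputs must already be sorted by chi (string order), patients with unique keys.
    static List<String[]> mergeJoin(List<String[]> patients, List<String[]> trial) {
        List<String[]> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < patients.size() && j < trial.size()) {
            String[] p = patients.get(i);
            String[] t = trial.get(j);
            int cmp = p[0].compareTo(t[0]);
            if (cmp < 0) {
                i++;            // patient has no (more) trial rows
            } else if (cmp > 0) {
                j++;            // trial row has no matching patient
            } else {
                joined.add(new String[] { p[0], p[1], p[2], t[1] });
                j++;            // stay on the same patient for repeated trial rows
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> patients = List.of(
            new String[] { "c1", "F", "1970-01-01" },
            new String[] { "c2", "M", "1980-02-02" },
            new String[] { "c3", "F", "1990-03-03" });
        List<String[]> trial = List.of(
            new String[] { "c1", "flu" },
            new String[] { "c3", "asthma" },
            new String[] { "c3", "flu" });
        for (String[] row : mergeJoin(patients, trial)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Each input is read once from front to back, so the join never needs to hold a whole table in memory, which fits the streaming model described earlier.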
Public Health Grid – data with different schema distributed across multiple databases within a group of strategic partners
• US Public Health Grid
  o US Centers for Disease Control
  o University of Pittsburgh
  o Tarrant County Public Health Department
  o Dallas County Public Health Department
• Real-time Outbreak and Disease Surveillance
  o Health query system
  o Look for incidences of some disease on the rise over an area
  o Historical and live data
• Health centres maintain their own databases
  o Distributed databases
  o Different products and schemas
    • e.g. PatientID, Id, PatientIdentifier, PatientNumber
  o Security and privacy are important
Public Health Grid – workflows, DQP and views
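[Diagram: an OGSA-DAI workflow runs SQLQuery(DB6) against a view built with OGSA-DQP over databases behind several OGSA-DAI servers. Labels: DB1, DB2, DB3 View, DB4, DB5, DB6 View, OGSA-DQP, OGSA-DAI (three instances)]
Query: SELECT zip, count(*) AS total FROM Cases WHERE Reason = 'Flu' GROUP BY zip ORDER BY zip
Example result: (15112, 3), (15144, 1)
View definition: Cases: SELECT * FROM DB1.Cases UNION DB2.Cases UNION DB4.Cases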
SEE-GEO – working with private and public data
• SEcurE access to GEOspatial services
  o http://edina.ac.uk/projects/seesaw/seegeo/index.html
  o EDINA, MIMAS, NeSC, NCeSS
  o UK JISC project
• Geographical information systems
• Virtual integration of and access control to
  o Census data – geo-data access service
  o Borders data – web feature service
  o Data hosted by other organisations and exposed as services
SEE-GEO – geo-linking service portal
[Diagram: the GLS Portal drives an OGSA-DAI workflow (Get, Get, Join, Transform, Deliver) over the MIMAS Census and UK BORDERS services, with an Image Creation Service and a map server (Maps)]
1: GLSQuery submitted via portal e.g. “Leeds population distribution by census output area”
2: Workflow is populated with query parameters and run
3: Image is placed on a map server
4: URL of image is returned to portal – avoids costly SOAP/HTTP transfer of image
5: Portal gets image using URL
Why OGSA-DAI?
Workflows
• A workflow can represent a complex data management scenario, involving:
  o Data access
  o Transformation
  o Filtering
  o Updating
  o Numerous distributed, heterogeneous databases
Workflows and performance
• OGSA-DAI is one more layer between clients and data
• Therefore, OGSA-DAI is not as fast as a direct connection to a database
o OGSA-DAI uses JDBC so will never be as fast as a direct JDBC connection
• But this is not what OGSA-DAI is designed to do
Workflows and performance
• Having a server execute workflows yields
  o Thinner clients with less memory and CPU requirements
  o Minimised client-server communication overheads
• Activities process data on the server
  o Minimises data movement
  o As opposed to BPEL or Taverna or web service-based workflow engines which pass data to and fro via web services
• Data streaming
  o Activities work on different parts of the data stream in parallel
  o Reduces memory footprint on server
  o Reduces execution time
Why another layer can be good
• Data providers retain control of their data
• A place to hide database heterogeneities
  o Yields thinner clients
• A place to enforce additional security
  o Hide the actual location of the data
  o Filter the data according to the rights of clients
  o Manage access to federations, databases, tables, documents, files, rows, lines
• A place to define views on read-only databases
Developing applications
• OGSA-DAI is highly extensible
  o Data resources, activities, security, presentation layers
• An enabling framework
  o Save development time
  o Focus on application-specific features
  o Get standard functionalities out-of-the-box
    • Queries, updates, transformations, deliveries
Portability
• OGSA-DAI is 100% Java
  o Runs under Windows, UNIX, Linux
• OGSA-DAI uses web services
  o Clients can be written in any language and on any platform that supports web services
Second and third hands-on sessions
Go to: http://homepages.nesc.ac.uk/~elias/issgc09/html/practical.html#ScenarioTwoDataIntegration
Further information
• WWW site: http://www.ogsadai.org.uk
• Info: [email protected]
• Users e-mail list: [email protected]