neil chue hong project manager, epcc n.chuehong@epcc.ed.ac.uk +44 131 650 5957 ogsa-dai requirements...

Neil Chue Hong
Project Manager, EPCC

N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957

OGSA-DAI Requirements Gathering Exercise

2nd DIALOGUE workshop

eSI, 9-10 February 2006


OGSA-DAI Requirements Gathering

• Aims
– learn more about the data access and integration challenges that other projects are facing
– use this information to inform the future development of the OGSA-DAI software
• Timescale
– Nov 2005 – Jan 2006
• Gatherers
– Ally Hume
– Amy Krause
– Tom Sugden

Projects

• AstroGrid – (www.astrogrid.org) – distributed queries over large astronomy databases.
• AutoMed and ISpider – (www.doc.ic.ac.uk/automed/) and (www.ispider.man.ac.uk) – model-based data integration and Grid-based informatics platform for proteomics.
• CancerGrid – (www.cancergrid.org) – storage and analysis of distributed data containing clinical trial and lab data.
• ESSC – (www.nerc-essc.ac.uk) – environmental and atmospheric simulations.
• Gold – (www.goldproject.ac.uk) – provides infrastructure for virtual organisations.
• NTRAC – (www.ntrac.org.uk) – similar to CancerGrid.

Structure of Meeting Reports

• Data
– the kind of data that the project is concerned with, including the structure, quantity and types of data resource.
• Queries
– the types of queries that are performed against this data, including the query languages used and the typical size of result sets.
• The problem
– the main problems that the project is currently facing with regard to data access and integration.
• What Can OGSA-DAI Provide?
– the functionality that the project would like OGSA-DAI to provide.
• Checklist
– summarises the importance of various aspects of data access and integration for the project.

AstroGrid

• a number of distributed databases, each of which contains astronomical data captured from different modalities
• Almost all the tables in these databases contain a spatial coordinate of each feature and some numerical attributes associated with that feature.
• want to do distributed queries using their algorithmic domain-specific joins.

Public/Internal: Public services.
Data Movement: Efficient data movement is crucially important.
Data Replication: Replication was not considered important.
Data Discovery: Data discovery is not an issue as they already have data discovery mechanisms.
Transactions: Transactions were not considered to be very important. Most of the queries are read-only, apart from the production of temporary tables. If there is an error they are happy to start the query again.
Security: They considered security to be a real concern, but this seemed to be more of a resource usage issue than a data encryption one. Here they were talking about issues such as restricting the amount of data a user's query can write to a database.
Reliability: There was an emphasis on production-level functionality such as:
– ability to kill queries
– reduced requirement for server restart
Scalability: The AstroGrid systems must scale well to support many large distributed databases.
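A domain-specific join of the kind AstroGrid describes could be a positional cross-match, which pairs features from two catalogues when their sky coordinates fall within a small angular radius. The sketch below is illustrative only and is not AstroGrid's algorithm: the catalogue layout (identifier, right ascension and declination in degrees), the function names and the brute-force double loop are all assumptions.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays accurate for very small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_match(catalogue_a, catalogue_b, radius_deg):
    """Join rows of two catalogues whose positions lie within radius_deg."""
    matches = []
    for id_a, ra_a, dec_a in catalogue_a:
        for id_b, ra_b, dec_b in catalogue_b:
            if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                matches.append((id_a, id_b))
    return matches

# Hypothetical catalogues standing in for two distributed databases.
optical = [("opt-1", 10.0000, 20.0000), ("opt-2", 150.0, -5.0)]
radio = [("rad-1", 10.0002, 20.0001), ("rad-2", 200.0, 45.0)]

# Match radius of 5 arcseconds, expressed in degrees.
pairs = cross_match(optical, radio, radius_deg=5 / 3600)
```

A production system would replace the double loop with a spatial index, and would have to decide which catalogue's rows to move between sites, which is where the data movement and temporary table concerns above come in.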

AutoMed and ISpider

• middleware to transform schemas from different data sources (relational databases, XML documents, etc.) and evaluate distributed queries expressed in their own IQL language.
• By creating a path of schema-transformations, it is possible to federate multiple data sources so that they appear as a single data source to the user
• how to optimise distributed queries using metadata such as data size, occurrence of indexes, performance rates, etc.
• how to fit AutoMed into a grid architecture

Public/Internal: Public services.
Data Movement: Efficient data movement is required. They are interested in XML compression algorithms.
Data Replication: Important but not a part of ISpider or AutoMed specifically, though they may want to replicate data for query optimisation purposes.
Data Discovery: The AutoMed APIs already provide a form of registry, but in the future a web service interface is envisaged.
Security: Academics are fairly relaxed but in the broader scheme this is very important, particularly for use on medical data.
Scalability: Important
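The idea of federating sources through a path of schema transformations can be sketched as follows. This is a toy illustration, not the AutoMed API or IQL: the step functions (`rename`, `derive`, `drop`), the two example sources and the target schema (`protein`, `mass`) are all hypothetical.

```python
def rename(old, new):
    """Transformation step: rename one attribute."""
    def step(row):
        row = dict(row)
        row[new] = row.pop(old)
        return row
    return step

def derive(new, fn):
    """Transformation step: add an attribute computed from the row."""
    def step(row):
        return {**row, new: fn(row)}
    return step

def drop(name):
    """Transformation step: remove an attribute."""
    def step(row):
        return {k: v for k, v in row.items() if k != name}
    return step

def apply_pathway(pathway, rows):
    """Apply each transformation step in order to every row of a source."""
    for step in pathway:
        rows = [step(row) for row in rows]
    return rows

# Two hypothetical sources with different local schemas.
lab_a = [{"prot_id": "P1", "mass_da": 50000.0}]
lab_b = [{"accession": "P2", "mass_kda": 12.5}]

# One pathway per source, each ending in the same target schema.
pathways = {
    "lab_a": [rename("prot_id", "protein"), rename("mass_da", "mass")],
    "lab_b": [rename("accession", "protein"),
              derive("mass", lambda row: row["mass_kda"] * 1000),
              drop("mass_kda")],
}

# After transformation the sources look like a single data source.
federated = (apply_pathway(pathways["lab_a"], lab_a)
             + apply_pathway(pathways["lab_b"], lab_b))
heavy = [row["protein"] for row in federated if row["mass"] > 20000]
```

The sketch moves every row before filtering; the optimisation question raised above is precisely how to use metadata to avoid doing that.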

CancerGrid

• By analysing laboratory data and correlating it with hospital and trials data, it is hoped that new subsets of patients can be discovered who respond best to particular treatments
• Security is a major concern because many of the owners of data are aware of the value of their data and consequently are concerned about who has access to it.
• A good means of transforming trial forms (XML documents) into a format suitable for automatic insertion into relational tables is required.

Public/Internal: Public but only accessible by certain users.
Data Movement: Important for distributed data integration use-case.
Data Replication: No real requirement at the moment.
Data Discovery: They envisage a peer-to-peer system for data discovery, but also plan to use the national registries that are being developed elsewhere.
Transactions: Distributed transactions are not a concern at the moment. Updates take place daily at most and queries will not take place during updates.
Security: It is vital to be able to expose subsets of data to particular users. This is not just because of patient confidentiality (data is anonymised before it reaches CancerGrid), but also because of commercial interests.
Reliability: Data integrity is important, but there is no need for 24hr services, so downtime is not a problem.
Scalability: Must scale to many databases distributed around the world.
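The transformation of XML trial forms into relational rows mentioned for CancerGrid might look like the sketch below. The form layout, element and attribute names, and the `trial_entry` table are invented for illustration; real trial forms will follow a different schema. Only the Python standard library is used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical trial form, not a real CancerGrid document.
FORM_XML = """
<trialForm trialId="T-001">
  <patient id="P-17"><field name="age">54</field><field name="stage">II</field></patient>
  <patient id="P-18"><field name="age">61</field><field name="stage">III</field></patient>
</trialForm>
"""

def form_to_rows(xml_text):
    """Flatten one trial form into (trial_id, patient_id, age, stage) tuples."""
    root = ET.fromstring(xml_text.strip())
    trial_id = root.get("trialId")
    rows = []
    for patient in root.findall("patient"):
        values = {f.get("name"): f.text for f in patient.findall("field")}
        rows.append((trial_id, patient.get("id"), int(values["age"]), values["stage"]))
    return rows

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trial_entry (trial_id TEXT, patient_id TEXT, age INTEGER, stage TEXT)"
)

rows = form_to_rows(FORM_XML)
with conn:  # commits on success, rolls back if an insert fails
    conn.executemany("INSERT INTO trial_entry VALUES (?, ?, ?, ?)", rows)

stored = conn.execute(
    "SELECT patient_id, age, stage FROM trial_entry ORDER BY patient_id"
).fetchall()
```

Because updates happen at most daily and never overlap with queries, a single local transaction per form, as here, may be enough; no distributed transaction is assumed.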

ESSC

• dealing with large data sets of between 2 and 3 terabytes, stored mostly on a single machine. The user requests portions of data, often assembled from various files.
• Uniform web service interfaces are provided for accessing data sets using the standard APIs associated with the binary data file formats that are used (netCDF, GRIB, HDF, etc.).
• The queries used by ESSC are currently synchronous, which causes request timeout problems when the resulting datasets are large. Sceptical of current WS-Notification implementations that require open ports on client machines.

Public/Internal: Both public and commercial.
Data Movement: Efficient movement of large binary files (GBs) is required.
Data Replication: Not important because this is handled by the NERC data grid and datasets are generally copies of Met Office data.
Data Discovery: The NERC data grid solves this problem using 4 levels of metadata, expressed in XML.
Transactions: Transactional updates are quite important internally, but not important for end users.
Security: Essential because of commercial nature of data.
Reliability: Their services are fairly static and restarts are infrequent. The metadata store can already be updated without restarting services.
Scalability: Linear scalability is desired for data extraction. The current file APIs scale well.
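One way round the synchronous timeout problem described for ESSC is to submit the query, return at once, and let the client poll for status, which needs no open ports on the client. The sketch below assumes a hypothetical `QueryJob` class and status names; it is not an OGSA-DAI or WS-Notification interface, and the `time.sleep` call merely stands in for reading one portion of data.

```python
import threading
import time

class QueryJob:
    """A long-running query that the client can poll or terminate (hypothetical API)."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._cancel = threading.Event()
        self._lock = threading.Lock()
        self._status = "PENDING"
        self._results = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Begin work in the background and return immediately."""
        with self._lock:
            self._status = "RUNNING"
        self._thread.start()
        return self

    def _run(self):
        for chunk in self._chunks:
            if self._cancel.is_set():
                with self._lock:
                    self._status = "TERMINATED"
                return
            time.sleep(0.01)  # stand-in for extracting one portion of data
            with self._lock:
                self._results.append(chunk)
        with self._lock:
            self._status = "COMPLETED"

    def poll(self):
        """Return (status, number of chunks produced so far)."""
        with self._lock:
            return self._status, len(self._results)

    def terminate(self):
        """Ask the query to stop at the next chunk boundary and wait for it."""
        self._cancel.set()
        self._thread.join()

# Client polls until the short query finishes.
job = QueryJob(range(5)).start()
while job.poll()[0] == "RUNNING":
    time.sleep(0.005)
final_status, chunk_count = job.poll()

# Client gives up on a long query at an intermediate stage.
cancelled = QueryJob(range(1000)).start()
cancelled.terminate()
```

Cancellation here is cooperative: the worker only checks the flag between chunks, so a single very slow chunk would still delay termination.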

Gold

• develop an infrastructure to facilitate collaboration within virtual organisations
• Data storage services will be used for capturing interactions amongst parties of a VO in order to facilitate auditing and VO-playback.
• Data analysis services will be used for performing particular types of analysis of data existing mostly in relational database back-ends.
• primary concern is managing security policies and service access rights of different types of user dynamically.

NTRAC

• build platforms to bring different systems together
• Many of the data resources that they are accessing are stored in private networks (e.g. NHS patient information) with no open gateway to the public.
• Researchers want to mine the data to find people to recruit into studies.

Public/Internal: Internal.
Data Movement: Not a major concern. Even if cross-site, people are not normally interested in the raw data.
Data Replication: Mainly for backup purposes, not load-balancing.
Data Discovery: The Scottish Executive and NCRN will be running a registry of trials.
Transactions: -
Security: Important because of private and commercial nature of some data.
Reliability: There are usually sporadic queries made against a dataset that does not change very quickly. NTRAC just recruits patients into the process so they are not concerned with reliability at the moment, but when getting involved in patient care, this is most important.
Scalability: -

Prioritised Requirements

ID: Requirement (Priority)
R1: Efficient transportation of large quantities of data between heterogeneous data resources. (High)
R2: Data federation and distributed query processing across heterogeneous data resources.
R3: An asynchronous model for processing large, long-running queries where the client can poll or be notified of the query status and the query can be terminated at an intermediate stage.
R4: The ability to provide different views of data resources to different users in a secure, DBMS-independent manner and to manage these views dynamically.
R5: Security/certificate delegation to allow access to other networks and role-based data access rules.
R6: Provision of more extensive database metadata capabilities, in particular with the inclusion of statistics relevant to query optimisation such as table size, occurrence of indexes and performance rates. (Medium)
R7: Support for a unified query language (RDBMS-neutral), possibly through integration with Hibernate. (Medium)
R8: Extensible join criteria for data integration, including support for spatial joins. (Medium)
R9: The ability to limit the size of updates to data resources. For example, the size of temporary tables created during a SkyQuery-style [Ref] distributed query. (Medium)
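Requirement R4 asks for per-user views that do not depend on any one DBMS. A minimal sketch of that idea, applied in the middleware layer rather than in the database, is shown below. The role names, the policy structure and the record fields are all hypothetical, and authentication (R5) is assumed to have happened already.

```python
# Hypothetical policy table: the columns each role may see and an optional row filter.
POLICIES = {
    "clinician": {
        "columns": ["patient_id", "age", "treatment", "response"],
        "row_filter": None,
    },
    "commercial_partner": {
        "columns": ["treatment", "response"],
        "row_filter": lambda row: row["consented_commercial"],
    },
}

def view_for(role, rows, policies=POLICIES):
    """Return only the rows and columns that the given role is allowed to see."""
    if role not in policies:
        raise PermissionError(f"no view defined for role {role!r}")
    policy = policies[role]
    keep = policy["row_filter"] or (lambda row: True)
    return [{col: row[col] for col in policy["columns"]} for row in rows if keep(row)]

# Invented records standing in for the result of a query on any back-end.
records = [
    {"patient_id": "P-17", "age": 54, "treatment": "A", "response": "good",
     "consented_commercial": True},
    {"patient_id": "P-18", "age": 61, "treatment": "B", "response": "poor",
     "consented_commercial": False},
]

clinician_view = view_for("clinician", records)
partner_view = view_for("commercial_partner", records)

# Views can be managed dynamically: adding a role needs no change to the database.
POLICIES["auditor"] = {"columns": ["patient_id"], "row_filter": None}
auditor_view = view_for("auditor", records)
```

Filtering after the query is simple but moves data the user may not see; pushing the policy down into the generated query would serve R1 better at the cost of DBMS-specific handling.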

Notes on requirements

• Prioritised based on a judgement of their importance to the various projects that were investigated.
– Whether or not they are within the scope of the OGSA-DAI project, or have already been satisfied by OGSA-DAI, is not considered here.
• Frequent mention of the non-functional requirement: ease-of-use.
– Some concern that installation and configuration remain too complex when compared with typical WAR-based web service deployment.
• Hope to publish the full document in the near future
– let me know if you want a copy

Conclusions

• Efficient transportation of large quantities of data between heterogeneous data resources is a crucial requirement for several projects from distinct domains.
– This is also an implicit requirement for projects requiring data federation and distributed query processing.
– If we could solve this problem, it would be of great benefit to these projects, and also to higher-level middleware projects such as OGSA-DQP
• Security remains a major concern because of the commercial and sensitive nature of much data
– want a generalised, role-based mechanism for exposing different views of data resources to different users, and managing these views dynamically.
– is this outside the scope of data integration middleware?
• While we were previously aware of most of the requirements described in this document, associating them with actual projects can help with prioritisation.

Further information

• The OGSA-DAI Project Site:
– http://www.ogsadai.org.uk
• The DAIS-WG site:
– http://forge.gridforum.org/projects/dais-wg/
• OGSA-DAI Users Mailing list
– users@ogsadai.org.uk
– General discussion on grid DAI matters
• Formal support for OGSA-DAI releases
– http://bugs.ogsadai.org.uk/
• OGSA-DAI training courses
