Neil Chue Hong, Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk, +44 131 650 5957
OGSA-DAI Requirements
Gathering Exercise
2nd DIALOGUE workshop
eSI, 9-10 February 2006
OGSA-DAI Requirements Gathering
• Aims
– learn more about the data access and integration challenges that other projects are facing
– use this information to inform the future development of the OGSA-DAI software
• Timescale
– Nov 2005 – Jan 2006
• Gatherers
– Ally Hume
– Amy Krause
– Tom Sugden
Projects
• AstroGrid (www.astrogrid.org) – distributed queries over large astronomy databases.
• AutoMed and ISpider (www.doc.ic.ac.uk/automed/ and www.ispider.man.ac.uk) – model-based data integration and a Grid-based informatics platform for proteomics.
• CancerGrid (www.cancergrid.org) – storage and analysis of distributed clinical trial and laboratory data.
• ESSC (www.nerc-essc.ac.uk) – environmental and atmospheric simulations.
• GOLD (www.goldproject.ac.uk) – provides infrastructure for virtual organisations.
• NTRAC (www.ntrac.org.uk) – similar to CancerGrid.
Structure of Meeting Reports
• Data
– the kind of data that the project is concerned with, including the structure, quantity and types of data resource.
• Queries
– the types of queries performed against this data, including the query languages used and the typical size of result sets.
• The problem
– the main problems that the project is currently facing with regard to data access and integration.
• What Can OGSA-DAI Provide?
– the functionality that the project would like OGSA-DAI to provide.
• Checklist
– summarises the importance of various aspects of data access and integration for the project.
AstroGrid
• A number of distributed databases, each of which contains astronomical data captured from different modalities.
• Almost all the tables in these databases contain a spatial coordinate for each feature and some numerical attributes associated with that feature.
• Want to do distributed queries using their own algorithmic, domain-specific joins.
Public/Internal: Public services.
Data Movement: Efficient data movement is crucially important.
Data Replication: Replication was not considered important.
Data Discovery: Data discovery is not an issue as they already have data discovery mechanisms.
Transactions: Transactions were not considered to be very important. Most of the queries are read-only, apart from the production of temporary tables. If there is an error they are happy to start the query again.
Security: They considered security to be a real concern, but this seemed to be more a resource usage issue than a data encryption issue, for example restricting the amount of data a user's query can write to a database.
Reliability: There was an emphasis on production-level functionality, such as the ability to kill queries and a reduced requirement for server restarts.
Scalability: The AstroGrid systems must scale well to support many large distributed databases.
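The kind of domain-specific join AstroGrid describes can be illustrated with a positional cross-match: two catalogues are joined not on an equal key but on sky positions that fall within a small angular radius of each other. The following is only a minimal sketch under stated assumptions. The column names (`ra`, `dec`), the catalogue rows and the 2 arcsecond radius are all invented for illustration and do not come from AstroGrid, and a real system would use a spatial index rather than a nested loop.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for the very small angles used in cross-matching.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def spatial_join(left, right, radius_deg):
    """Pair each row of `left` with every row of `right` within radius_deg.

    Naive O(n*m) nested loop, for illustration only.
    """
    return [
        (a, b)
        for a in left
        for b in right
        if angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"]) <= radius_deg
    ]


# Hypothetical rows from two catalogues held in different databases.
optical = [
    {"id": "o1", "ra": 10.0000, "dec": 20.0000, "mag": 17.2},
    {"id": "o2", "ra": 50.0000, "dec": -5.0000, "mag": 19.8},
]
xray = [
    {"id": "x1", "ra": 10.0002, "dec": 20.0001, "flux": 3.1e-13},
    {"id": "x2", "ra": 120.0000, "dec": 45.0000, "flux": 8.4e-14},
]

matches = spatial_join(optical, xray, radius_deg=2.0 / 3600)  # 2 arcseconds
print([(a["id"], b["id"]) for a, b in matches])  # [('o1', 'x1')]
```

The join predicate is an arbitrary function of both rows, which is why such joins cannot be expressed as a plain equality join and need an extensible join mechanism.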
AutoMed and ISpider
• Middleware to transform schemas from different data sources (relational databases, XML documents, etc.) and evaluate distributed queries expressed in their own IQL language.
• By creating a path of schema transformations, it is possible to federate multiple data sources so that they appear as a single data source to the user.
• How to optimise distributed queries using metadata such as data size, occurrence of indexes, performance rates, etc.
• How to fit AutoMed into a grid architecture.
Public/Internal: Public services.
Data Movement: Efficient data movement is required. They are interested in XML compression algorithms.
Data Replication: Important but not a part of ISpider or AutoMed specifically, though they may want to replicate data for query optimisation purposes.
Data Discovery: The AutoMed APIs already provide a form of registry, but in the future a web service interface is envisaged.
Security: Academics are fairly relaxed but in the broader scheme this is very important, particularly for use on medical data.
Scalability: Important.
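To make concrete how metadata such as data size, index availability and performance rates could feed a distributed query optimiser, here is a small sketch. It is not AutoMed's or OGSA-DAI's optimiser: the `TableStats` fields, the `plan_join` function and the figures are hypothetical, and the cost model (ship whichever table is cheaper to move, then use an index at the destination if one exists) is deliberately simplistic.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical per-table metadata that a data service might publish."""
    name: str
    rows: int
    avg_row_bytes: int
    indexed_on_join_key: bool
    link_bytes_per_sec: float  # measured rate for shipping data out of this site

    def transfer_seconds(self) -> float:
        return self.rows * self.avg_row_bytes / self.link_bytes_per_sec


def plan_join(a: TableStats, b: TableStats) -> dict:
    """Choose which table to ship for a two-site join, using only the metadata."""
    shipped, stays = (a, b) if a.transfer_seconds() <= b.transfer_seconds() else (b, a)
    return {
        "ship": shipped.name,
        "join_at": stays.name,
        # With an index at the destination, each shipped row can be looked up directly.
        "method": "index lookup" if stays.indexed_on_join_key else "hash join",
        "estimated_transfer_s": shipped.transfer_seconds(),
    }


# Invented statistics for two tables held at different sites.
proteins = TableStats("proteins", rows=2_000_000, avg_row_bytes=500,
                      indexed_on_join_key=True, link_bytes_per_sec=10_000_000.0)
samples = TableStats("samples", rows=50_000, avg_row_bytes=200,
                     indexed_on_join_key=False, link_bytes_per_sec=5_000_000.0)

plan = plan_join(proteins, samples)
print(plan["ship"], "->", plan["join_at"], "via", plan["method"])
```

Without such statistics the planner has no basis for choosing between the two shipping directions, which in this made-up example differ by a factor of fifty in transfer time.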
CancerGrid
• By analysing laboratory data and correlating it with hospital and trials data, it is hoped that new subsets of patients can be discovered who respond best to particular treatments.
• Security is a major concern because many of the owners of data are aware of the value of their data and consequently are concerned about who has access to it.
• A good means of transforming trial forms (XML documents) into a format suitable for automatic insertion into relational tables is required.
Public/Internal: Public but only accessible by certain users.
Data Movement: Important for the distributed data integration use case.
Data Replication: No real requirement at the moment.
Data Discovery: They envisage a peer-to-peer system for data discovery, but also plan to use the national registries that are being developed elsewhere.
Transactions: Distributed transactions are not a concern at the moment. Updates take place daily at most and queries will not take place during updates.
Security: It is vital to be able to expose subsets of data to particular users. This is not just because of patient confidentiality (data is anonymised before it reaches CancerGrid), but also because of commercial interests.
Reliability: Data integrity is important, but there is no need for 24hr services, so downtime is not a problem.
Scalability: Must scale to many databases distributed around the world.
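The XML-to-relational step that CancerGrid asks for can be sketched with the Python standard library alone. This is an illustrative assumption, not CancerGrid's form schema: the element and attribute names (`trialForm`, `patient`, `field`), the sample values and the `trial_entry` table are all made up, and an in-memory SQLite database stands in for the real relational store.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical trial form; the real CancerGrid form schema is not given in the source.
FORM_XML = """
<trialForm trialId="T-001">
  <patient id="P-17">
    <field name="age">54</field>
    <field name="treatment_arm">B</field>
  </patient>
  <patient id="P-18">
    <field name="age">61</field>
    <field name="treatment_arm">A</field>
  </patient>
</trialForm>
"""


def form_to_rows(xml_text):
    """Flatten one XML form into (trial_id, patient_id, age, treatment_arm) tuples."""
    root = ET.fromstring(xml_text)
    trial_id = root.get("trialId")
    for patient in root.iter("patient"):
        fields = {f.get("name"): f.text for f in patient.iter("field")}
        yield (trial_id, patient.get("id"), int(fields["age"]), fields["treatment_arm"])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trial_entry "
    "(trial_id TEXT, patient_id TEXT, age INTEGER, treatment_arm TEXT)"
)
# Parameterised insert: values taken from the form are never spliced into the SQL text.
conn.executemany("INSERT INTO trial_entry VALUES (?, ?, ?, ?)", form_to_rows(FORM_XML))
conn.commit()

rows = conn.execute(
    "SELECT patient_id, age, treatment_arm FROM trial_entry ORDER BY patient_id"
).fetchall()
print(rows)  # [('P-17', 54, 'B'), ('P-18', 61, 'A')]
```

A production mapping would normally be driven by the form's schema instead of hard-coded field names, so that new form types do not require new code.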
ESSC
• Dealing with large data sets of between 2 and 3 terabytes, stored mostly on a single machine. The user requests portions of data, often assembled from various files.
• Uniform web service interfaces are provided for accessing data sets using the standard APIs associated with the binary data file formats that are used (netCDF, GRIB, HDF, etc.).
• The queries used by ESSC are currently synchronous, which causes request timeout problems when the resulting datasets are large. Sceptical of current WS-Notification implementations that require open ports on client machines.
Public/Internal: Both public and commercial.
Data Movement: Efficient movement of large binary files (GBs) is required.
Data Replication: Not important because this is handled by the NERC data grid and datasets are generally copies of Met Office data.
Data Discovery: The NERC data grid solves this problem using 4 levels of metadata, expressed in XML.
Transactions: Transactional updates are quite important internally, but not important for end users.
Security: Essential because of commercial nature of data.
Reliability: Their services are fairly static and restarts are infrequent. The metadata store can already be updated without restarting services.
Scalability: Linear scalability is desired for data extraction. The current file APIs scale well.
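The timeout problem ESSC describes is the usual motivation for an asynchronous request model: the client submits a query, receives a job identifier immediately, and then polls for status, so no single request stays open for the length of the extraction and no inbound port is needed on the client. The sketch below is a single-process illustration using Python threads. The `QueryService` class, its method names and the status strings are hypothetical and are not part of OGSA-DAI or of ESSC's services.

```python
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class QueryService:
    """Toy asynchronous query service: submit, poll, cancel, fetch result."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}  # job id -> (future, cancel flag)

    def submit(self, chunks):
        """Start a query in the background and return a job id straight away."""
        job_id = uuid.uuid4().hex
        cancel = threading.Event()
        future = self._pool.submit(self._run, chunks, cancel)
        self._jobs[job_id] = (future, cancel)
        return job_id

    @staticmethod
    def _run(chunks, cancel):
        results = []
        for chunk in chunks:
            if cancel.is_set():
                return None  # terminated at an intermediate stage
            results.append(chunk * 2)  # stand-in for extracting one portion of data
        return results

    def status(self, job_id):
        future, _ = self._jobs[job_id]
        if not future.done():
            return "RUNNING"
        if future.exception() is not None:
            return "FAILED"
        return "CANCELLED" if future.result() is None else "DONE"

    def cancel(self, job_id):
        """Ask the job to stop before it processes its next chunk."""
        self._jobs[job_id][1].set()

    def result(self, job_id, timeout=None):
        return self._jobs[job_id][0].result(timeout=timeout)


service = QueryService()
job = service.submit(range(5))
while service.status(job) == "RUNNING":  # client-side polling loop
    time.sleep(0.01)
print(service.status(job), service.result(job))
```

Cancellation here is cooperative: the worker checks the flag between chunks, so a job stops at the next chunk boundary, not instantly.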
GOLD
• Develop an infrastructure to facilitate collaboration within virtual organisations (VOs).
• Data storage services will be used for capturing interactions amongst parties of a VO in order to facilitate auditing and VO-playback.
• Data analysis services will be used for performing particular types of analysis of data existing mostly in relational database back-ends.
• The primary concern is managing security policies and service access rights of different types of user dynamically.
NTRAC
• Build platforms to bring different systems together.
• Many of the data resources that they are accessing are stored in private networks (e.g. NHS patient information) with no open gateway to the public.
• Researchers want to mine the data to find people to recruit into studies.
Public/Internal: Internal.
Data Movement: Not a major concern; even if cross-site, people are not normally interested in the raw data.
Data Replication: Mainly for backup purposes, not load-balancing.
Data Discovery: The Scottish Executive and NCRN will be running a registry of trials.
Transactions: -
Security: Important because of private and commercial nature of some data.
Reliability: There are usually sporadic queries made against a dataset that does not change very quickly. NTRAC just recruits patients into the process so they are not concerned with reliability at the moment, but when getting involved in patient care, this is most important.
Scalability: -
Prioritised Requirements
ID: Requirement [Priority]
R1: Efficient transportation of large quantities of data between heterogeneous data resources. [High]
R2: Data federation and distributed query processing across heterogeneous data resources. [High]
R3: An asynchronous model for processing large, long-running queries where the client can poll or be notified of the query status and the query can be terminated at an intermediate stage. [High]
R4: The ability to provide different views of data resources to different users in a secure, DBMS-independent manner and to manage these views dynamically. [High]
R5: Security/certificate delegation to allow access to other networks and role-based data access rules. [High]
R6: Provision of more extensive database metadata capabilities, in particular with the inclusion of statistics relevant to query optimisation such as table size, occurrence of indexes and performance rates. [Medium]
R7: Support for a unified query language (RDBMS-neutral), possibly through integration with Hibernate. [Medium]
R8: Extensible join criteria for data integration, including support for spatial joins. [Medium]
R9: The ability to limit the size of updates to data resources. For example, the size of temporary tables created during a SkyQuery-style [Ref] distributed query. [Medium]
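Requirement R9 can be illustrated by a guard that caps how many rows a single query may write into a temporary table and undoes the whole write if the cap is exceeded. This is a sketch under assumptions: `insert_with_limit`, `UpdateLimitExceeded` and the table name are hypothetical, SQLite stands in for whatever DBMS is actually used, and a real deployment might enforce the limit in bytes or through DBMS quotas.

```python
import sqlite3


class UpdateLimitExceeded(Exception):
    """Raised when a query tries to write more rows than it is allowed to."""


def insert_with_limit(conn, table, rows, max_rows):
    """Insert rows into `table`; roll everything back if more than max_rows arrive.

    `table` must be a trusted identifier, since it is interpolated into the SQL.
    """
    written = 0
    try:
        for row in rows:
            if written >= max_rows:
                raise UpdateLimitExceeded(f"limit of {max_rows} rows exceeded on {table}")
            placeholders = ", ".join("?" * len(row))
            conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", row)
            written += 1
        conn.commit()
    except Exception:
        conn.rollback()  # leave the table exactly as it was before this call
        raise
    return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMP TABLE partial_matches (object_id TEXT, score REAL)")

try:
    insert_with_limit(conn, "partial_matches",
                      [("a", 0.9), ("b", 0.8), ("c", 0.7)], max_rows=2)
except UpdateLimitExceeded as exc:
    print("rejected:", exc)

count = conn.execute("SELECT COUNT(*) FROM partial_matches").fetchone()[0]
print(count)  # 0, because the oversized write was rolled back
```

Rolling back the whole write, instead of keeping the first `max_rows` rows, avoids leaving a truncated temporary table that a later stage of the distributed query might mistake for a complete result.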
Notes on requirements
• Prioritised based on a judgement of their importance to the various projects that were investigated.
– Whether or not they are within the scope of the OGSA-DAI project, or have already been satisfied by OGSA-DAI, is not considered here.
• Frequent mention of the non-functional requirement: ease of use.
– Some concern that installation and configuration remain too complex when compared with typical WAR-based web service deployment.
• Hope to publish the full document in the near future
– let me know if you want a copy
Conclusions
• Efficient transportation of large quantities of data between heterogeneous data resources is a crucial requirement for several projects from distinct domains.
– This is also an implicit requirement for projects requiring data federation and distributed query processing.
– If we could solve this problem, it would be of great benefit to these projects, and also to higher-level middleware projects such as OGSA-DQP.
• Security remains a major concern because of the commercial and sensitive nature of much data.
– Want a generalised, role-based mechanism for exposing different views of data resources to different users, and managing these views dynamically.
– Is this outside the scope of data integration middleware?
• While we were previously aware of most of the requirements described in this document, associating them with actual projects can help with prioritisation.
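One way to picture the role-based views mentioned above is a registry that maps each role to the columns it may see and a row filter, applied outside the DBMS so that it does not depend on any one vendor's view mechanism. Everything in this sketch is hypothetical: the role names, the column names, the sample rows and the `query_as` function are invented for illustration and do not describe an OGSA-DAI feature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class View:
    """What one role may see: a set of columns plus a row-level filter."""
    columns: tuple
    row_filter: Callable[[dict], bool]


# The registry is ordinary mutable state, so views can be added or changed at run time.
VIEWS = {
    "clinician": View(("patient_id", "treatment", "response"), lambda r: True),
    "partner": View(("treatment", "response"), lambda r: r["consented_commercial"]),
}


def query_as(role, rows):
    """Return only the rows and columns that the role's view exposes."""
    view = VIEWS.get(role)
    if view is None:
        raise PermissionError(f"no view defined for role {role!r}")
    return [{c: r[c] for c in view.columns} for r in rows if view.row_filter(r)]


# Invented sample data.
DATA = [
    {"patient_id": "P-17", "treatment": "B", "response": "good", "consented_commercial": True},
    {"patient_id": "P-18", "treatment": "A", "response": "poor", "consented_commercial": False},
]

print(query_as("partner", DATA))  # [{'treatment': 'B', 'response': 'good'}]

# Managing views dynamically: grant a new role access without touching the data.
VIEWS["auditor"] = View(("patient_id",), lambda r: True)
print(query_as("auditor", DATA))
```

Whether such filtering belongs in the data integration middleware or in the underlying databases is exactly the open question the slide raises.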
Further information
• The OGSA-DAI Project Site:
– http://www.ogsadai.org.uk
• The DAIS-WG site:
– http://forge.gridforum.org/projects/dais-wg/
• OGSA-DAI Users Mailing list
– users@ogsadai.org.uk
– General discussion on grid DAI matters
• Formal support for OGSA-DAI releases
– http://bugs.ogsadai.org.uk/
• OGSA-DAI training courses