Design of a Grid Enabled Database System to Facilitate Reuse, Provenance Tracking and Automated Processing of Chemical Information Robert Gledhill University of Southampton

Design of a Grid Enabled Database System to Facilitate Reuse, Provenance Tracking and Automated Processing of Chemical Information

Robert Gledhill
University of Southampton

He who would change the world should first change himself

• We are building a system to automate the management of some of our data and compute resources, and provide an interface to allow people we choose, either inside or outside the university, to make use of these as we see fit

• We would also like to provide general web access to all the data we are legally entitled to

General Aims of Project

• To automate the calculation of molecular properties from experimental information

• To simplify the development of new property calculation algorithms

• To provide a storage mechanism for this information, along with the original structures and measurements

• To track the provenance of individual items of information

• To develop a system with both chemist-friendly and script-friendly front ends

What is the Data?

• Crystal structures from the NCS

• Crystal structures from elsewhere

• Experimentally measured physical properties from diverse databases, both public and private

• Properties derived from the experimental data by calculation

Who are the Users?

• NCS, as a test bed system

• Grad students working in computational chemistry, developing new ways of deriving unknown physical properties from known ones

• Organic chemists, who should benefit from the pooling of diverse sources of information

What Do We Want to Calculate?

• pKa values from QM calculations

• Electron densities, polarisabilities, etc. from QM calculations

• Diffusion constants, RDFs, etc. from MC

• Binding affinities to proteins

• QSAR properties

• Statistically calculated solubilities

What Type of User Interfaces are Needed?

• A user friendly one!

• Many of the users are anticipated to have a straight chemistry background

• For those users with a higher degree of computer sophistication, a WSDL API will make scripting their jobs easier

• All interaction between the system and its users goes through a single chokepoint: the web server

What Hardware Do We Have at the Moment?

• A dual Xeon server machine

• A RAID array, currently with about a terabyte of space, but easily expandable

• A spare machine to use as an internal firewall

• A cluster of Linux machines, dedicated to running calculations and under our control

• A number of other machines dotted around the department that have particular single-seat licensed software installed

Security: What are We Very Worried About?

• An external user compromising the server and using it to attack other machines, either inside or outside the university firewall

Security: What are We Less Worried About?

• A remote user compromising the server and damaging the software system or the data stored on it – so long as any irreplaceable data is backed up, we just reboot, reinstall, patch the hole and continue

Security: The Firewall

• Only one machine, the one running the web server, should be reachable from outside the university firewall

• If we assume that the morass of Perl/Python/etc. CGI scripts on this machine is inherently hard to secure, then the web server itself must be considered unsafe

• We need an internal firewall pointing towards the server machine, blocking most traffic out from it!

Security: Access Control

• Authentication is by means of CombeChem certificates

• Authorisation is controlled by the local system administrators

• No direct access to the database is allowed: everything goes through the WWW/WSDL interface – the server software is implicitly trusted not to break consistency

Architecture

• The firewall comes between the web server and the rest of the campus network

• The web server machine also runs the database (in the present design)

• An internal dispatcher machine connects to the web server to check for jobs that need doing or to provide the results from them

• The dispatcher machine communicates with other machines running calculation web services
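The polling arrangement in this slide could be sketched roughly as below. The point of the design is that the internal dispatcher opens every connection itself, so nothing on the exposed web server ever needs to reach inwards. This is only an illustration: `Job`, `WebServerQueue` and `Dispatcher` are hypothetical names, the queue is an in-memory stand-in for the job list on the web server, and the calculation services are plain Python callables rather than real web services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    job_id: int
    service: str                  # name of the calculation service to call
    molecule: str                 # identifier of the molecule to process
    result: Optional[str] = None  # filled in by the dispatcher


@dataclass
class WebServerQueue:
    """In-memory stand-in for the job list held on the web server."""
    pending: List[Job] = field(default_factory=list)
    finished: List[Job] = field(default_factory=list)

    def fetch_pending(self) -> List[Job]:
        # Hand over all waiting jobs and clear the pending list.
        jobs, self.pending = self.pending, []
        return jobs

    def submit_result(self, job: Job) -> None:
        self.finished.append(job)


class Dispatcher:
    """Runs inside the firewall and polls the web server for work."""

    def __init__(self, queue: WebServerQueue,
                 services: Dict[str, Callable[[str], str]]) -> None:
        self.queue = queue
        self.services = services

    def run_once(self) -> int:
        """Perform one polling cycle and return the number of jobs handled."""
        jobs = self.queue.fetch_pending()
        for job in jobs:
            calculate = self.services.get(job.service)
            if calculate is None:
                job.result = f"error: unknown service {job.service!r}"
            else:
                job.result = calculate(job.molecule)
            self.queue.submit_result(job)
        return len(jobs)
```

In a real deployment `run_once` would be called on a timer, and `fetch_pending` and `submit_result` would be requests made from the dispatcher to the web server.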

Web Services: What?

• Take a piece of code that calculates some useful chemical information

• Write a wrapper around this that provides an API in a standardised format

• Add authentication/authorisation checking to the wrapper

• Add the appropriate hooks into the dispatcher and database to interface with this
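As a rough sketch of the wrapping steps above, the fragment below takes an existing calculation function, puts a uniform call signature around it and checks authorisation before running it. Everything here is assumed for illustration: `wrap_calculation`, `NotAuthorised`, the permissions table and the toy calculation are hypothetical, and a real service would check certificates rather than user names.

```python
from typing import Callable, Dict, Set


class NotAuthorised(Exception):
    """Raised when a user asks for a calculation they may not run."""


def wrap_calculation(name: str,
                     func: Callable[[str], float],
                     permissions: Dict[str, Set[str]]) -> Callable[[str, str], dict]:
    """Return a service with a standard signature and an authorisation check.

    `permissions` maps a user name to the set of service names that user
    is allowed to call (a stand-in for the administrators' records).
    """
    def service(user: str, molecule: str) -> dict:
        if name not in permissions.get(user, set()):
            raise NotAuthorised(f"{user} may not run {name}")
        # Every wrapped calculation returns the same result structure.
        return {"service": name, "molecule": molecule, "value": func(molecule)}

    return service


def toy_property(molecule: str) -> float:
    """Placeholder calculation: the length of the identifier string."""
    return float(len(molecule))


permissions = {"alice": {"toy_property"}}
toy_service = wrap_calculation("toy_property", toy_property, permissions)
```

Because every wrapped calculation returns the same dictionary shape, the dispatcher and database hooks only need to be written once.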

Web Services: Why?

• Now a user of the website with the correct authorisation can ask for the newly wrapped calculation to be performed on a selection of molecules, and the generated information to be inserted into the database (along with metadata noting who asked for the calculation to be done, when, what program version, etc.)

• The web service wrapping should streamline and simplify this sort of task

Database: Requirements

• Store information of many different data types (e.g. boiling point, 3d structure)

• Cope with multiple units (e.g. Celsius, Kelvin)

• Cope with conditions (e.g. boiling point at 1 atm pressure)

• Cope with multiple forms of a molecule (e.g. stereoisomers)

• Cope with degenerate datasets (e.g. 5 different measurements of the melting point, along with values calculated by 9 different versions of a particular algorithm)

• Retain information about the provenance of dataset items
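One way to picture these requirements together is a record type that carries its own unit, conditions and provenance, with several records allowed for the same molecule and property. The sketch below is an assumption made for illustration, not the project's schema: the class and field names are hypothetical, and the second stored value is a made-up number standing in for a calculated result.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Provenance:
    source: str        # "experiment" or the name of a program
    version: str       # program version; empty for experiments
    requested_by: str  # who entered or asked for the value


@dataclass
class PropertyValue:
    molecule_id: str
    name: str                     # e.g. "boiling_point"
    value: float
    unit: str                     # "K" or "degC" in this sketch
    provenance: Provenance
    conditions: Dict[str, str] = field(default_factory=dict)


def to_kelvin(item: PropertyValue) -> float:
    """Convert a temperature record to kelvin."""
    if item.unit == "K":
        return item.value
    if item.unit == "degC":
        return item.value + 273.15
    raise ValueError(f"unsupported temperature unit: {item.unit}")


def values_for(store: List[PropertyValue], molecule_id: str,
               name: str) -> List[PropertyValue]:
    """Return every stored value: degenerate data are kept, not merged."""
    return [v for v in store if v.molecule_id == molecule_id and v.name == name]


store = [
    PropertyValue("water", "boiling_point", 100.0, "degC",
                  Provenance("experiment", "", "alice"),
                  {"pressure": "1 atm"}),
    # Illustrative calculated value, not a real result.
    PropertyValue("water", "boiling_point", 372.9, "K",
                  Provenance("hypothetical_model", "0.2", "bob"),
                  {"pressure": "1 atm"}),
]
```

Keeping each measurement and each algorithm version as its own record is what lets provenance survive; picking a preferred value becomes a query-time decision.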

Database: Precedents

• The most common type of database is the ‘relational’ scheme, where data is thought of as being stored in tables

• A database which deals with most of our requirements (degeneracy, in particular) is DTHERM, a private store of thermodynamic data on organic molecules

DTHERM

• DTHERM is a monument to what can be achieved with the relational database model

• It has many, many tables, and is very, very complicated

• Many tables have no single primary key, but require subsearches to achieve halfway reasonable speeds

• If we choose to go down the SQL route, the properties database will likely end up looking like DTHERM

A Saner Path?

• An alternative to the straight relational model was drawn to our attention: Triplestore

• This is a database whose structure is described not by tables, but by ‘subject’, ‘predicate’, ‘object’ triples

• Effectively, one creates a graph of relationships between entities, and searches by specifying subgraphs of this
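A toy version of the idea may make it more concrete. The sketch below keeps triples in a Python set and answers a small subgraph query by chaining two pattern matches. It is not the triplestore software being evaluated; `TripleStore`, the predicate names and the identifiers are all invented for the example.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Minimal in-memory store of subject/predicate/object triples."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.triples.add((subject, predicate, obj))

    def match(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


def molecules_with_property(store: TripleStore, prop: str) -> List[str]:
    """Subgraph query: ?mol hasMeasurement ?m . ?m property <prop>"""
    measurements = {s for s, _, _ in store.match(predicate="property", obj=prop)}
    return sorted({s for s, _, o in store.match(predicate="hasMeasurement")
                   if o in measurements})


store = TripleStore()
store.add("mol:1", "hasMeasurement", "meas:1")
store.add("meas:1", "property", "melting_point")
store.add("mol:2", "hasMeasurement", "meas:2")
store.add("meas:2", "property", "boiling_point")
```

Adding a new kind of fact (a unit, a condition, a provenance note) is just another triple attached to `meas:1`, with no table to redesign.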

Triplestore

• We are currently experimenting with this form of database

• The description of the database and its queries, while strange and new, seems more straightforward than something like DTHERM

• The impression created is one of working with the database, in contrast to the impression given by DTHERM, whose designers seem to have been fighting the relational model every step of the way

A Primary Key

• We would like a single identifier for a given molecular structure

• We have been working with InChI codes to do this

• We have a command line Linux application to generate these

• Some sort of substructure searching would be nice for this

CIFs

• We are going to store CIFs more or less as-is

• We will then extract (to begin with) just those pieces we are most interested in

• These will be inserted into the database, with the original CIF file still available for those interested in the extra data contained in it
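The store-then-extract approach might look something like the sketch below, which keeps the raw CIF text and pulls out a few chosen data items alongside it. It is deliberately naive and assumes only simple one-line `_name value` entries; `loop_` blocks and multi-line text fields are ignored. The function name, the list of wanted items and the sample file (including its cell lengths) are made up for illustration.

```python
from typing import Dict, Iterable

# Data items to pull out first; the selection is an assumption.
WANTED = ("_chemical_formula_sum", "_cell_length_a",
          "_cell_length_b", "_cell_length_c")

# Illustrative fragment only; the numbers are not from a real structure.
SAMPLE_CIF = """\
data_example
_chemical_formula_sum 'C2 H6 O'
_cell_length_a 5.000(1)
_cell_length_b 6.000(1)
_exptl_crystal_colour colourless
"""


def extract_items(cif_text: str,
                  wanted: Iterable[str] = WANTED) -> Dict[str, str]:
    """Collect simple single-line data items whose names are in `wanted`."""
    wanted_names = set(wanted)
    found: Dict[str, str] = {}
    for line in cif_text.splitlines():
        parts = line.strip().split(None, 1)  # data name, then rest of line
        if len(parts) == 2 and parts[0] in wanted_names:
            # Drop surrounding quotes; keep values as text so that
            # uncertainties such as "5.000(1)" are not lost.
            found[parts[0]] = parts[1].strip().strip("'\"")
    return found


# The original file is stored untouched next to the extracted items.
record = {"raw": SAMPLE_CIF, "extracted": extract_items(SAMPLE_CIF)}
```

Anything not in `WANTED`, such as the crystal colour here, stays available in `record["raw"]` for later extraction.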

A Project in Motion

• We aim to have a working system by the first quarter of next year

Thank You

• Jeremy Frey

• Jonathan Essex

• Mike Hursthouse

• Simon Coles

• Everyone from ITI

• Steve Harris

• Keiron and Jamie

• You, the audience