overview and motivation of the icat software suite kerstin kleese van dam

Post on 17-Jan-2016

221 Views

Category:

Documents

5 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Overview and Motivation of the ICAT

Software SuiteKerstin Kleese van Dam

Science and Technology Facilities CouncilSTFC employ more than 2200 staff

who are deployed at 7 locations, these are: Swindon where the

headquarter is based, the Rutherford Appleton Laboratory, the Daresbury Laboratory, the Chilbolton Observatory, the UK Astronomy Technology Centre in Edinburgh, the Isaac Newton Group of Telescopes on La Palma; and the Joint Astronomy Centre in Hawaii.

Research and Science Support at STFC

Deliver world class science Engender world class science

Communicate world class science

Annually over 15000 visiting Scientists from around the world from both Academia and Industry.

Why an Integrated e-Infrastructure is required

HPC

HPCAnalysis

Storage

Storage

Analysis

Experiment

ExperimentComputing

HPC

Scientist

What STFC aim to achieve with their e-Infrastructure

Enabling users to get rapid access to their current and past data, related experiments, publications etc., leading to improved analysis through more complete information.

Creating a powerful, long lasting scientific knowledge resource.

Integrated e-Infrastructure

Proposal

Metadata Catalogue

Information

Experiment

Data Acquisition

System

Secure Storage

Data

Analysis

Publication

E-PubsProposal System

All Data and Metadata Capture is automated.

e-Infrastructure – Access to Multiple

Facilities(2)

Data Portal

SNS - ORNL

ISIS – TS1 + 2

DLS

CLF

CSL - Canada

SRS + ERLP

How we achieve the integration

HPC

HPCAnalysis

Storage

Storage

Analysis

Experiment

ExperimentComputing

HPC

Metadata

Scientist

ICAT Software Suite

The ICAT software suite centrally catalogues all experiment related information and extracted key results. Where ever possible information is gathered automatically trough integration with existing IT systems such as proposal systems or data acquisition.The catalogue and the data it references are accessible via a well defined API for easy embedding into any applications.

Distributed Data

Metadata Catalogue

Generic Catalogue Access Interface

Data Access and Analysis Applications

Underlying Data Infrastructure

Online Proposal System

User Office System incl.:

User Database

Scheduling

Health and Safety

Proposal Management

Metadata Catalogue

Data Acquisition System

Storage Management

System

DataAccessPortal

Single Sign On Account Creation and Management

ICAT Software Suite, providing the crucial integration of key functions.

The online proposal system is the entrance point to the Data Management System, and is a rich source of contextual information about the users experiment.

ICAT and the STFC Proposal Systems

ICAT and STFC Data Acquisition

Plug-ins for the data acquisition system ensure automatic, quality controlled collection of data and metadata. ICAT can be easily linked to any existing system.•ISIS :

- SECI (C#, .net) with link to LabView and openGenie•DLS :

- Generic Data Acquisition (Java, on top of EPICS) •CLF :

- For Laser Diagnostics, (LabView)

ICAT and DLS Storage Management

DLS uses the Storage Resource Broker for its Storage Management, this has been integrated with ICAT for data access and delivery.

Main advantage : •Decoupling physical file location from the logical one.•Strict Security•Expandable to many storage systems

SRB MCATDatabase

SRB MCATServer

SRB ADSServer

SRBClient

SRB DiskServer (Local Server)

Atlas Data Store

SRB MCATDatabase

SRB MCATServer

SRB ADSServer

SRBClientSRBClient

SRB DiskServer (Local Server)

SRB DiskServer (Local Server)

Atlas Data StoreAtlas Data Store

ICAT and ISIS Storage Management

ISIS uses their own in house developed data storage access system called Data.ISIS.

Similar to SRB it abstracts from the physical location of the files and delivers the same advantageous in terms of decoupling of logical and physical location of files and security.

ICAT Architecture

Online Proposal System

User Office System incl.:

User Database

Scheduling

Health and Safety

Proposal Management

Metadata Catalogue

Data Acquisition System

Storage Management

System

DataAccessPortal

Single Sign On Account Creation and Management

ICAT Software Suite, providing the crucial integration of key functions.

ICAT 3.3 Aims and Objectives

• ICAT API Version 3.3 aims to be the Grid aware software infrastructure that enables

applications to exploit the capabilities of the ICAT catalogue.

• Data Portal Version 3.3 aims to be the Grid aware software infrastructure that serves the Data Search and Retrieval (DSR) requirements of the STFC. It makes use of the ICAT API 3.3 .

Overall Architecture Principles

• The ICAT software suite has a modular design with clear functional boundaries for each component.• Core functionalities have been grouped together, customisable presentation layers are separated from the function layer to achieve easy maintenance, easy customisation, insulation from changes to underlying areas.• All interaction with the ICAT catalogue are now through the ICAT API.

Core Scientific Metadata

Model

(CSMD)

Rich Data at STFC

Scientific Data of the highest Quality is produced at STFC Facilities and

Departments.The continuity and longevity of STFC has

led to a unique wealth of Information.

How about a system that would give access to all of it independent of where it was produced?

Model Motivation (1)

Most Scientists think in terms of Studies during which they perform a number of investigations e.g. experiments,

observations, measurements and simulations. Results from these investigations usually run through

different stages: raw data, analysed or derived data and end results. Data should be grouped accordingly.

Metadata and Software (e.g. STFC DataPortal) should allow the user to search for interesting data.

Not all information captured in specific metadata schemas e.g. CML, would be used to search for this data or

distinguish one data set from another, give possibility to select special parameter.

Model Motivation (2)

A common general format/standard for Scientific Studies and data holdings metadata did not exist

By proposing Model and Implementation:– Form a specification for the types of metadata

studies should capture during Scientific Studies– Ease citation, collaboration, exploitation and

Integration– Allow easy Integration of distributed

heterogeneous metadata systems into a homogeneous (albeit virtual) Platform

Therefore – The Common Scientific Metadata Model (CSMD) developed.

General Layout

Why – i.e. what was the needWhat is it – description – support keyword searches and taxonmic approaches – data

organisation like a file systems but support linking to a database also

Where is it used – project & softwareWhat are the users likely to search on

What distinguishes one study/investigation/data set from the next

Metadata Model Structure

The Common Scientific metadata model (CSMDM) is a study-data set orientated model holding study information about:

– Topic Indexing– Provenance– Data Holding– Legal notes

• Copyright, patents and conditions of use etc relating to the study and the data in the study

– Related Material• Publications, Community

information and related links– (Access Conditions)

Metadata Granule

Topic

Study

Access Conditions

Related Material

Legal Note

Data Holding

Investigation 1 M

11

Atomic Data Object

Data Collection

11

1

M

M

M

M

1

Model Breakdown: Provenance

The Study contains the following metadata:

– The Study Name– The Study Institution– The Investigator– Extended Study Information

• Abstract• Funding • Start and End times

– Investigations

Investigations

A Study can have more than one investigation; possible

enumerations are experiment, simulation,

measurements etc. – investigations contain:

– Name– Investigation Type– Abstract– Resource– Link to DataHolding

Topic (for indexing)

Keywords– Discipline (i.e. domain)– Keyword Source (e.g.

domain dictionary)– Keyword

Subjects

– Discipline– Subject Source (e.g.

domain taxonomy)– Subject

Access Condition & Related Material

Access Conditions

– Contains a list of users or groups who are allowed access to the metadata and data, or a pointer to an access control system which contains such data for this study

Related Material

– One or many links and or textual descriptions of material related to this study e.g. earlier studies or parallel studies

Data

Data Description holds a logical description of the

Study’s data:

– Data Name– Type of Data– Status– Data Topic– Parameters– Related Data Ref– Relation type (e.g.

derived)

Data Location contains the link between logical name

(e.g. URI's) and physical URLs

– Data Name– Locator(s) (In the

case of Atomic Data Objects these can refer to files as well as named Selects on a database – i.e. virtual data objects)

More on Parameters

Parameters contain a lot of information about the atomic data objects (ADO) and collections

A collection/ADO can have many parameter entries, each parameter entry contains:

Parameter derivation (e.g. measured/fixed)

– The value– The units– Range – Error margin

Parameter aggregation is also supported

Cardinality Issues

The model recommends a certain cardinality of elements

Certain metadata components are necessary for one to have an instance of the implemented model – treating everything as

optional is not acceptableIt is though implementations may modify

this more to their needs – model attempts to remain ideal (i.e. most common Cardinality)

Enumeration Issues

Enumerations (or controlled vocabularies) e.g. types of investigator, types of institutions; these are distinct from the model e.g. as

taxonomies are.However they are necessary for the model to

work so implementations e.g. STFC DataPortal implementation of the model propose some

enumerations for common thingsRecognised and relevant controlled vocabularies are hoped to be used by implementations where

they are available

Conformance Level

For a complete metadata study-dataset record a large amount of metadata has to

be stored/processedSo it’s useful to have conformance levels

Model uses 5 levelsEach level specifies more metadata (and

Indexing information) should be held

Level 1

Type of Information captured:

– Study and Investigation metadata with indexing at the Study level

Level 1 metadata is similar to library/publication style metadata (e.g.

DublinCore)

Level 2

Type of Information captured:

– Level 1 + DataHolding metadata (i.e. DataSets and DataObjects)

Level 3

Type of Information captured:

– Level 2 + related material, Access condition, indexing to data collection levels

Level 4

Type of Information captured:

– Level 3 + indexing to data object level and data object parameter information

Level 5

Type of Information captured:

– All metadata components are filled as L4 + funding, resources used, facilities used etc

Conformance Levels

L1 is similar to library/publication style metadata (e.g. DublinCore)

The current DataPortal uses somewhere between L4 and L5 –the new systems designed with CSMD

conforms to L4+ Benefit of conformance levels; the higher the level of conformance to the CSMD the richer the clients

that operate on the data can be– e.g. identifying datasets and atomic data

objects which link directly to keywords/taxonomies and not just studies

CSMD Used on DataPortal

Implementation used as Data Interface for

DataPortalSingle view of heterogeneous

systems/schemasActs as a stress test of the

model– Limitations feed into

Model Requirements– New requirements

feed back into implementation

ICAT Schema 3.3

Specifics of the ICAT 3.3 Schema

ICAT 3.3 Schema - Facility

ICAT 3.3 Schema - Study

ICAT 3.3 Schema – Study (2)

ICAT 3.3 Schema – Study (3)

Study Investigation

Study Status

ICAT 3.3 Schema - Investigation

ICAT 3.3 Schema - Instrument

ICAT 3.3 Schema - Shift

ICAT 3.3 Schema – Shift (2)

ICAT 3.3 Schema - Keywords

ICAT 3.3 Schema - Topic

ICAT 3.3 Schema – Topic (2)

Topic

Topic List

What is an Ontology?

•Ontologies are used to capture knowledge about a domain of interest.

• An ontology describes the concepts in the domain and the relationships that hold between those concepts.

Advantages of Ontologies

• Provide increased flexibility when representing frequently changing viewpoints of information.

• Alterations can be simply followed up in the model without having to alter the applications on which they are based.

• Allows a unified view of heterogeneous data sources.

• Remove conflicts and terminological uncertainties.

• Facilitate Moderated searches, optimisation of the search results.

Why Ontologies are a useful Solution?

•At present over 1,700,000 keywords describing experiments are housed in ISIS ICAT many of which are synonyms.

•These keywords are used to index experimental studies, however this is seen as a limited method as these free text keywords have no context, and are hard to map by non-experts to terms used by facilities in the same domain and harder still to those outside.

•The creation of ontologies at ISIS will aid in the mapping of concrete manifestations of familiar terms in one domain as well as related concepts in different domains.

•This will facilitate searching of data by category and grouping of data into keywords across studies.

•This could aid in the cross facility searching of related scientific data from the various scientific facilities housed at STFC e.g. CLF and DLS.

A Protégé-OWL Ontology

• Classes• Individuals• Properties

A class is a concept in the domain - a class of People - a class of Pets - a class of Countries

A class is a collection of elements with similar properties.

Instances of classes- America can be an instance of the class Country.

Gemma

Mathew

Fluffy

Italy

America

England

Fido

Class Person

Class Pet

Class Country

livesIn

hasSibling

hasPet

ISIS Facilities Ontology Hierarchy

Class ISISExperiment

Class DataFile

Class Year

wasConductedIn

hasInvestigator

Class Instrument

Class InvestigatorHRP00145.RAW

1986

Pete Jones

HRPD

Class CrystallographyGroupExperiment

hasUsedInstrument

HydraziniumClass InvestigationTitle

hasTitle

hasDataFileName

Protein Crystallography GroupExperiment

ISIS Facilities Ontology

Sample, Investigator and Experiment Ontologies

Sample Investigator Experiment

Ontology Maintainer

•A web application for graphically displaying current versions of an ontology

•Currently ontologies are built within Protégé, an editing environment

•Difficulty in showing constructed ontologies to other domain experts

•The OntoMaintainer allows users to visualize ontology and enter feedback on the classification and structure of the hierarchy

•Encourages collaboration between domain experts (scientists) and ontology builders by allowing members of the community to be involved in the development and maintenance of ontologies

Topic Mapping Tool

•Mapping Tool provides a way of linking proposal system data to the structure of the ontology.

•Data is mapped to the ontology structure according to a set of defined rules.

Proposal System Database

Ontology Mapping Rules

Mapping Tool

Object

Sample Detail

Chemical Formula Name SampleType

LiquidLiquidpoly{1,4-phenylene-[9,9-bis(4-phenoxy butylsulfonate)]fluorene-2,7-diyl} ; C12E5;

D2O

poly{9,9-bis[6-(N,N-trimethyl-ammonium)hexyl]fluorene-co-1,4-phenylene};C12E5;D2=

C37H52N2I2:C22H46=6;D2=

C37H30S2O8; C22H46O6;D2O

•Ontologies would help maximise the value of data collected at ISIS and other STFC facilities by improving the access, navigation and reuse of data.

•Ontologies would facilitate the mapping of terms across STFC facilities which will allow cross-facility searching e.g. external users will be able to search for all experiments carried out across STFC using a powder diffractometer (instrument) even if they do not know the local names of the specific instruments.

•The OntoMaintainer will facilitate the process of creating and maintaining ontologies by providing a means of getting feedback directly from domain experts

ICAT 3.3 Schema - Investigator

ICAT 3.3 Schema – Investigator (2)

Investigator

Facility User

ICAT 3.3 Schema - Sample

ICAT 3.3 Schema – Sample Parameter

ICAT 3.3 Schema – Dataset

ICAT 3.3 Schema – Dataset (2)

ICAT 3.3 Schema – Dataset (3)

ICAT 3.3 Schema – Dataset Status

ICAT 3.3 Schema – Dataset Type

ICAT 3.3 Schema – Dataset Parameter

ICAT 3.3 Schema – Data File

ICAT 3.3 Schema – Data File (1)

ICAT 3.3 Schema – Data File (2)

ICAT 3.3 Schema – Related Data Files

ICAT 3.3 Schema – Data File Parameter

ICAT 3.3 Schema – Authorisation

Other ICAT Related Schema

ICAT API Session Schema

There are 3 tables to the schema, the user, user_session and myproxy_servers:

USER – All users who have logged inUSER_SESSION – All user’s sessions on IcatMYPROXY_SERVERS -- configuration information about which server to logon to

ICAT Core Database Schema

ICAT Core Session

ICAT Core Event

ICAT Core User

ICAT Core DataBase

ICAT Core Database

The core ICAT catalogue is at STFC run on an Oracle 10G RAC clustered database server.

The system has been customised to make efficient use of the offered features of Oracle.

If required these could however be removed in the future.

ICAT API

ICAT Architecture

Online Proposal System

User Office System incl.:

User Database

Scheduling

Health and Safety

Proposal Management

Metadata Catalogue

Data Acquisition System

Storage Management

System

DataAccessPortal

Single Sign On Account Creation and Management

ICAT Software Suite, providing the crucial integration of key functions.

ICAT API Version 3.3 (1)

The ICAT API version 3.3 is the interface that any application should use to interact

with the core ICAT system catalogue. At present it is used by applications such as the ISIS XML ingest, the DLS Generic Data

Acquisition System, DLS DDH and the DataPortal. The API offers a wide range of web services for the easy interaction with

the ICAT core catalogue.

ICAT API Version 3.3 (2)

The ICAT API version 3.3 consists of three main components:

•Web Services offered to other applications•ICAT Catalogue Interactions•ICAT Catalogue Session Management

ICAT API Version 3.3 (3)

The ICAT API version 3.3 uses JPL and SQL to directly interact with the underlying oracle databases

The ICAT API version 3.3 has been written in Java using EJB3, JPA and JAX-WS

ICAT API Version 3.3 (4)

Web Services offered to other applications for the Search, List, Ingest, Delete, Modification of:

AuthenticationInvestigation, Datafile and Dataset InformationInvestigatorKeywordsPublicationSampleDownload

DataPortal

ICAT Architecture

Online Proposal System

User Office System incl.:

User Database

Scheduling

Health and Safety

Proposal Management

Metadata Catalogue

Data Acquisition System

Storage Management

System

DataAccessPortal

Single Sign On Account Creation and Management

ICAT Software Suite, providing the crucial integration of key functions.

DataPortal for ICAT Version 3.3

The DataPortal is a highly customisable web interface to interact with the ICAT version 3.3.There are at present two distinctive versions one for ISIS and one for DLS. Whereas the underlying functionality is the same the graphical representation and choice of used services varies.The DataPortal offers a number of search interfaces, the ability to explore investigations and download associated data.

Top Left Hand Menu

Bottom Left Hand Menu

Top Right Hand Menu

Session Expire

DLS DataPortal

Questions?

top related