TICER Summer School, Thursday 24th August 2006. Dave Berry & Malcolm Atkinson, National e-Science Centre, Edinburgh. www.nesc.ac.uk

Page 1

Ticer Summer School

Thursday 24th August 2006

Dave Berry & Malcolm Atkinson
National e-Science Centre, Edinburgh
www.nesc.ac.uk

Page 2

Digital Libraries, Grids & E-Science

What is E-Science?

What is Grid Computing?

Data Grids Requirements

Examples

Technologies

Data Virtualisation

The Open Grid Services Architecture

Challenges

Page 4

What is e-Science?

• Goal: to enable better research in all disciplines
• Method: Develop collaboration supported by advanced distributed computation
  – to generate, curate and analyse rich data resources
    • From experiments, observations, simulations & publications
    • Quality management, preservation and reliable evidence
  – to develop and explore models and simulations
    • Computation and data at all scales
    • Trustworthy, economic, timely and relevant results
  – to enable dynamic distributed collaboration
    • Facilitating collaboration with information and resource sharing
    • Security, trust, reliability, accountability, manageability and agility

Page 5

climateprediction.net and GENIE

• Largest climate model ensemble

• >45,000 users, >1,000,000 model years

[Figure: Response of Atlantic circulation to freshwater forcing]

Page 6

Courtesy of David Gavaghan & IB Team

Integrative Biology

Tackling two Grand Challenge research questions:

• What causes heart disease?
• How does a cancer form and grow?

Together these diseases cause 61% of all UK deaths

Building a powerful, fault-tolerant Grid infrastructure for biomedical science

Enabling biomedical researchers to use distributed resources such as high-performance computers, databases and visualisation tools to develop coupled multi-scale models of how these killer diseases develop.

Page 7

Biomedical Research Informatics Delivered by Grid Enabled Services (BRIDGES)

[Diagram labels: CFG Virtual Organisation; sites: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands; Private data (six instances); Publicly Curated Data: Ensembl, MGI, HUGO, OMIM, SWISS-PROT, RGD, …; Data Hub; SyntenyGrid Service; blast; Portal]

http://www.brc.dcs.gla.ac.uk/projects/bridges/

Page 8

eDiaMoND: Screening for Breast Cancer

[Diagram labels, centred on the eDiaMoND Grid:
• 1 Trust; Many Trusts; Collaborative Working; Audit capability; Epidemiology
• Other Modalities: MRI, PET, Ultrasound
• Better access to case information and digital tools
• Supplement mentoring with access to digital training cases and sharing of information across clinics
• Letters; Radiology reporting systems; Electronic Patient Records; Patients
• 2ndary Capture or FFD; Case Information; X-Rays and Case Information
• Screening: Digital Reading; SMF; CAD; Temporal Comparison; Case and Reading Information
• Assessment / Symptomatic: Biopsy; Case and Reading Information; Symptomatic/Assessment Information
• Training: Manage Training Cases; Perform Training
• SMF; CAD; 3D Images]

Provided by eDiamond project: Prof. Sir Mike Brady et al.

Page 9

E-Science Data Resources

• Curated databases
  – Public, institutional, group, personal

• Online journals and preprints

• Text mining and indexing services

• Raw storage (disk & tape)

• Replicated files

• Persistent archives

• Registries

• …

Page 10

EBank

Slide from Jeremy Frey

Page 11

Biomedical data – making connections

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg

Slide provided by Carole Goble: University of Manchester

Page 12

Using Workflows to Link Services

• Describe the steps in a Scripting Language
• Steps performed by Workflow Enactment Engine
• Many languages in use
  – Trade off: familiarity & availability
  – Trade off: detailed control versus abstraction
• Incrementally develop correct process
  – Sharable & Editable
  – Basis for scientific communication & validation
  – Valuable IPR asset
• Repetition is now easy
  – Parameterised explicitly & implicitly
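As a minimal sketch of the idea on this slide, the Python below treats a workflow as an ordered list of named steps and uses a tiny loop as the "enactment engine". The `enact` function, the step names and the sample data are all hypothetical illustrations, not taken from any of the workflow systems discussed in this talk.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step type: reads the shared context, returns new values to merge in.
Step = Callable[[Dict[str, object]], Dict[str, object]]


def enact(workflow: List[Tuple[str, Step]], params: Dict[str, object]) -> Dict[str, object]:
    """Run each named step in order, passing results forward via a shared context."""
    context: Dict[str, object] = dict(params)  # explicit parameters for this run
    log: List[str] = []
    for name, step in workflow:
        context.update(step(context))
        log.append(name)  # record which steps ran, as a crude provenance trail
    context["log"] = log
    return context


# Illustrative steps only; a real workflow would call remote services here.
def fetch(ctx):
    return {"sequence": "acatttctac" * ctx["copies"]}


def count_bases(ctx):
    seq = ctx["sequence"]
    return {"counts": {base: seq.count(base) for base in "acgt"}}


workflow = [("fetch", fetch), ("count_bases", count_bases)]

# Repetition amounts to re-running the same description with different parameters.
for copies in (1, 2):
    result = enact(workflow, {"copies": copies})
    print(copies, result["counts"])
```

Because the workflow is plain data (a list of named steps), it can be shared, edited and re-run, which is the property the slide emphasises.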

Page 13

Workflow Systems

Language | WF Enact. | Comments
Shell scripts | Shell + OS | Common but not often thought of as WF. Depend on context, e.g. NFS across all sites
Perl | Perl runtime | Popular in bioinformatics. Similar context dependence – distribution has to be coded
Java | JVM | Popular target because of JVM ubiquity – similar dependence – distribution has to be coded
BPEL | BPEL Enactment | OASIS standard for industry – coordinating use of multiple Web Services – low level detail – tools
Taverna | Scufl | EBI, OMII-UK & MyGrid http://taverna.sourceforge.net/index.php
VDT / Pegasus | Chimera & DAGman | High-level abstract formulation of workflows, automated mapping towards executable forms, cached result re-use
Kepler | Kepler | BIRN, GEON & SEEK http://kepler-project.org/

Page 14

Workflow example

Taverna in MyGrid http://www.mygrid.org.uk/
“allows the e-Scientist to describe and enact their experimental processes in a structured, repeatable and verifiable way”

[Diagram labels: GUI; Workflow language; Enactment engine]

Page 15

Pub/Sub for Laboratory data using a broker and ultimately delivered over GPRS

Notification

Comb-e-chem: Jeremy Frey
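The publish/subscribe pattern named on this slide can be sketched in a few lines of Python. This is a toy in-process broker under stated assumptions: the `Broker` class and the topic name are hypothetical, and nothing here reflects how the Comb-e-chem system or its GPRS delivery was actually built.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-process broker: routes each published message to a topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber; return how many were notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
received: List[str] = []
broker.subscribe("lab/temperature", received.append)  # hypothetical topic name
delivered = broker.publish("lab/temperature", "21.5 C")
print(delivered, received)
```

The publisher (an instrument) and the subscriber (a scientist's device) never refer to each other directly; the broker is the only shared point of contact.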

Page 16

Relevance to Digital Libraries

• Similar concerns
  – Data curation & management
  – Metadata, discovery
  – Secure access (AAA +)
  – Provenance & data quality
  – Local autonomy
  – Availability, resilience
• Common technology
  – Grid as an implementation technology

Page 18

What is a Grid?

A grid is a system consisting of
− Distributed but connected resources and
− Software and/or hardware that provides and manages logically seamless access to those resources to meet desired objectives

[Diagram labels: License; Printer; R2AD; Database; Webserver; Data Center; Cluster; Handheld; Supercomputer; Workstation; Server]

Source: Hiro Kishimoto GGF17 Keynote May 2006

Page 19

Virtualizing Resources

[Diagram labels: Resources (Computers, Storage, Sensors, Applications, Information); Resource-specific Interfaces; Web services; Access; Common Interfaces; Type-specific interfaces]

Hiro Kishimoto: Keynote GGF17

Page 20

Ideas and Forms

• Key ideas
  – Virtualised resources
  – Secure access
  – Local autonomy
• Many forms
  – Cycle stealing
  – Linked supercomputers
  – Distributed file systems
  – Federated databases
  – Commercial data centres
  – Utility computing

Page 21

Grid Middleware

[Diagram labels: Virtualized resources; Grid middleware services; Brokering Service; Registry Service; Data Service; CPU Resource; Printer Service; Job-Submit Service; Compute Service; Application Service; Notify; Advertise]

Hiro Kishimoto: Keynote GGF17

Page 22

Key Drivers for Grids

• Collaboration
  – Expertise is distributed
  – Resources (data, software licences) are location-specific
  – Necessary to achieve critical mass of effort
  – Necessary to raise sufficient resources
• Computational Power
  – Rapid growth in number of processors
  – Powered by Moore’s law + device roadmap
  – Challenge to transform models to exploit this
• Deluge of Data
  – Growth in scale: Number and Size of resources
  – Growth in complexity
  – Policy drives greater data availability

Page 23

Minimum Grid Functionalities

• Supports distributed computation
  – Data and computation
  – Over a variety of
    • Hardware components (servers, data stores, …)
    • Software components (services: resource managers, computation and data services)
  – With regularity that can be exploited
    • By applications
    • By other middleware & tools
    • By providers and operations
  – It will normally have security mechanisms
    • To develop and sustain trust regimes

Page 24

Grid & Related Paradigms

Utility Computing
• Computing “services”
• No knowledge of provider
• Enabled by grid technology

Distributed Computing
• Loosely coupled
• Heterogeneous
• Single Administration

Cluster
• Tightly coupled
• Homogeneous
• Cooperative working

Grid Computing
• Large scale
• Cross-organizational
• Geographical distribution
• Distributed Management

Source: Hiro Kishimoto GGF17 Keynote May 2006

Page 26

Why use / build Grids?

• Research Arguments
  – Enables new ways of working
  – New distributed & collaborative research
  – Unprecedented scale and resources
• Economic Arguments
  – Reduced system management costs
  – Shared resources → better utilisation
  – Pooled resources → increased capacity
  – Load sharing & utility computing
  – Cheaper disaster recovery

Page 27

Why use / build Grids?

• Operational Arguments
  – Enable autonomous organisations to
    • Write complementary software components
    • Set up, run & use complementary services
    • Share operational responsibility
    • General & consistent environment for Abstraction, Automation, Optimisation & Tools
• Political & Management Arguments
  – Stimulate innovation
  – Promote intra-organisation collaboration
  – Promote inter-enterprise collaboration

Page 28

Grids In Use: E-Science Examples

• Data sharing and integration
  − Life sciences, sharing standard data-sets, combining collaborative data-sets
  − Medical informatics, integrating hospital information systems for better care and better science
  − Sciences, high-energy physics
• Capability computing
  − Life sciences, molecular modeling, tomography
  − Engineering, materials science
  − Sciences, astronomy, physics
• High-throughput, capacity computing for
  − Life sciences: BLAST, CHARMM, drug screening
  − Engineering: aircraft design, materials, biomedical
  − Sciences: high-energy physics, economic modeling
• Simulation-based science and engineering
  − Earthquake simulation

Source: Hiro Kishimoto GGF17 Keynote May 2006

Page 30

Database Growth

PDB: 33,367 protein structures
EMBL DB: 111,416,302,701 nucleotides

Slide provided by Richard Baldock: MRC HGU Edinburgh

Page 31

Requirements: User’s viewpoint

• Find Data
  – Registries & Human communication
• Understand data
  – Metadata description, Standard / familiar formats & representations, Standard value systems & ontologies
• Data Access
  – Find how to interact with data resource
  – Obtain permission (authority)
  – Make connection
  – Make selection
• Move Data
  – In bulk or streamed (in increments)

Page 32

Requirements: User’s viewpoint 2

• Transform Data
  – To format, organisation & representation required for computation or integration
• Combine data
  – Standard database operations + operations relevant to the application model
• Present results
  – To humans: data movement + transform for viewing
  – To application code: data movement + transform to the required format
  – To standard analysis tools, e.g. R
  – To standard visualisation tools, e.g. Spitfire

Page 33

Requirements: Owner’s viewpoint

• Create Data
  – Automated generation, Accession Policies, Metadata generation
  – Storage Resources
• Preserve Data
  – Archiving
  – Replication
  – Metadata
  – Protection
• Provide Services with available resources
  – Definition & implementation: costs & stability
  – Resources: storage, compute & bandwidth

Page 34

Requirements: Owner’s viewpoint 2

• Protect Services
  – Authentication, Authorisation, Accounting, Audit
  – Reputation
• Protect data
  – Comply with owner requirements – encryption for privacy,
• Monitor and Control use
  – Detect and handle failures, attacks, misbehaving users
  – Plan for future loads and services
• Establish case for Continuation
  – Usage statistics
  – Discoveries enabled

Page 36

Large Hadron Collider

• The most powerful instrument ever built to investigate elementary particle physics
• Data Challenge:
  – 10 Petabytes/year of data
  – 20 million CDs each year!
• Simulation, reconstruction, analysis:
  – LHC data handling requires computing power equivalent to ~100,000 of today's fastest PC processors
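As a quick sanity check on the two data-volume figures above, the sketch below computes the per-disc capacity they imply. It assumes decimal units (1 PB = 10^15 bytes), which the slide does not state.

```python
# Assumption: decimal units; the slide does not say which convention it uses.
PETABYTE = 10**15
MEGABYTE = 10**6

annual_bytes = 10 * PETABYTE   # "10 Petabytes/year of data"
cds_per_year = 20_000_000      # "20 million CDs each year"

implied_cd_mb = annual_bytes / cds_per_year / MEGABYTE
print(f"Implied capacity per CD: {implied_cd_mb:.0f} MB")  # prints 500 MB
```

The result, 500 MB per disc, is of the same order as a typical CD's capacity (roughly 650 to 700 MB), so the two figures on the slide are consistent as a round-number comparison.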

Page 37

Composing Observations in Astronomy

Data and images courtesy Alex Szalay, Johns Hopkins

No. & sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
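To make the "doubling every 12 months" figure concrete, the sketch below projects the mid-2002 total forward. It assumes the doubling rate simply continues unchanged, which the slide does not claim; `projected_tb` is a hypothetical helper name.

```python
def projected_tb(start_tb: float, years: float, doubling_years: float = 1.0) -> float:
    """Exponential growth: the volume doubles once per doubling period."""
    return start_tb * 2 ** (years / doubling_years)


# Starting point from the slide: about 200 TB in mid-2002.
for year in range(2002, 2007):
    print(year, f"{projected_tb(200, year - 2002):.0f} TB")
```

Under that assumption the total would reach 3,200 TB (3.2 PB) by mid-2006, the year of this talk.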

Page 38

GODIVA Data Portal

• Grid for Ocean Diagnostics, Interactive Visualisation and Analysis

• Daily Met Office Marine Forecasts and gridded research datasets

• National Centre for Ocean Forecasting

• ~3Tb climate model datastore via Web Services

• Interactive Visualisations inc. Movies

• ~ 30 accesses a day worldwide

• Other GODIVA software produces 3D/4D Visualisations reading data remotely via Web Services

Online Movies

www.nerc-essc.ac.uk/godiva

Page 39

GODIVA Visualisations

• Unstructured Meshes

• Grid Rotation/Interpolation

• GeoSpatial Databases v. Files (Postgres, IBM, Oracle)

• Perspective 3D Visualisation

• Google maps viewer

Page 40

NERC Data Grid

• The DataGrid focuses on federation of NERC Data Centres

• Grid for data discovery, delivery and use across sites

• Data can be stored in many different ways (flat files, databases…)

• Strong focus on Metadata and Ontologies

• Clear separation between discovery and use of data.

• Prototype focussing on Atmospheric and Oceanographic data

www.ndg.nerc.ac.uk

Page 41

Global In-flight Engine Diagnostics

[Diagram labels: in-flight data; airline; maintenance centre; ground station; global network, e.g. SITA; internet, e-mail, pager; DS&S Engine Health Center; data centre]

Distributed Aircraft Maintenance Environment: Leeds, Oxford, Sheffield & York, Jim Austin

100,000 aircraft
0.5 GB/flight
4 flights/day
200 TB/day

Now BROADEN

Significant in getting Boeing 787 engine contract
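The daily data volume quoted on this slide follows directly from the other three figures. The short check below reproduces it, assuming decimal units (1 TB = 1,000 GB), which the slide does not state.

```python
aircraft = 100_000
gb_per_flight = 0.5
flights_per_day = 4
GB_PER_TB = 1000  # assumption: decimal units

daily_tb = aircraft * gb_per_flight * flights_per_day / GB_PER_TB
print(f"{daily_tb:.0f} TB/day")  # prints 200 TB/day
```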

Page 43

Storage Resource Manager (SRM)

• http://sdm.lbl.gov/srm-wg/
• de facto & written standard in physics, …
• Collaborative effort
  – CERN, FNAL, JLAB, LBNL and RAL
• Essential bulk file storage
  – (pre) allocation of storage
    • abstraction over storage systems
  – File delivery / registration / access
  – Data movement interfaces
    • E.g. GridFTP
• Rich function set
  – Space management, permissions, directory, data transfer & discovery

Page 44

Storage Resource Broker (SRB)

• http://www.sdsc.edu/srb/index.php/Main_Page
• SDSC developed
• Widely used
  – Archival document storage
  – Scientific data: bio-sciences, medicine, geo-sciences, …
• Manages
  – Storage resource allocation
    • abstraction over storage systems
  – File storage
  – Collections of files
  – Metadata describing files, collections, etc.
  – Data transfer services

Page 45

Condor Data Management

• Stork
  – Manages File Transfers
  – May manage reservations
• Nest
  – Manages Data Storage
  – C.f. GridFTP with reservations
    • Over multiple protocols

Page 46

Globus Tools and Services for Data Management

• GridFTP
  – A secure, robust, efficient data transfer protocol
• The Reliable File Transfer Service (RFT)
  – Web services-based, stores state about transfers
• The Data Access and Integration Service (OGSA-DAI)
  – Service to access data resources, particularly relational and XML databases
• The Replica Location Service (RLS)
  – Distributed registry that records locations of data copies
• The Data Replication Service
  – Web services-based, combines data replication and registration functionality

Slides from Ann Chervenak

Page 47

RLS in Production Use: LIGO

• Laser Interferometer Gravitational Wave Observatory
• Currently use RLS servers at 10 sites
  – Contain mappings from 6 million logical files to over 40 million physical replicas
• Used in customized data management system: the LIGO Lightweight Data Replicator System (LDR)
  – Includes RLS, GridFTP, custom metadata catalog, tools for storage management and data validation
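The core idea of a replica registry, mapping one logical file name to the locations of its physical copies, can be sketched with a dictionary of sets. This is an illustration only: the `ReplicaCatalog` class, the file name and the URLs are hypothetical and do not reproduce the real RLS interface.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ReplicaCatalog:
    """Toy registry mapping a logical file name to the locations of its copies."""

    def __init__(self) -> None:
        self._replicas: Dict[str, Set[str]] = defaultdict(set)

    def register(self, logical_name: str, physical_url: str) -> None:
        self._replicas[logical_name].add(physical_url)

    def lookup(self, logical_name: str) -> List[str]:
        # .get avoids creating an empty entry for unknown names
        return sorted(self._replicas.get(logical_name, set()))


catalog = ReplicaCatalog()
# Hypothetical logical name and site URLs, for illustration only.
catalog.register("frame-0001.gwf", "gsiftp://site-a.example.org/data/frame-0001.gwf")
catalog.register("frame-0001.gwf", "gsiftp://site-b.example.org/data/frame-0001.gwf")
print(catalog.lookup("frame-0001.gwf"))
```

With 6 million logical files and over 40 million replicas, the LIGO figures above correspond to an average of more than six copies per logical file.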

Slides from Ann Chervenak

Page 48

RLS in Production Use: ESG

• Earth System Grid: Climate modeling data (CCSM, PCM, IPCC)
• RLS at 4 sites
• Data management coordinated by ESG portal
• Datasets stored at NCAR
  – 64.41 TB in 397,253 total files
  – 1,230 portal users
• IPCC Data at LLNL
  – 26.50 TB in 59,300 files
  – 400 registered users
  – Data downloaded: 56.80 TB in 263,800 files
  – Avg. 300 GB downloaded/day
  – 200+ research papers being written

Slides from Ann Chervenak

Page 49

gLite Data Management

Source: 2nd EGEE Review, CERN - gLite Middleware Status (Enabling Grids for E-sciencE, INFSO-RI-508833)

• FTS
  – File Transfer Service
• LFC
  – Logical file catalogue
• Replication Service
  – Accessed through LFC
• AMGA
  – Metadata services

Page 50

Data Management Services

• FiReMan catalog
  – Resolves logical filenames (LFN) to physical location of files and storage elements
  – Oracle and MySQL versions available
  – Secure services
  – Attribute support
  – Symbolic link support
  – Deployed on the Pre-Production Service and DILIGENT testbed
• gLite I/O
  – Posix-like access to Grid files
  – Castor, dCache and DPM support
  – Has been used for the BioMedical Demo
  – Deployed on the Pre-Production Service and the DILIGENT testbed
• AMGA MetaData Catalog
  – Used by the LHCb experiment
  – Has been used for the BioMedical Demo

Medical Data Management

[Diagram labels: Client; Medical Data Management Application; MDM Client Library; Grid Catalogs; Metadata Catalog (AMGA); Medical Imager; Encryption Keystore (Hydra); File Catalog (Fireman); SRM DICOM; MDM Trigger; GridFTP; gLite I/O]

Trigger:

• Retrieve DICOM files from imager.

• Register file in Fireman

• gLite EDS client: Generate encryption keys and store them in Hydra

• Register Metadata in AMGA

Client Library:

• Lookup file through Metadata (AMGA)

• Use gLite EDS client:

• Retrieve file through gLite I/O

• Retrieve encryption Key from Hydra

• Decrypt data

• Serve it up to the application
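The Trigger and Client Library steps listed above can be sketched end to end in Python. Everything here is a stand-in under stated assumptions: plain dictionaries play the roles of the storage element, file catalog, keystore and metadata catalog, the function names are hypothetical, and the XOR "cipher" is a toy chosen only to keep the example self-contained. None of it reflects the real gLite, Fireman, Hydra or AMGA interfaces.

```python
import secrets
from typing import Dict

storage: Dict[str, bytes] = {}              # stand-in for the storage element
file_catalog: Dict[str, str] = {}           # logical name -> storage key (file catalog role)
keystore: Dict[str, bytes] = {}             # logical name -> key (keystore role)
metadata: Dict[str, Dict[str, str]] = {}    # logical name -> attributes (metadata role)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy cipher for illustration only: XOR with an equal-length random key."""
    return bytes(d ^ k for d, k in zip(data, key))


def trigger(lfn: str, image: bytes, attrs: Dict[str, str]) -> None:
    """Trigger side: store ciphertext, register the file, keep the key, record metadata."""
    key = secrets.token_bytes(len(image))
    storage_key = f"store/{lfn}"
    storage[storage_key] = xor_bytes(image, key)  # only ciphertext is stored
    file_catalog[lfn] = storage_key
    keystore[lfn] = key
    metadata[lfn] = attrs


def client_fetch(**query: str) -> bytes:
    """Client side: look up by metadata, retrieve the file and key, then decrypt."""
    lfn = next(name for name, attrs in metadata.items()
               if all(attrs.get(k) == v for k, v in query.items()))
    ciphertext = storage[file_catalog[lfn]]
    return xor_bytes(ciphertext, keystore[lfn])


trigger("scan-001", b"DICOM-bytes", {"modality": "MRI"})
print(client_fetch(modality="MRI"))
```

The design point the slide makes is the separation of concerns: the stored file, its key and its searchable metadata live in three different services, so compromising the storage alone does not expose patient data.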

Page 51

File Transfer Service

• Reliable file transfer
• Full scalable implementation
  – Java Web Service front-end, C++ Agents, Oracle or MySQL database support
  – Support for Channel, Site and VO management
  – Interfaces for management and statistics monitoring
• Gsiftp, SRM and SRM-copy support
• Support for MySQL and Oracle
• Multi-VO support
• GridFTP and SRM copy support

Page 52

Commercial Solutions

• Vendors include:
  – Avaki
  – Data Synapse
• Benefits & costs
  – Well packaged and documented
  – Support
  – Can be expensive
    • But look for academic rates

Page 54

Data Integration Strategies

• Use a Service provided by a Data Owner
• Use a scripted workflow
• Use data virtualisation services
  – Arrange that multiple data services have common properties
  – Arrange federations of these
  – Arrange access presenting the common properties
  – Expose the important differences
  – Support integration accommodating those differences

Page 55: TICER Summer School, August 24th 20061 Ticer Summer School Thursday 24 th August 2006 Dave Berry & Malcolm Atkinson National e-Science Centre, Edinburgh

TICER Summer School, August 24th 2006 55

Data Virtualisation Services

• Form a federation
  – Set of data resources – incremental addition
  – Registration & description of collected resources
  – Warehouse data or access dynamically to obtain updated data
  – Virtual data warehouses – automating division between collection and dynamic access
• Describe relevant relationships between data sources
  – Incremental description + refinement / correction
• Run jobs, queries & workflows against combined set of data resources
  – Automated distribution & transformation
• Example systems
  – IBM’s Information Integrator
  – GEON, BIRN & SEEK
  – OGSA-DAI is an extensible framework for building such systems
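The federation idea on this slide can be illustrated with a minimal Python sketch. It is not how OGSA-DAI or any of the example systems work; it only shows the pattern of registering described data resources incrementally and then running one query against the combined set. The Federation class, the make_resource helper, the table layout and the site names are all hypothetical, and in-memory SQLite databases stand in for remote data resources.

```python
import sqlite3


class Federation:
    """A toy registry of data resources that can be queried as one set."""

    def __init__(self):
        # name -> (connection, description); all names here are hypothetical
        self.resources = {}

    def register(self, name, connection, description):
        # Incremental addition: a resource can join the federation at any time.
        self.resources[name] = (connection, description)

    def query(self, sql, params=()):
        # Run the same query on every member and tag each row with its source,
        # so differences between members stay visible to the caller.
        rows = []
        for name in sorted(self.resources):
            connection, _description = self.resources[name]
            for row in connection.execute(sql, params):
                rows.append((name,) + tuple(row))
        return rows


def make_resource(samples):
    # An in-memory SQLite database stands in for a remote data resource.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE samples (id TEXT, depth_m REAL)")
    connection.executemany("INSERT INTO samples VALUES (?, ?)", samples)
    return connection


federation = Federation()
federation.register("site_a", make_resource([("a1", 10.0), ("a2", 55.0)]),
                    "hypothetical site A")
federation.register("site_b", make_resource([("b1", 70.0)]),
                    "hypothetical site B")

deep = federation.query(
    "SELECT id, depth_m FROM samples WHERE depth_m > ? ORDER BY id", (50.0,))
print(deep)
```

This sketch always accesses members dynamically. A warehousing variant would copy rows into a local store at registration time instead, trading freshness for speed.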


Virtualisation variations

• Extent to which homogeneity obtained
  – Regular representation choices – e.g. units
  – Consistent ontologies
  – Consistent data model
  – Consistent schema – integrated super-schema
  – DB operations supported across federation
  – Ease of adding federation elements
  – Ease of accommodating change as federation members change their schema and policies
  – Drill through to primary forms supported
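Two of the variations above, regular representation choices (units) and drill-through to primary forms, can be sketched in a few lines of Python. The member names, field names and the Reading type are assumptions made up for illustration. The point is only that each federation member gets a small adapter that maps its records into a common representation while keeping the original record reachable.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """Common representation used across the (hypothetical) federation."""
    source: str
    temperature_c: float   # agreed unit: degrees Celsius
    original: dict         # primary form, kept for drill-through


def from_member_a(record):
    # Hypothetical member A already reports Celsius in the field "temp_c".
    return Reading("member_a", float(record["temp_c"]), record)


def from_member_b(record):
    # Hypothetical member B reports Fahrenheit in the field "temp_f".
    celsius = (float(record["temp_f"]) - 32.0) * 5.0 / 9.0
    return Reading("member_b", celsius, record)


readings = [
    from_member_a({"temp_c": 21.5, "station": "A-7"}),
    from_member_b({"temp_f": 212, "operator": "hypothetical"}),
]

# Operations across the federation now compare like with like.
warmest = max(readings, key=lambda r: r.temperature_c)
print(warmest.source, warmest.temperature_c, warmest.original)
```

When a member changes its schema, only its adapter needs to change, which is one way to keep the cost of accommodating change low.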


OGSA-DAI

• http://www.ogsadai.org.uk
• A framework for data virtualisation
• Wide use in e-Science
  – BRIDGES, GEON, CaBiG, GeneGrid, MyGrid, BioSimGrid, e-Diamond, IU RGRBench, …
• Collaborative effort
  – NeSC, EPCC, IBM, Oracle, Manchester, Newcastle
• Querying of data resources
  – Relational databases
  – XML databases
  – Structured flat files
• Extensible activity documents
  – Customisation for particular applications


The Open Grid Services Architecture

• An open, service-oriented architecture (SOA)
  − Resources as first-class entities
  − Dynamic service/resource creation and destruction
• Built on a Web services infrastructure
• Resource virtualization at the core
• Build grids from small number of standards-based components
  − Replaceable, coarse-grained
  − e.g. brokers
• Customizable
  − Support for dynamic, domain-specific content…
  − …within the same standardized framework

Hiro Kishimoto: Keynote GGF17


OGSA Capabilities

Security
• Cross-organizational users
• Trust nobody
• Authorized access only

Information Services
• Registry
• Notification
• Logging/auditing

Execution Management
• Job description & submission
• Scheduling
• Resource provisioning

Data Services
• Common access facilities
• Efficient & reliable transport
• Replication services

Self-Management
• Self-configuration
• Self-optimization
• Self-healing

Resource Management
• Discovery
• Monitoring
• Control

OGSA

OGSA “profiles”

Web services foundation

Hiro Kishimoto: Keynote GGF17


Basic Data Interfaces

• Storage Management
  − e.g. Storage Resource Management (SRM)
• Data Access
  − ByteIO
  − Data Access & Integration (DAI)
• Data Transfer
  − Data Movement Interface Specification (DMIS)
  − Protocols (e.g. GridFTP)
• Replica management
• Metadata catalog
• Cache management

Hiro Kishimoto: Keynote GGF17


The State of the Art

• Many successful Grid & E-Science projects
  – A few examples shown in this talk
• Many Grid systems
  – All largely incompatible
  – Interoperation talks under way
• Standardisation efforts
  – Mainly via the Open Grid Forum
  – A merger of the GGF & EGA
• Significant user investment required
  – Few “out of the box” solutions


Technical Challenges

• Issues you can’t avoid
  – Lack of Complete Knowledge (LOCK)
  – Latency
  – Heterogeneity
  – Autonomy
  – Unreliability
  – Scalability
  – Change
• A Challenging goal
  – balance technical feasibility
  – against virtual homogeneity, stability and reliability
  – while remaining affordable, manageable and maintainable


Areas “In Development”

• Data provenance
• Quality of Service
  – Service Level Agreements
• Resource brokering
  – Across all resources
• Workflow scheduling
  – Co-scheduling
• Licence management
• Software provisioning
  – Deployment and update
• Other areas too!


Operational Challenges

• Management of distributed systems
  – With local autonomy
• Deployment, testing & monitoring
• User training
• User support
• Rollout of upgrades
• Security
  – Distributed identity management
  – Authorisation
  – Revocation
  – Incident response


Grids as a Foundation for Solutions

• The grid per se doesn’t provide
  – Supported e-Science methods
  – Supported data & information resources
  – Computations
  – Convenient access
• Grids help providers of these, via
  – International & national secure e-Infrastructure
  – Standards for interoperation
  – Standard APIs to promote re-use
• But Research Support must be built
  – Application developers
  – Resource providers


Collaboration Challenges

• Defining common goals
• Defining common formats
  – E.g. schemas for data and metadata
• Defining a common vocabulary
  – E.g. for metadata
• Finding common technology
  – Standards should help, eventually
• Collecting metadata
  – Automate where possible


Social Challenges

• Changing cultures
  – Rewarding data & resource sharing
  – Require publication of data
• Taking the first steps
  – If everyone shares, everyone wins
  – The first people to share must not lose out
• Sustainable funding
  – Technology must persist
  – Data must persist


Summary

• E-Science exploits distributed computing resources to enable new discoveries, new collaborations and new ways of working

• Grid is an enabling technology for e-Science

• Many successful projects exist

• Many challenges remain


UK e-Science

[Figure: UK e-Science. Labels: Globus Alliance; CeSC (Cambridge); Digital Curation Centre; e-Science Institute; EGEE, ChinaGrid; Grid Operations Support Centre; National Centre for e-Social Science; National Institute for Environmental e-Science; Open Middleware Infrastructure Institute]
