
Page 1: Storage and Data

Storage and Data

Grid Middleware 6

David Groep, lecture series 2005-2006

Page 2: Storage and Data

Grid Middleware VI 2

Outline

Data management concepts: metadata, logical filename, SURL, TURL, object store

Protocols: GridFTP, SRM; RFT/FTS, FPS & scheduled transfers with GT4 (LIGO)

End-to-end integrated systems: SRB

Structured data and databases: OGSA-DAI

Data curation issues: media migration, content conversion (emulation or translation?)

Page 3: Storage and Data

Grid Middleware VI 3

Grid data management

Data in a grid need to be located, replicated, lifetime-managed, and accessed (sequentially and at random)

and the user does not know where the data is

Page 4: Storage and Data

Grid Middleware VI 4

Types of storage

‘File-oriented’ storage: cannot support content-based queries, so it needs annotation & metadata to be useful

(note that a file system and name is a ‘type of meta-data’); most implementations can handle objects of any size

(but MSS tape systems cannot handle very small files)

Databases: structured data representation; support content queries well via indexed searches; good for small data objects

(with BLOBs of MBytes, not GBytes)

Page 5: Storage and Data

Grid storage structure

For file-oriented storage

Page 6: Storage and Data

Grid Middleware VI 6

File storage layers (file system analogy)

Separating the storage concepts helps with both interoperation and scalability.

1. Semantic view: description of the data in words and phrases

2. Meta-data view: describes the data by attribute-value pairs (the filename is also an A-V pair); like file systems with ‘extended attributes’ (HPFS, EXT2+, AppleFS)

3. Object view: refers to a blob of data by a meaningless handle (unique ID), e.g. the inode in typical Unix file systems; FAT (directory entry + allocation table) mixes the filename and object views

4. Physical view: block devices, i.e. a series of blocks on a disk, or a specific tape & offset

Page 7: Storage and Data

Grid Middleware VI 7

Storage layers (grid naming terminology)

LFN (Logical File Name) – level 2: like the filename in a traditional file system; may have hierarchical structure; not directly suitable for access, as it is site-independent

GUID (Globally Unique ID) – level 3: opaque handle to reference a specific data object; still independent of the site; the GUID-LFN mapping is 1-n

SURL (Storage URL, or Physical File Name, PFN) – level 3: SE-specific reference to a file, understood by the storage management interface; the GUID-SURL mapping is 1-n

TURL (Transfer URL) – ‘griddy level 4’: the current physical location of a file inside a specific SE; is transient (i.e. only exists after being returned by the SE management interface) and has a specific lifetime; the SURL-TURL mapping is 1-(small number, typically 1)

terminology from EDG, gLite and Globus
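To make the mapping chain concrete, the sketch below walks an LFN down to a TURL. It is plain Java with hypothetical catalogue and SE interfaces invented for illustration; it is not the actual EDG, gLite or Globus API.

interface FileCatalogue    { String guidForLfn(String lfn); }
interface ReplicaCatalogue { java.util.List<String> surlsForGuid(String guid); }
interface StorageElement   { String turlForSurl(String surl, String protocol); }

public class ResolveExample {
    static String resolve(String lfn, FileCatalogue fc,
                          ReplicaCatalogue rc, StorageElement se) {
        String guid = fc.guidForLfn(lfn);                      // site-independent opaque handle
        java.util.List<String> surls = rc.surlsForGuid(guid);  // one SURL per replica (1-n)
        String surl = surls.get(0);                            // replica selection policy omitted
        // the TURL is transient: it only exists after the SE management interface
        // hands it out, and it carries a lifetime
        return se.turlForSurl(surl, "gsiftp");
    }
}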

Page 8: Storage and Data

Grid Middleware VI 8

Data Management Services Overview

Page 9: Storage and Data

Grid Middleware VI 9

Storage concepts

using the OSG-EDG-gLite terminology …

Storage Element: management interface, transfer interface(s)

Catalogues: File Catalogue (meta-data catalogues), Replica Catalogue (location services & indices)

Transfer Service, File Placement, Data Scheduler

Page 10: Storage and Data

Grid Middleware VI 10

Grid Storage Concepts: Storage Element

Storage Element
responsible for manipulating files, on anything from disk to tape-backed mass storage
contains services up to the filename level
the filename is typically an opaque handle for files, as a higher-level file catalogue serves the meta-data, and the same physical file will be replicated to several SEs with different local file names
the SE is a site function (not a VO function)

Capabilities
storage space for files
Storage Management interface (staging, pinning)
space management (reservation)
access (read/write, e.g. via GridFTP, HTTP(S), POSIX-like)
File Transfer Service (controlling the influx of data from other SEs)

Page 11: Storage and Data

Grid Middleware VI 11

Storage Element: grid transfer services

Possibilities

GridFTP
de-facto standard protocol
supports GSI security
features: striping & parallel transfers, third-party transfers (TPTs, like regular FTP) are part of the protocol
issue: firewalls don’t ‘like’ the open port ranges needed by FTP (neither active nor passive)

HTTPS
single port, so more firewall-friendly
an implementation of GSI and delegation is required (mod_gridsite)
TPTs not part of the protocol …

Page 12: Storage and Data

Grid Middleware VI 12

GridFTP

‘secure, robust, fast, efficient, standards based, widely accepted’ data transfer protocol

Protocol-based: multiple independent implementations can interoperate

The Globus Toolkit supplies the reference implementation: server, client tools (globus-url-copy), development libraries
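As an illustration, a simple client-side download with the Java CoG Kit reference libraries might look roughly like the sketch below. The org.globus.ftp class and method names are recalled from memory and should be checked against your Globus Toolkit version; obtaining the GSI proxy credential is omitted, and the host name is a placeholder.

import java.io.File;
import org.globus.ftp.GridFTPClient;
import org.ietf.jgss.GSSCredential;

public class GridFtpGetExample {
    public static void main(String[] args) throws Exception {
        GSSCredential proxy = null;  // load a GSI proxy credential here (omitted)
        // 2811 is the conventional GridFTP control-channel port
        GridFTPClient client = new GridFTPClient("se.example.org", 2811);
        client.authenticate(proxy);  // GSI security on the control channel
        client.get("/data/example/file.dat", new File("file.dat"));
        client.close();
    }
}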

Page 13: Storage and Data

Grid Middleware VI 13

GridFTP: The Protocol

The FTP protocol is defined by several IETF RFCs; start with the most commonly used subset

Standard FTP: get/put etc., 3rd-party transfer

Implement standard but often unused features: GSS binding, extended directory listing, simple restart

Extend in various ways, while preserving interoperability with existing servers: striped/parallel data channels, partial file transfers, automatic & manual TCP buffer setting, progress monitoring, extended restart

source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004

Page 14: Storage and Data

Grid Middleware VI 14

GridFTP: The Protocol (cont)

Existing standards:
RFC 959: File Transfer Protocol
RFC 2228: FTP Security Extensions
RFC 2389: Feature Negotiation for the File Transfer Protocol
Draft: FTP Extensions
GridFTP: Protocol Extensions to FTP for the Grid

Grid Forum Recommendation GFD.20 http://www.ggf.org/documents/GWD-R/GFD-R.020.pdf

source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004

Page 15: Storage and Data

Grid Middleware VI 15

Striped Server Mode

Multiple nodes work together on a single file and act as a single GridFTP server.

An underlying parallel file system allows all nodes to see the same file system and must deliver good performance (usually the limiting factor in transfer speed); i.e., NFS does not cut it.

Each node then moves (reads or writes) only the pieces of the file that it is responsible for.

This allows multiple levels of parallelism: CPU, bus, NIC, disk, etc. Critical if you want to achieve better than 1 Gb/s without breaking the bank.

source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004

Page 16: Storage and Data

Grid Middleware VI 16

GridFTP Striped Transfer

[Figure: a striped third-party transfer. One side is put in extended block mode and listens (MODE E; SPAS returns a list of host:port pairs; STOR <FileName>), while the other side connects to those host:port pairs (MODE E; SPOR; RETR <FileName>). The sending hosts (Host X … Host Z) deal the file blocks out round-robin over the receiving hosts: Block 1 -> Host A, Block 2 -> Host B, Block 3 -> Host C, Block 4 -> Host D, Block 5 -> Host A, and so on up to Block 16.]

source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004
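The round-robin layout in the figure is simple to express; the fragment below is a plain Java illustration (hypothetical names, not Globus code) of how a block index maps to a stripe when blocks are dealt out cyclically over the participating hosts.

public class StripeLayout {
    // with round-robin striping, block i (counting from 0) goes to stripe i mod n
    static int stripeForBlock(int blockIndex, int numStripes) {
        return blockIndex % numStripes;
    }

    public static void main(String[] args) {
        String[] hosts = { "Host A", "Host B", "Host C", "Host D" };
        for (int block = 1; block <= 16; block++) {
            System.out.println("Block " + block + " -> "
                    + hosts[stripeForBlock(block - 1, hosts.length)]);
        }
    }
}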

Page 17: Storage and Data

Grid Middleware VI 17

Disk-to-Disk Striping Performance

[Chart: bandwidth (Mbps, 0 to 20000) versus degree of striping (0 to 70), with one curve per number of parallel streams (1, 2, 4, 8, 16, 32).]

source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004

Page 18: Storage and Data

Grid Middleware VI 18

GridFTP: Caveats

The protocol requires that the sending side do the TCP connect (possible firewall issues). Working on v2 of the protocol:
adds explicit negotiation of streams to relax the directionality requirement above (*)
optionally adds block checksums and resends
adds a unique command ID to allow pipelining of commands

Client / Server
currently there is no server library, therefore peer-to-peer type apps are VERY difficult
generally needs a pre-installed server
looking at a “dynamically installable” server

source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004

(*)DG: like a kind of application-level BEEP protocol

Page 19: Storage and Data

Grid Middleware VI 19

SE transfers: random access

Wide-area random access (R/A) to files is new; it is typically addressed by adding GSI to existing cluster protocols:
dcap -> GSI-dcap; rfio -> GSI-RFIO; xrootd -> ??

One (new) OGSA-style service: WS-ByteIO

Bulk interface and RandomIO interface (POSIX-like)

needs negotiation of the actual transfer protocol: attachment, DIME, …

Page 20: Storage and Data

Grid Middleware VI 20

SE transfer: local back-end access

the backend of a grid store is not always just a disk: distributed storage systems may lack native POSIX access

even if POSIX emulation is provided, it is always slower!

for grid use, one also needs to provide GridFTP and a management interface: SRM

local access might go through the native protocol, but the application may not know it, and it is usually not secure enough to run over the WAN, so it is of no use for ‘non-LAN’ access by others in the grid

Page 21: Storage and Data

Grid Middleware VI 21

Storage Management (SRM)

common management interface on top of many backend storage solutions

a GGF draft standard (from the GSM-WG)

Page 22: Storage and Data

Grid Middleware VI 22

Standards for Storage Resource Management

Main concepts: Allocate spaces

Get/put files from/into spaces

Pin files for a lifetime

Release files and spaces

Get files into spaces from remote sites

Manage directory structures in spaces

SRMs communicate with other SRMs as peers

Negotiate transfer protocols

No logical name space management (can come from GGF-GFS)

source: A. Sim, CRD, LBNL 2005

Page 23: Storage and Data

Grid Middleware VI 23

SRM Functional Concepts

Manage spaces dynamically: reservation, allocation, lifetime; release, compact; negotiation

Manage files in spaces: request to put files into spaces; request to get files from spaces; lifetime, pinning and release of files; no logical name space management (rely on GFS)

Access remote sites for files: bring files from other sites and SRMs as requested; use existing transport services (GridFTP, http, https, ftp, bbftp, …); transfer protocol negotiation

Manage multi-file requests: manage request queues; manage caches, pre-caching (staging) when possible; manage garbage collection

Directory management: manage the directory structure in spaces; Unix semantics: srmLs, srmMkdir, srmMv, srmRm, srmRmdir

Possible Grid access to/from MSS: HPSS, MSS, Enstore, JasMINE, Castor

source: A. Sim, CRD, LBNL 2005
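A typical put cycle against an SRM can be sketched as below. The stubs are hypothetical Java interfaces for illustration only (the real interface is a web service); the operation names loosely follow the srm* methods listed on the next slide, but the signatures are invented.

interface SrmClient {
    String prepareToPut(String surl, long sizeBytes);  // returns a request token
    String statusOfPutRequest(String token);           // e.g. "Pending" or "Ready"
    String transferUrl(String token);                  // the TURL, once assigned
    void   putFileDone(String token);                  // file written; SRM may release or migrate it
}

public class SrmPutExample {
    static void upload(SrmClient srm, String surl, long size) throws InterruptedException {
        String token = srm.prepareToPut(surl, size);
        while (!"Ready".equals(srm.statusOfPutRequest(token))) {
            Thread.sleep(1000);                        // poll until a TURL has been assigned
        }
        String turl = srm.transferUrl(token);
        // ... move the bytes to 'turl' with a negotiated protocol (GridFTP, https, ...) ...
        srm.putFileDone(token);                        // tell the SRM the put has completed
    }
}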

Page 24: Storage and Data

Grid Middleware VI 24

SRM Methods by the features

Core (basic): srmChangeFileStorageType, srmExtendFileLifetime, srmGetFeatures, srmGetRequestSummary, srmGetRequestToken, srmGetSRMStorageInfo, srmGetSURLMetaData, srmGetTransferProtocols, srmPrepareToGet, srmPrepareToPut, srmPutFileDone, srmPutRequestDone, srmReleaseFiles, srmStatusOfGetRequest, srmStatusOfPutRequest, srmTerminateRequest

Space management: srmCompactSpace, srmGetSpaceMetaData, srmGetSpaceToken, srmReleaseFilesFromSpace, srmReleaseSpace, srmReserveSpace, srmUpdateSpace

Authorization functions: srmCheckPermission, srmGetStatusOfReassignment, srmReassignToUser, srmSetPermission

Request administration: srmAbortRequestedFiles, srmRemoveRequestedFiles, srmResumeRequest, srmSuspendRequest

Copy function: srmCopy, srmStatusOfCopyRequest

Directory function: srmCp, srmLs, srmMkdir, srmMv, srmRm, srmRmdir, srmStatusOfCpRequest, srmStatusOfLsRequest

source: A. Sim, CRD, LBNL 2005

Page 25: Storage and Data

Grid Middleware VI 25

SRM interactions

[Diagram: client, DPM daemon, DPNS daemon, SRM daemon, DPM database, and several data servers each running a GridFTP daemon]

1a. SRM put
1b. Put the request into the request database
1c. Return the SRM request ID

Page 26: Storage and Data

Grid Middleware VI 26

SRM Interactions

2a. Get the request from the database
2b. Check permissions and add the entry to the name space (DPNS)
2c. Pick the best data server to put the data onto
2d. Add the TURL to the request database and mark the request ‘Ready’
2e. Add the file to the replica table and set its status to ‘Pending’

Page 27: Storage and Data

Grid Middleware VI 27

SRM Interactions

3a. SRM getRequestStatus
3b. Get the TURL from the request database
3c. Return the TURL to the client

Page 28: Storage and Data

Grid Middleware VI 28

SRM Interactions

4a. SRM (v1) set ‘Running’
4b. Update the status of the request in the database

Page 29: Storage and Data

Grid Middleware VI 29

SRM Interactions

5. The client puts the file to the selected data server via GridFTP

Page 30: Storage and Data

Grid Middleware VI 30

SRM Interactions

6a. SRM (v1) set ‘Done’
6b. Notify the DPM daemon ‘Done’
6c. Get the file size from the data server
6d. Update the replica metadata (size/status/pin time)
6e. Update the status of the request

Page 31: Storage and Data

Grid Middleware VI 31

Storage infra example with SRM

graphic: Mark van de Sanden, SARA

Page 32: Storage and Data

Grid Middleware VI 32

SRM Summary

SRM is a functional definition: adaptable to different frameworks for operation (WS, WSRF, …)

Multiple implementations interoperate: this permits special-purpose implementations for unique products, and permits interchanging one SRM product for another

SRM implementations exist and some are in production use: Particle Physics Data Grid, Earth System Grid, more coming …

Cumulative experiences: SRM v3.0 specifications to complete

source: A. Sim, CRD, LBNL 2005

Page 33: Storage and Data

Grid Middleware VI 33

Replicating Data

Data on the grid may, will and should exist in multiple copies

Replicas may be temporary: kept for the duration of the job; opportunistically stored on cheap but unreliable storage; or containing output cached near a compute site for later scheduled replication

Replicas may also provide redundancy: at the application level, instead of site-local RAID or backup

Page 34: Storage and Data

Grid Middleware VI 34

Replication issues

Replicas are difficult to manage if the data is modifiable and consistency is required

Grid data management today does not address modifiable data sets as soon as more than one copy of the data exists; otherwise the result would be either inconsistency, or close coordination between storage locations would be required (slow), or a deadlock would be almost guaranteed

Some wide-area distributed file systems do this (AFS, DFS), but they are not scalable or require a highly available network

Page 35: Storage and Data

Grid Middleware VI 35

Grid Storage concepts: Catalogues

Catalogues: an index of files that link to a single object (referenced by GUID); catalogues are logically a VO function, with local instances per site

Capabilities: expose mappings, not the actual data

File or Meta-data Catalogue: names, metadata -> GUID
Replica Catalogue and Index: GUID -> SURLs for all SEs containing the file

Page 36: Storage and Data

Grid Middleware VI 36

File Catalogues

Page 37: Storage and Data

Grid Middleware VI 37

graphic: Peter Kunszt, EGEE DJRA1.4 gLite Architecture

Page 38: Storage and Data

Grid Middleware VI 38

Alternatives to the File Catalogue

Store SURLs with the data in the application DB: the schema is better adapted to the application's needs, and integration into existing frameworks is easier

Page 39: Storage and Data

Grid Middleware VI 39

Grid Storage Concepts: Transfer Service

Transfer service
responsible for moving (replicating) data between SEs
transfers are scheduled, as data movement capacity is scarce (not because of WAN network bandwidth, but because of CPU capacity and disk/tape bandwidth in the data movement nodes!)
logically a per-VO function, hosted at the site
builds on top of the SE abstraction and a data movement protocol, and is co-ordinated with a specific SE

Capabilities
transfer a SURL at SE1 to a new SURL at SE2, using SE mechanisms such as SRM-COPY, or directly GridFTP; either push or pull
subject to a set of policies, e.g. a maximum number of simultaneous transfers between SE1 and SE2, with a specific timeout or number of retries
asynchronous, with states like SUBMITTED, PENDING, ACTIVE, CANCELLING, CANCELLED, DONE_INCOMPLETE, DONE_COMPLETE
updates the replica catalogues (GUID->SURL mappings)
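A minimal sketch of such a scheduled transfer job, with the asynchronous states above as a Java enum and hypothetical submit/poll calls (illustration only, not the actual FTS API):

enum TransferState { SUBMITTED, PENDING, ACTIVE, CANCELLING, CANCELLED, DONE_INCOMPLETE, DONE_COMPLETE }

interface TransferService {
    String submit(String sourceSurl, String destSurl);   // returns a job/request identifier
    TransferState status(String jobId);
}

public class TransferJobExample {
    static boolean replicate(TransferService fts, String src, String dst) throws InterruptedException {
        String job = fts.submit(src, dst);                // the transfer is queued, not started at once
        TransferState s;
        do {
            Thread.sleep(5000);                           // scheduled transfer: poll asynchronously
            s = fts.status(job);
        } while (s != TransferState.DONE_COMPLETE
              && s != TransferState.DONE_INCOMPLETE
              && s != TransferState.CANCELLED);
        return s == TransferState.DONE_COMPLETE;          // the caller then updates the replica catalogue
    }
}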

Page 40: Storage and Data

Grid Middleware VI 40

File Transfer Service

graphic: gLite Architecture v1.0 (EGEE-I DJRA1.1)

Page 41: Storage and Data

Grid Middleware VI 41

FTS ‘Channels’

Scheduled number of transfers from one site to a (set of) other sites

below: CERNCI to sites on the OPN (next slide)

Page 42: Storage and Data

Grid Middleware VI 42

FTS channels

for scaling reasons there is one transfer agent for each channel, i.e. each SRC<->TGT pair; the agents can be spread over multiple boxes

Page 43: Storage and Data

Grid Middleware VI 43

[Figure: the LHC Optical Private Network (OPN)]

Page 44: Storage and Data

Grid Middleware VI 44

in network terms

Cricket traffic graph, 2006: CERN -> SARA via the OPN; the link speed is 10 Gb/s

Page 45: Storage and Data

Grid Middleware VI 45

FTS complex services

Protocol translation
although many will, not all SEs support GridFTP; the FTS in that case needs protocol translation
translation through memory excludes third-party transfers

Other issues: credential handling
files on the source and target SE are readable only for specific users and specific VO (groups)
SEs are site services, and sites want to be accessed with the end-user credential for traceability (not a generic “VO” account)
continued access to the user credential is needed (like in any compute broker)

Page 46: Storage and Data

Grid Middleware VI 46

Grid Storage Concept: File Placement

Placement Service
manages transfers for which the host site is the destination
coordinates the updates of the VO file catalogue and the actual transfers (via the FTS, a site-managed service)

Capabilities
transfer a GUID or LFN from A to B (note: the FTS can only operate on SURLs)
needs access to the VO catalogues, and thus sufficient privileges to do the job (i.e. update the catalogues)
the API can be the same as for the FTS
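The difference with the FTS can be made explicit in a short sketch (hypothetical Java interfaces, for illustration only): the placement service resolves the logical name itself and updates the VO catalogue afterwards, which is why it needs catalogue privileges.

import java.util.List;

interface VoCatalogue {
    String guidForLfn(String lfn);
    List<String> surlsForGuid(String guid);
    void addReplica(String guid, String newSurl);            // needs write access to the VO catalogue
}
interface SiteFts { void transfer(String srcSurl, String dstSurl); }  // SURL-level, site-managed service

public class PlacementExample {
    static void replicateTo(VoCatalogue cat, SiteFts fts, String lfn, String destSe) {
        String guid = cat.guidForLfn(lfn);
        String source = cat.surlsForGuid(guid).get(0);        // pick an existing replica
        String dest = "srm://" + destSe + "/vo/data/" + guid; // SURL on the destination SE (made-up layout)
        fts.transfer(source, dest);                           // the SURL-level work is delegated to the FTS
        cat.addReplica(guid, dest);                           // record the new GUID -> SURL mapping
    }
}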

Page 47: Storage and Data

Grid Middleware VI 47

Data Scheduler

Like the placement service, but can direct requests to different sites

Page 48: Storage and Data

Grid Middleware VI 48

DM: Putting it all together

graphic: gLite Architecture v1.0 (EGEE-I DJRA1.1)

Page 49: Storage and Data

Grid Middleware VI 49

GT4 view on the same issues

Similar functionality, but more closely linked to the VO than to the site

based on soft-state registrations (like the information system)

treats files as the basic resource abstraction

next two slides: Ann Chervenak, ISI/USC: Overview of GT4 Data Management Services, 2004

Page 50: Storage and Data

Grid Middleware VI 50

RLS Framework

[Diagram: Local Replica Catalogs (LRCs) at the sites, aggregated by Replica Location Index (RLI) nodes]

• Local Replica Catalogs (LRCs) contain consistent information about logical-to-target mappings
• Replica Location Index (RLI) nodes aggregate information about one or more LRCs
• LRCs use soft-state update mechanisms to inform RLIs about their state: relaxed consistency of the index
• Optional compression of state updates reduces communication, CPU and storage overheads
• A membership service registers participating LRCs and RLIs and deals with changes in membership
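The soft-state idea can be sketched in a few lines of plain Java (hypothetical classes, not the actual RLS API): each LRC periodically pushes a summary of its logical names to the RLIs it is registered with, so an index can answer “which LRC might know this LFN” with relaxed consistency.

import java.util.*;

class LocalReplicaCatalog {                               // LRC: authoritative logical-to-physical mappings
    private final Map<String, List<String>> mappings = new HashMap<String, List<String>>();
    void add(String lfn, String pfn) {
        List<String> pfns = mappings.get(lfn);
        if (pfns == null) { pfns = new ArrayList<String>(); mappings.put(lfn, pfns); }
        pfns.add(pfn);
    }
    Collection<String> logicalNames() { return mappings.keySet(); }
}

class ReplicaLocationIndex {                              // RLI: aggregated, relaxed-consistency view
    private final Map<String, Set<String>> lfnToLrcs = new HashMap<String, Set<String>>();
    void softStateUpdate(String lrcId, Collection<String> lfns) {
        for (Set<String> s : lfnToLrcs.values()) s.remove(lrcId);  // drop this LRC's previous summary
        for (String lfn : lfns) {
            Set<String> lrcs = lfnToLrcs.get(lfn);
            if (lrcs == null) { lrcs = new HashSet<String>(); lfnToLrcs.put(lfn, lrcs); }
            lrcs.add(lrcId);
        }
    }
    Set<String> lrcsThatMayKnow(String lfn) {             // possibly stale: confirm with the LRC itself
        Set<String> lrcs = lfnToLrcs.get(lfn);
        return lrcs == null ? Collections.<String>emptySet() : lrcs;
    }
}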

Page 51: Storage and Data

Grid Middleware VI 51

Replica Location Service in Context

[Diagram: the Replica Location Service alongside GridFTP, the Reliable Data Transfer Service, a Reliable Replication Service, Replica Consistency Management Services and a Metadata Service in a layered data management architecture]

The Replica Location Service is one component in a layered data management architecture. It provides a simple, distributed registry of mappings; consistency management is provided by higher-level services.

Page 52: Storage and Data

Grid Middleware VI 52

Access Control Lists

Catalogue level
protects access to the meta-data
only advisory for actual file access, unless the storage system accepts connections only from a trusted agent that itself does a catalogue lookup

SE level
either natively (i.e. supported by both the SRM and transfer services) or via an agent system like gLiteIO

SRM/transfer level
the SRM and GridFTP servers need to look up the access rights for each transfer in a local ACL store
needs “all files owned by SRM” unless the underlying FS supports ACLs

OS level
native POSIX-ACL support in the OS is needed
only available for a limited number of systems (mainly disk-based)
not (yet) in popular HSM solutions

Page 53: Storage and Data

Grid Middleware VI 53

Grid ACL considerations

Semantics
POSIX semantics require that you traverse up the tree to find all constraints: behaviour that is both costly and possibly undefined in a distributed context
VMS and NTFS container semantics are self-contained, and are taken as the basis for the ACL semantics in many grid services

ACL syntax & local semantics: typically POSIX-style

Page 54: Storage and Data

Grid Middleware VI 54

Catalogue ACL method in GT4 with WS-RF

[Diagram: a client request to the LRC is intercepted by the GT4 Authorization Framework, which consults a custom PDP (policy engine) backed by a policy database]

LRC mappings: LFN1 -> PIDA; LFN2 -> PIDB
Policy database: PIDA: group1: read; group2: all; group3: none; user7: read. PIDB: group1: read, write; group2: all; group3: all

(1) Client request
(2) Custom authorization callout (includes the client request)
(3) Request the PIDs for the logical names
(4) PIDs
(5) Pass policy ID, subject, object and action to the custom PDP
(6) Query the policies for the PIDs
(7) permit or deny
(8) permit or deny
(9) If permitted, pass the client request on to the LRC

graphic: Ann Chervenak, ISI/USC, from presentation to the Design Team, Argonne, 2005

Page 55: Storage and Data

Stand-alone solutions: SRB

the SDSC Storage Resource Broker

Page 56: Storage and Data

Grid Middleware VI 56

SRB Data Management Objectives

Automate all aspects of data management:
Discovery (without knowing the file name)
Access (without knowing its location)
Retrieval (using your preferred API)
Control (without having a personal account at the remote storage system)
Performance (use latency management mechanisms to minimize the impact of wide-area networks)

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 57: Storage and Data

Grid Middleware VI 57

Federated SRB server model

[Diagram: an application sends a logical name or attribute condition to an SRB agent on one SRB server, which consults the MCAT for (1) logical-to-physical mapping, (2) identification of replicas, and (3) access & audit control. The request is brokered peer-to-peer to a second SRB server, which spawns server(s) for parallel data access to the physical resources R1 and R2 (steps 1-6, with 5/6 the parallel data path back to the application).]

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 58: Storage and Data

Grid Middleware VI 58

Features

Authentication: encrypted password, or GSI (certificate-based)

Metadata has it all: storage in a (definable) flat file system
Data is put into collections (Unix directories); access and control operations are possible
parallel transport of files
physical resources combine into a logical resource
encrypted data and/or encrypted metadata
Free-ish (educational); a commercial version of an old SRB is at http://www.nirvanastorage.com

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 59: Storage and Data

Grid Middleware VI 59

SDSC Storage Resource Broker & Meta-data Catalog

[Architecture diagram: access APIs (Unix shell, Java and NT browsers, OAI, WSDL, GridFTP, HRM/ORB, C/C++ libraries, Linux I/O, DLL/Python) talk to the SRB servers, which provide a logical name space, latency management, data and metadata transport, and consistency management / authorization-authentication. A storage abstraction connects to archives (HPSS, ADSM, UniTree, DMF), file systems (Unix, NT, Mac OS X) and databases (DB2, Oracle, Postgres); a catalog abstraction connects the metadata catalog to databases (DB2, Oracle, Postgres, SQLServer, Informix).]

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 60: Storage and Data

Grid Middleware VI 60

Production Data Grid

SDSC Storage Resource Broker: a federated client-server system, managing over 70 TB of data at SDSC and over 10 million files

Manages data collections stored in:
Archives (HPSS, UniTree, ADSM, DMF)
Hierarchical Resource Managers
Tapes, tape robots
File systems (Unix, Linux, Mac OS X, Windows)
FTP sites
Databases (Oracle, DB2, Postgres, SQLServer, Sybase, Informix)
Virtual Object Ring Buffers

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 61: Storage and Data

Grid Middleware VI 61

Mappings on the Name Space

Define a logical resource name: a list of physical resources

Replication: a write to the logical resource completes when all physical resources have a copy

Load balancing: a write to the logical resource completes when a copy exists on the next physical resource in the list

Fault tolerance: a write to the logical resource completes when copies exist on “k” of “n” physical resources

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
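The three completion policies fit in a few lines of plain Java (hypothetical stub, not SRB code): a write to the logical resource returns once the required number of physical copies has been made.

import java.util.List;

interface PhysicalResource { boolean write(String objectName, byte[] data); }

public class LogicalResourceWrite {
    // k = resources.size(): replication; k = 1: load balancing (next resource in the list);
    // 1 < k < n: fault tolerance ("k of n" copies)
    static boolean write(List<PhysicalResource> resources, int k, String name, byte[] data) {
        int copies = 0;
        for (PhysicalResource r : resources) {
            if (r.write(name, data)) copies++;
            if (copies >= k) return true;   // the write to the logical resource completes here
        }
        return false;                        // fewer than k copies could be made
    }
}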

Page 62: Storage and Data

Grid Middleware VI 62

SRB Development

Now at version 3.4 (as of November 2005)
Peer-to-peer federation of ZONES: support for multiple independent MCAT catalogs; replicated metadata
MySQL/BerkeleyDB port
OGSA/OGSI-compliant interface
GridFTP interfaces

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 63: Storage and Data

Grid Middleware VI 63

User Interfaces

Unix Command line tools: S-commands (e.g. Sls, Spwd, Sget, Sput)

Windows SRB browser: InQ; Web interface: mySRB; Java and C APIs; Java admin tools

DEMO

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 64: Storage and Data

Grid Middleware VI 64

Administrative Interface

Java-based admin tool; also available as a Unix command

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 65: Storage and Data

Grid Middleware VI 65

Unix Command-line Tool S*

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 66: Storage and Data

Grid Middleware VI 66

Windows Browser InQ

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 67: Storage and Data

Grid Middleware VI 67

Web Interface

source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC

Page 68: Storage and Data

Grid Middleware VI 68

Nice and Not so Nice

+ It works and is being used in “production”
+ metadata-based
+ it knows GSI and will know GridFTP
- for S-commands the password is in plain text in a file (should not be necessary)
- InQ does not know GSI
- not all interfaces have the same capabilities

source: Maurice Bouwhuis, SARA

Page 69: Storage and Data

Structured Data: OGSA-DAI

Page 70: Storage and Data

Grid Middleware VI 70

Access to structured data

Several layers

access layer: does not virtualise schema and semantics, ‘just get there’; OGSA-DAI, Spitfire (deprecated)

semantic layer: interpret and attempt to merge schemas using ontology discovery; a research topic today, with some interesting results (see e.g. the April VL-e workshop for some nice examples)

Page 71: Storage and Data

Grid Middleware VI 71

OGSA-DAI

An extensible framework for data access and integration: expose heterogeneous data resources to a grid through web services.

Interact with data resources: queries and updates; data transformation / compression; data delivery.

Customise for your project using: additional activities, client toolkit APIs, data resource handlers

A base for higher-level services: federation, mining, visualisation, …

http://www.ogsadai.org.uk/

source: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006

Page 72: Storage and Data

Grid Middleware VI 72

Considerations

Efficient client-server communication: one request specifies multiple operations

No unnecessary data movement: move the computation to the data; utilise third-party delivery; apply transforms (e.g., compression)

Build on existing standards: fill in gaps where necessary (specifications from the DAIS WG)

Do not hide the underlying data model: users must know where to target queries; data virtualisation is hard

Extensible architecture: extensible activity framework; one cannot anticipate all desired functionality, so allow users to plug in their own

based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006

Page 73: Storage and Data

Grid Middleware VI 73

OGSA-DAI services

OGSA-DAI uses data services to represent and provide access to a number of data resources

[Diagram: a Data Service represents, and provides access to, one or more Data Resources]

based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006

Page 74: Storage and Data

Grid Middleware VI 74

Services

Services co-located with the data as much as possible

[Diagram: client applications use the client toolkit to talk to an OGSA-DAI service. Its engine runs activities such as SQLQuery, XPath, readFile, GZip, ToCSV and GridFTP delivery against data service resources (JDBC, XMLDB, File), which are backed by databases (MySQL, DB2, SQLServer), the eXist XML database and SWISS-PROT files.]

based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006

Page 75: Storage and Data

Grid Middleware VI 75

Supported data sources

Relational: MySQL, DB2, Oracle 10, SQLServer, PostgreSQL
XML: eXist, Xindice
Files: text files, binary files, CSV, SwissProt, OMIM

based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006

Page 76: Storage and Data

Grid Middleware VI 76

Service interaction

[Diagram: the client sends a perform document (<?xml?><perform>…</perform>) to the Data Service; the engine runs the chained activities; a response document (<?xml?><response>…</response>) goes back to the client, while the data itself (…011010011101100…) is delivered to a data sink.]

based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006

Page 77: Storage and Data

Grid Middleware VI 77

Data Service internals

from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005

Page 78: Storage and Data

Grid Middleware VI 78

Request/response

<perform xmlns="…" xmlns:xsi="…" xsi:schemaLocation="…">
  <sqlQueryStatement name="statement">
    <expression>select * from littleblackbook where id=10</expression>
    <resultSetStream name="output"/>
  </sqlQueryStatement>
  <deliverToURL name="deliverOutput">
    <fromLocal from="output"/>
    <toURL>ftp://anon:[email protected]/home</toURL>
  </deliverToURL>
</perform>

<gridDataServiceResponse xmlns="…">
  <result name="deliverOutput" status="COMPLETED"/>
  <result name="statement" status="COMPLETED"/>
</gridDataServiceResponse>

from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005

Page 79: Storage and Data

Grid Middleware VI 79

Client library interaction

SQLQuery:
SQLQuery query = new SQLQuery("select * from littleblackbook where id='3475'");

XPathQuery:
XPathQuery query = new XPathQuery("/entry[@id<10]");

XSLTransform:
XSLTransform transform = new XSLTransform();

DeliverToGFTP:
DeliverToGFTP deliver = new DeliverToGFTP("ogsadai.org.uk", 8080, "myresults.txt");

you have to know the backend structure of the data source

from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005

Page 80: Storage and Data

Grid Middleware VI 80

Simple requests

Simple requests consist of only one activity: send the activity directly to the perform method.

SQLQuery query = new SQLQuery("select * from littleblackbook where id='3475'");

Response response = service.perform(query);

from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005
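For requests with more than one activity, the client toolkit of that period let you chain activities together and hand the whole set to perform at once. The fragment below follows the pattern of the tutorial snippets above; the exact class and method names (ActivityRequest, setInput, getOutput) are recalled from memory and should be checked against your OGSA-DAI release.

// chain query -> transform -> GridFTP delivery into a single request (sketch)
transform.setInput(query.getOutput());      // feed the result set into the transform
deliver.setInput(transform.getOutput());    // ship the transformed output via GridFTP

ActivityRequest request = new ActivityRequest();
request.add(query);
request.add(transform);
request.add(deliver);
Response response = service.perform(request);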

Page 81: Storage and Data

Closing Remarks

Page 82: Storage and Data

Grid Middleware VI 82

Miscellaneous tidbits

Data curation: the need to preserve data over time; migrating media (to preserve readability) is only one aspect; format conversion, or emulation of the programs operating on the data, is also needed

Data provenance: the need to know how this data came into being; association of meta-data and workflow; recording of workflows and workflow instances is essential; this is (today) application-specific, but maybe, one day, …