OGF21 Preservation Environments Research Group


Upload: kay

Post on 28-Jan-2016


Page 1: OGF21 Preservation Environments Research Group

www.gridforum.org OGF-21 Software Forum

OGF21 Preservation Environments Research Group

• Organizers:
  Richard Marciano ([email protected])
  Reagan Moore ([email protected])

• Goals:
  Analyze capabilities required by a preservation environment
    Define rule-based preservation environment - iRODS
    NARA Electronic Records Archive capability requirements
    RLG/NARA assessment criteria for a Trusted Digital Repository
  Demonstrate creation of a preservation environment based on data grid technology
  Demonstrate creation of preservation rules controlling a preservation environment
  Analyze capabilities that can be based on grid technology
    iRODS rule-oriented data system

• Participants:
  CASPAR - Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval
  SHAMAN - Sustaining Heritage Access through Multivalent Archiving
  NCRIS - National Collaborative Research Infrastructure Strategy
  PLANETS - Preservation and Long-term Access through Networked Services
  MIT - DSpace digital library
  SDSC - NARA Transcontinental Persistent Archive Prototype
  U Md - Producer Archive Workflow Network
  UK Digital Curation Centre

Page 2: OGF21 Preservation Environments Research Group

Intellectual Property Policy

• I acknowledge that participation in OGF21 is subject to the OGF Intellectual Property Policy.

• Intellectual Property Notices Note Well: All statements related to the activities of the OGF and addressed to the OGF are subject to all provisions of Section 17 of GFD-C.1 (.pdf), which grants to the OGF and its participants certain licenses and rights in such statements. Such statements include verbal statements in OGF meetings, as well as written and electronic communications made at any time or place, which are addressed to:
  the OGF plenary session,
  any OGF working group or portion thereof,
  the GFSG, or any member thereof on behalf of the GFSG,
  the GFAC, or any member thereof on behalf of the GFAC,
  any OGF mailing list, including any working group or research group list, or any other list functioning under OGF auspices,
  the GFD Editor or the GWD process

• Statements made outside of an OGF meeting, mailing list or other function, that are clearly not intended to be input to an OGF activity, group or function, are not subject to these provisions.

• Excerpt from Section 17 of GFD-C.1: Where the GFSG knows of rights, or claimed rights, the OGF secretariat shall attempt to obtain from the claimant of such rights, a written assurance that upon approval by the GFSG of the relevant OGF document(s), any party will be able to obtain the right to implement, use and distribute the technology or works when implementing, using or distributing technology based upon the specific specification(s) under openly specified, reasonable, non-discriminatory terms. The working group or research group proposing the use of the technology with respect to which the proprietary rights are claimed may assist the OGF secretariat in this effort. The results of this procedure shall not affect advancement of document, except that the GFSG may defer approval where a delay may facilitate the obtaining of such assurances. The results will, however, be recorded by the OGF Secretariat, and made available. The GFSG may also direct that a summary of the results be included in any GFD published containing the specification.

OGF Intellectual Property Policies are adapted from the IETF Intellectual Property Policies that support the Internet Standards Process.

Page 3: OGF21 Preservation Environments Research Group

Preservation Requirements

• Authenticity
  Maintain information about provenance of data
  Assertions made about the file at the time of ingestion

• Integrity
  Maintain information about the management of the data
  Assertions made by the archivist
  Access controls, audit trails, checksums, replication, synchronization, federation

• Infrastructure independence
  Management of properties of records independently of choice of storage system

• Scalability
  Management of large collections (billions of records, petabytes of data, thousands of attributes)

Page 4: OGF21 Preservation Environments Research Group

Data Grid Evolution

• Data grids
  Infrastructure independence
  Data sharing through data and trust virtualization
  SRB - Storage Resource Broker

• Rule-based data grids
  Automation of management policies
  Management virtualization
  Open source software
  iRODS - integrated Rule-Oriented Data System

Page 5: OGF21 Preservation Environments Research Group

Data Management Applications

• Data grids
  Share data - organize distributed data as a collection

• Digital libraries
  Publish data - support browsing and discovery

• Persistent archives
  Preserve data - manage technology evolution

• Real-time sensor systems
  Federate sensor data - integrate across sensor streams

• Workflow systems
  Analyze data - integrate client- & server-side workflows

Page 6: OGF21 Preservation Environments Research Group

Generic Infrastructure

• Data grids organize distributed data into shared collections
  Persistent name spaces for files, users, storage
  Collection attributes
    Provenance, descriptive, system metadata

• Data grids manage heterogeneous storage systems
  Standard operations across file systems, tape archives, object ring buffers
  Enable technology evolution
    At the point in time when new technology is available, both the old and new systems can be integrated

Page 7: OGF21 Preservation Environments Research Group

Using a Data Grid – in Abstract

[Figure: a user sends "Ask for data" to the Data Grid and receives "Data delivered"]

• User asks for data from the data grid

• The data is found and returned

• Where & how details are hidden

Page 8: OGF21 Preservation Environments Research Group

Using a Data Grid - Details

[Figure: a user, two iRODS Servers, and the Metadata Catalog (DB)]

• User asks for data

• Data request goes to iRODS Server

• Server looks up information in catalog

• Catalog tells which iRODS server has data

• 1st server asks 2nd for data

• The 2nd iRODS server applies rules
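The request path described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not iRODS code: the `Server` and `Grid` classes, the `audit_rule` function, and the server names are all hypothetical, and the catalog is reduced to a dictionary from logical path to the server that holds the data.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for one grid server with local storage and rules."""
    name: str
    storage: dict = field(default_factory=dict)  # logical path -> bytes
    rules: list = field(default_factory=list)    # callables run before data is returned

    def serve_local(self, path, log):
        for rule in self.rules:                  # the server holding the data applies its rules
            rule(path, log)
        log.append(f"{self.name}: return {path}")
        return self.storage[path]


class Grid:
    """Hypothetical data grid: a catalog plus the servers it knows about."""

    def __init__(self):
        self.catalog = {}   # logical path -> name of the server holding the data
        self.servers = {}

    def add_server(self, server):
        self.servers[server.name] = server

    def register(self, server_name, path, data):
        self.servers[server_name].storage[path] = data
        self.catalog[path] = server_name

    def get(self, contact_name, path):
        log = [f"user: ask {contact_name} for {path}"]
        log.append(f"{contact_name}: look up {path} in catalog")
        owner = self.servers[self.catalog[path]]
        log.append(f"catalog: {path} is held by {owner.name}")
        if owner.name != contact_name:
            log.append(f"{contact_name}: ask {owner.name} for data")
        return owner.serve_local(path, log), log


def audit_rule(path, log):
    """Example rule: record that the object was read."""
    log.append(f"rule: audit read of {path}")


grid = Grid()
grid.add_server(Server("serverA"))
grid.add_server(Server("serverB", rules=[audit_rule]))
grid.register("serverB", "/zone/home/user/file1", b"payload")
data, steps = grid.get("serverA", "/zone/home/user/file1")
```

The user only names the logical path and a contact server; which server holds the bytes and which rules run stay hidden behind `Grid.get`.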

Page 9: OGF21 Preservation Environments Research Group

Extremely Successful

• Storage Resource Broker (SRB) manages 2 PBs of data in internationally shared collections

• Data collections for NSF, NARA, NASA, DOE, DOD, NIH, LC, NHPRC, IMLS; APAC, UK e-Science, IN2P3, KEK, …
  Astronomy - Data grid
  Bio-informatics - Digital library
  Earth Sciences - Data grid
  Ecology - Collection
  Education - Persistent archive
  Engineering - Digital library
  Environmental science - Data grid
  High energy physics - Data grid
  Humanities - Data Grid
  Medical community - Digital library
  Oceanography - Real time sensor data, persistent archive
  Seismology - Digital library, real-time sensor data

• Goal has been generic infrastructure for distributed data

Page 10: OGF21 Preservation Environments Research Group

Columns for each date: GBs of data stored, 1000's of files, users with ACLs (no user counts are given for 5/17/02).

Date                           5/17/02                    6/30/04                     9/4/07
Project                   GBs   1000's      GBs   1000's    Users      GBs   1000's    Users

Data Grid
NSF / NVO              17,800    5,139   51,380    8,690       80   88,216   14,550      100
NSF / NPACI             1,972    1,083   17,578    4,694      380   38,147    7,715      380
Hayden                  6,800       41    7,201      113      178    8,013      161      227
Pzone                     438       31      812       47       49   27,914   16,106       68
NSF / LDAS-SALK           239        1    4,562       16       66  202,312      166       67
NSF / SLAC-JCSG           514       77    4,317      563       47   21,644    2,330       55
NSF / TeraGrid                           80,354      685    2,962  280,247    7,235    3,267
NIH / BIRN                                5,416    3,366      148   21,000   35,301      445
NCAR                                                                36,689      268        2
LCA                                                                  3,445       74        2

Digital Library
NSF / LTER                158        3      233        6       35      260       42       36
NSF / Portal               33        5    1,745       48      384    2,620       53      460
NIH / AfCS                 27        4      462       49       21      733       94       21
NSF / SIO Explorer         19        1    1,734      601       27    2,750    1,202       27
NSF / SCEC                               15,246    1,737       52  168,931    3,545       73
LLNL                                                                16,931    1,895        5
CHRON                                                               12,634    6,299        5

Persistent Archive
NARA                        7        2       63       81       58    4,989    6,390       58
NSF / NSDL                                2,785   20,054      119    7,188   77,479      136
UCSD Libraries                              127      202       29    5,158    1,319       29
NHPRC / PAT                                                          2,576      966       28
RoadNet                                                              3,174    1,321       30
UCTV                                                                 7,140        2        5
LOC                                                                  6,644      192        8
Earth Sci                                                            5,869      647        5

TOTAL                   28 TB    6 mil   194 TB   40 mil    4,635   975 TB  185 mil    5,539

Page 11: OGF21 Preservation Environments Research Group

Requirements Driving Evolution

• Observe that as the size of the shared collections grows, the administrative tasks can become onerous.
  Data grids provide mechanisms to manage recovery from all errors that occur in the distributed environment

• Need to minimize labor support through automation of administrative functions
  File ingestion tasks
  Verification of desired collection properties
  Integrity checks and replica management

Page 12: OGF21 Preservation Environments Research Group

Requirements Driving Evolution

• Observe that each community has unique management policies
  User administration
  File retention & deletion
  Time-dependent access controls
  Data distribution and replication
  File update (versions, backups)
  Descriptive metadata

Page 13: OGF21 Preservation Environments Research Group

Requirements Driving Evolution

• Socialization of collections
  The creators of the collection have specific properties that they assert the collection will possess
    Completeness
    Authoritative sources
    Authenticity
  The users of the collection have their own criteria for the properties they expect

• Socialization is the mapping from creator assertions to user expectations

Page 14: OGF21 Preservation Environments Research Group

Data Grid Mechanisms

• Essential components needed for synergism implemented in SRB
  Infrastructure independence
  Data and trust virtualization

• Components needed for specific management policies and processes implemented in iRODS
  Map policies to rules that control all processes
  Map processes to standard micro-services

Page 15: OGF21 Preservation Environments Research Group

Data Management

Data Management Environment      Conserved Properties   Control Mechanisms    Remote Operations
Management Functions             Assessment Criteria    Management Policies   Capabilities
    Data grid - Management virtualization
Data Management Infrastructure   Persistent State       Rules                 Micro-services
    Data grid - Data and trust virtualization
Physical Infrastructure          Database               Rule Engine           Storage System

iRODS - integrated Rule-Oriented Data System

Page 16: OGF21 Preservation Environments Research Group

Rules

• Rule classes
  System enforced rules
  Administrator controlled rules
  User defined rules

• Rule execution
  Atomic rules - executed on each operation invoked by a client
  Deferred rules - executed at a future time
  Periodic rules - executed to validate assessment criteria and enforce desired properties (integrity)
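The three execution modes can be illustrated with a toy scheduler driven by a simulated clock. This is a sketch under stated assumptions, not how the iRODS rule engine is implemented: `RuleScheduler`, its method names, and the integer time units are all hypothetical.

```python
import heapq
from itertools import count


class RuleScheduler:
    """Toy model of atomic, deferred, and periodic rule execution."""

    def __init__(self):
        self.now = 0
        self.log = []          # (time, rule name, result)
        self._queue = []       # heap of (run_at, seq, name, action, period)
        self._seq = count()    # tie-breaker so actions are never compared

    def atomic(self, name, action):
        # Atomic rules run immediately, as part of the client operation.
        self.log.append((self.now, name, action()))

    def deferred(self, name, action, delay):
        # Deferred rules are queued to run once at a future time.
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), name, action, None))

    def periodic(self, name, action, period):
        # Periodic rules are re-queued after every run.
        heapq.heappush(self._queue, (self.now + period, next(self._seq), name, action, period))

    def advance(self, until):
        """Move the simulated clock forward, running every queued rule that is due."""
        while self._queue and self._queue[0][0] <= until:
            run_at, _, name, action, period = heapq.heappop(self._queue)
            self.now = run_at
            self.log.append((run_at, name, action()))
            if period is not None:
                heapq.heappush(
                    self._queue, (run_at + period, next(self._seq), name, action, period)
                )
        self.now = until


scheduler = RuleScheduler()
scheduler.atomic("onPut", lambda: "stored")
scheduler.deferred("vacuum", lambda: "vacuumed", delay=5)
scheduler.periodic("verifyChecksums", lambda: "verified", period=3)
scheduler.advance(until=7)
```

After `advance(7)` the log holds the atomic rule at time 0, the periodic integrity check at times 3 and 6, and the deferred rule at time 5.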

Page 17: OGF21 Preservation Environments Research Group

iRODS Rule Syntax

• Event | Condition | Action-set | Recovery-set
  Event - triggered by operation or queued rule
  Condition - composed of tests on any attributes in the persistent state information
  Action-set - composed from both micro-services and rules
  Recovery-set - used to ensure transaction semantics and consistent state information

• Executed by a rule engine installed at each storage location - server side workflows
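A rule in this four-part form can be split mechanically. The sketch below assumes the textual layout used in the core.irb listings later in the deck (parts separated by `|`, multiple actions or recoveries separated by `##`); the `Rule` dataclass and `parse_rule` function are hypothetical helpers, and the parser does not handle a `|` that appears inside a condition or argument.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    event: str          # rule name, triggered by an operation or a queued rule
    condition: str      # empty string means "always applies"
    actions: list       # micro-services or other rules, in execution order
    recoveries: list    # one recovery entry per action


def parse_rule(line):
    """Split 'event|condition|action-set|recovery-set' into a Rule."""
    parts = line.strip().split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts separated by '|', got {len(parts)}")
    event, condition, action_set, recovery_set = parts
    return Rule(event, condition, action_set.split("##"), recovery_set.split("##"))


rule = parse_rule(
    "acSetRescSchemeForCreate||msiSetDefaultResc(demoResc,noForce)"
    "##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType)|nop##nop##nop"
)
```

Here `rule.condition` is empty, and the three actions line up one-to-one with three `nop` recoveries.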

Page 18: OGF21 Preservation Environments Research Group

Micro-Services

• Challenge is that storage systems do not provide desired processes
  Have “minimal” set of standard operations that are performed at the storage system
  Have actions required by clients such as replication, metadata extraction
  Create standard micro-services that aggregate storage operations into modules that can be used to implement desired processes.

Page 19: OGF21 Preservation Environments Research Group

Data Virtualization

[Figure: layer labels: Access Interface; Standard Micro-services; Data Grid; Standard Operations; Storage Protocol; Storage System]

Map from the actions requested by the access method to a standard set of micro-services. The standard micro-services are mapped to the operations supported by the storage system.
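The two mappings on this slide can be sketched as follows. Everything here is hypothetical and only illustrates the layering: two toy storage drivers expose the same standard operations (`create`, `write`, `read`) over different internals, two micro-services (`ms_store`, `ms_replicate`) are built only from those operations, and one access-level action (`put_with_replica`) is mapped onto the micro-services.

```python
class FileSystemDriver:
    """Hypothetical driver: standard operations over a dict of files."""

    def __init__(self):
        self.files = {}

    def create(self, path):
        self.files[path] = b""

    def write(self, path, data):
        self.files[path] += data

    def read(self, path):
        return self.files[path]


class TapeArchiveDriver:
    """Hypothetical driver: the same standard operations over append-only records."""

    def __init__(self):
        self.records = []   # list of (path, data) in write order

    def create(self, path):
        self.records.append((path, b""))

    def write(self, path, data):
        self.records.append((path, data))

    def read(self, path):
        return b"".join(data for p, data in self.records if p == path)


def ms_store(driver, path, data):
    """Micro-service: aggregate create + write into one step."""
    driver.create(path)
    driver.write(path, data)


def ms_replicate(source, target, path):
    """Micro-service: copy an object between any two storage systems."""
    ms_store(target, path, source.read(path))


def put_with_replica(path, data, primary, replica):
    """Access-level action mapped onto standard micro-services."""
    ms_store(primary, path, data)
    ms_replicate(primary, replica, path)


disk, tape = FileSystemDriver(), TapeArchiveDriver()
put_with_replica("/zone/coll/obj1", b"record bytes", disk, tape)
```

Because the micro-services touch storage only through the standard operations, adding a new storage technology means writing one more driver and nothing else.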

Page 20: OGF21 Preservation Environments Research Group

integrated Rule-Oriented Data System

[Figure: iRODS architecture diagram. Component labels: Client Interface; Admin Interface; Current State; Rule Invoker; Micro Service Modules (Metadata-based Services); Micro Service Modules (Resource-based Services); Resources; Service Manager; Consistency Check Module (three instances); Rule Modifier Module; Config Modifier Module; Metadata Modifier Module; Engine; Rule; Confs; Rule Base; Metadata Persistent Repository]

Page 21: OGF21 Preservation Environments Research Group

Distributed Management System

[Figure: component labels: Rule Engine; Data Transport; Metadata Catalog; Execution Control; Messaging System; Execution Engine; Virtualization; Server Side Workflow; Persistent State information; Scheduling; Policy Management]

Page 22: OGF21 Preservation Environments Research Group

Micro-service Classes

• Test
• System
• Workflow control
• Client
• iCAT catalog
• User level invoked by “irule”
• Image manipulation

Page 23: OGF21 Preservation Environments Research Group

Digital Preservation

• Preservation community is defining the rules needed to assert trustworthiness of a digital repository
  RLG/NARA - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
  http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf

• Defined 105 rules that are being implemented in iRODS

Page 24: OGF21 Preservation Environments Research Group


RLG/NARA Assessment

• Example TRAC assessment criteria

90 Verify descriptive metadata and source against SIP template and set SIP compliance flag

91 Verify descriptive metadata against semantic term list

92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)

93 Verify consistency of preservation metadata after hardware change or error

Page 25: OGF21 Preservation Environments Research Group

Classes of Assessment Criteria

• Collection properties
  List properties of associated name spaces
  Verify properties
  Compare properties with assertions

• Collection operations
  Transform file formats
  Migrate data
  Generate audit trails

• Structured information
  Parse audit trails to generate compliance reports
  Apply templates to extract information
  Apply templates to format state information

Page 26: OGF21 Preservation Environments Research Group

iRODS Development

• NSF - SDCI grant “Adaptive Middleware for Community Shared Collections”
  iRODS development, SRB maintenance

• NARA - Transcontinental Persistent Archive Prototype
  Trusted repository assessment criteria

• NSF - Ocean Research Interactive Observatory Network (ORION)
  Real-time sensor data stream management

• NSF - Temporal Dynamics of Learning Center data grid
  Management of Institution Research Board approval

Page 27: OGF21 Preservation Environments Research Group

iRODS Development Status

• Current release is version 0.9.2
  June 2007

• Production release will be version 1.0
  Fall quarter 2007

• International collaborations
  SHAMAN - University of Liverpool
    Sustaining Heritage Access through Multivalent ArchiviNg
  UK e-Science data grid
  IN2P3 in Lyon, France
  DSpace policy management

Page 28: OGF21 Preservation Environments Research Group

Planned Development

• GSI support
• Time-limited sessions via a one-way hash authentication
• Python Client library
• GUI Browser (AJAX in development)
• Driver for HPSS (in development)
• Driver for SAM-QFS
• Porting to additional versions of Unix/Linux
• Porting to Windows
• Support for MySQL as the metadata catalog
• API support packages based on existing mounted collection driver
• MCAT to ICAT migration tools
• Extensible Metadata including Databases Access Interface
• Zones/Federation
• Auditing - mechanisms to record and track iRODS persistent state changes

Page 29: OGF21 Preservation Environments Research Group


Preservation Requirements

• What are your required preservation management policies?

• What are your required preservation processes?

• What are your required preservation assessment criteria?

• What preservation systems are you using, and how can the preservation systems interoperate?

• Can a set of records be migrated from your preservation environment into another system while maintaining authenticity, integrity, and chain of custody?

Page 30: OGF21 Preservation Environments Research Group

For More Information

Reagan W. Moore
San Diego Supercomputer Center

[email protected]

http://www.sdsc.edu/srb/
http://irods.sdsc.edu/

Page 31: OGF21 Preservation Environments Research Group

iRODS Rules

Rules for administration
acCreateUser||msiCreateUser##acCreateDefaultCollections##msiCommit|msiRollback##msiRollback##nop
acVacuum(*arg1)||delayExec(msiVacuum,*arg1)|nop
acCreateDefaultCollections||acCreateUserZoneCollections|nop
acCreateUserZoneCollections||acCreateCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acCreateCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName)|nop##nop
acCreateCollByAdmin(*parColl,*childColl)||msiCreateCollByAdmin(*parColl,*childColl)|nop
acDeleteUser||acDeleteDefaultCollections##msiDeleteUser##msiCommit|msiRollback##msiRollback##nop
acDeleteDefaultCollections||acDeleteUserZoneCollections|nop
acDeleteUserZoneCollections||acDeleteCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acDeleteCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName)|nop##nop
acDeleteCollByAdmin(*parColl,*childColl)||msiDeleteCollByAdmin(*parColl,*childColl)|nop

Rule for pre-processing on storage use
acSetRescSchemeForCreate||msiSetDefaultResc(demoResc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType)|nop##nop##nop

Rule for pre-processing on data reads
acPreprocForDataObjOpen||msiSortDataObj(random)|nop

Rule for post processing data writes
acPostProcForPut||nop|nop
acPostProcForCopy||nop|nop

Rule for setting number of threads for parallel I/O
acSetNumThreads||msiSetNumThreads(default,default,default)|nop

Rule for data deletion policy setting
acDataDeletePolicy||nop|nop

Page 32: OGF21 Preservation Environments Research Group

iRODS Demonstration

• Demonstrate generic put command
  ilsresc
  ils -l nvo
  iput -R demoResc ../src/icd.c nvo
  ils -l nvo

• Revise put command to automatically create a replica
  cp core.irb.1 ../../../server/config/reConfigs/core.irb
  ils -l nvo
  iput -R demoResc ../src/ipwd.c nvo
  ils -l nvo

• Illustrate execution of a user-defined rule
  icd
  iput carl.ged foo1
  irule -vF ruleInp3

Page 33: OGF21 Preservation Environments Research Group

iRODS Demonstration

# iRODS Rule Base - core.irb
# Each rule consists of four parts separated by |
# The four parts are: name, conditions, function calls, and recovery.
# The calls and recoveries can be multiple ones, separated by ##.
# For each rule, the number of recovery calls should match the calls;
# for example, if the 2nd call fails, the 2nd recovery call is made.
#
acPreprocForDataObjOpen||msiSortDataObj(random)|nop
acSetRescSchemeForCreate||msiSetDefaultResc(demo2Resc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType)|nop##nop##nop
acDataDeletePolicy||nop|nop
acPostProcForPut||nop|nop

Page 34: OGF21 Preservation Environments Research Group


iRODS Demonstration

# iRODS Rule Base - core.irb

# Each rule consists of four parts separated by |

# The four parts are: name, conditions, function calls, and recovery.

# The calls and recoveries can be multiple ones, separated by ##.

# For each rule, the number of recovery calls should match the calls;

# for example, if the 2nd call fails, the 2nd recovery call is made.

#

acPreprocForDataObjOpen||msiSortDataObj(random)|nop

acSetRescSchemeForCreate||msiSetDefaultResc(demo2Resc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType)|nop##nop##nop

acDataDeletePolicy||nop|nop

acPostProcForPut||nop|nop
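The pairing of calls and recoveries described in the core.irb comments can be sketched as below. This is an illustrative model only: `run_rule` and the Python callables standing in for micro-services are hypothetical, and the sketch implements just what the comments state (the n-th recovery is made when the n-th call fails). Whether recoveries for earlier calls also run is not stated on the slide and is not modeled.

```python
def run_rule(actions, recoveries, services):
    """Run actions in order; if the n-th action fails, make the n-th recovery call.

    `services` maps a call name to a Python callable (a stand-in for a micro-service).
    Returns (succeeded, trace).
    """
    if len(actions) != len(recoveries):
        raise ValueError("the number of recovery calls should match the calls")
    trace = []
    for action, recovery in zip(actions, recoveries):
        try:
            services[action]()
            trace.append(f"ok:{action}")
        except Exception:
            trace.append(f"failed:{action}")
            if recovery != "nop":          # 'nop' means there is nothing to undo
                services[recovery]()
            trace.append(f"recovered:{recovery}")
            return False, trace
    return True, trace


def failing_step():
    raise RuntimeError("simulated failure while creating default collections")


# Call names taken from the acCreateUser rule; the callables are placeholders.
state = {"rolled_back": False}
services = {
    "msiCreateUser": lambda: None,
    "acCreateDefaultCollections": failing_step,
    "msiCommit": lambda: None,
    "msiRollback": lambda: state.update(rolled_back=True),
}
ok, trace = run_rule(
    ["msiCreateUser", "acCreateDefaultCollections", "msiCommit"],
    ["msiRollback", "msiRollback", "nop"],
    services,
)
```

In this run the second call fails, so the second recovery (`msiRollback`) is made and `msiCommit` is never reached.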

Page 35: OGF21 Preservation Environments Research Group

iRODS Demonstration

# iRODS Rule Base
# Each rule consists of four parts separated by |
# The four parts are: name, conditions, function calls, and recovery.
# The calls and recoveries can be multiple ones, separated by ##.
# For each rule, the number of recovery calls should match the calls;
# for example, if the 2nd call fails, the 2nd recovery call is made.
#
acPreprocForDataObjOpen||msiSortDataObj(random)|nop
acSetRescSchemeForCreate||msiSetDefaultResc(demo2Resc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType)|nop##nop##nop
acDataDeletePolicy||nop|nop
acPostProcForPut|$objPath like /tempZone/home/rods/nvo/*|msiSysReplDataObj(nvoReplResc)|nop
acPostProcForPut||nop|nop
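The effect of the added conditional rule can be sketched in Python. Two assumptions are made and are not confirmed by the slides: rules with the same name are tried in file order and the first whose condition holds is used, and `like` behaves as a shell-style wildcard match (modeled here with `fnmatch.fnmatchcase`). The helper names `condition_holds` and `select_actions` are hypothetical.

```python
from fnmatch import fnmatchcase

RULE_BASE = [
    "acPostProcForPut|$objPath like /tempZone/home/rods/nvo/*|msiSysReplDataObj(nvoReplResc)|nop",
    "acPostProcForPut||nop|nop",
]


def condition_holds(condition, session):
    """Evaluate an empty condition or a single '$var like pattern' test."""
    if not condition:
        return True                       # empty condition: rule always applies
    variable, operator, pattern = condition.split(None, 2)
    if operator != "like":
        raise ValueError(f"unsupported operator in this sketch: {operator}")
    return fnmatchcase(session[variable.lstrip("$")], pattern)


def select_actions(event, session, rule_base=RULE_BASE):
    """Return the action list of the first rule matching the event and condition."""
    for line in rule_base:
        name, condition, actions, _recoveries = line.split("|")
        if name == event and condition_holds(condition, session):
            return actions.split("##")
    return []


in_nvo = select_actions("acPostProcForPut", {"objPath": "/tempZone/home/rods/nvo/ipwd.c"})
elsewhere = select_actions("acPostProcForPut", {"objPath": "/tempZone/home/rods/foo1"})
```

A put into the `nvo` collection selects the replication micro-service, while a put anywhere else falls through to the unconditional `nop` rule.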

Page 36: OGF21 Preservation Environments Research Group

iRODS Demonstration

# This is an example of an input for the irule command.
# This first input line is the rule body
# The second input line is the input parameter in the format of label=value.
# Multiple inputs can be specified using the '%' character as the separator.
# The third input line is the output description. For multiple outputs use '%'
myTestRule||msiDataObjOpen(*A,*S_FD)##msiDataObjCreate(*B,null,*D1_FD)##msiDataObjRead(*S_FD,100,*R1_BUF)##msiDataObjWrite(*D1_FD,*R1_BUF,*W1_LEN)##msiDataObjClose(*D1_FD,*junk2)##msiDataObjCreate(*C,null,*D2_FD)##msiDataObjRead(*S_FD,50000,*R2_BUF)##msiDataObjWrite(*D2_FD,*R2_BUF,*W2_LEN)##msiDataObjClose(*D2_FD,*junk3)##msiDataObjClose(*S_FD,*junk4)
*A=/tempZone/home/rods/foo1%*B=/tempZone/home/rods/foo2%*C=/tempZone/home/rods/foo3
*R1_BUF%*W2_LEN%*A
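The three-line layout of the irule input file can be read with a few lines of Python. This is a sketch based only on the comments in the file (rule body, then `label=value` inputs separated by `%`, then outputs separated by `%`); `parse_irule_input` is a hypothetical helper, and the rule body in the sample is shortened for readability.

```python
def parse_irule_input(text):
    """Return (rule_body, inputs, outputs) from the text of an irule input file."""
    lines = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")   # skip blanks and comments
    ]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-comment lines, got {len(lines)}")
    body, input_line, output_line = lines
    inputs = {}
    for item in input_line.split("%"):
        label, _, value = item.partition("=")   # split on the first '=' only
        inputs[label] = value
    outputs = output_line.split("%")
    return body, inputs, outputs


SAMPLE = """\
# This is an example of an input for the irule command.
myTestRule||msiDataObjOpen(*A,*S_FD)##msiDataObjClose(*S_FD,*junk4)
*A=/tempZone/home/rods/foo1%*B=/tempZone/home/rods/foo2%*C=/tempZone/home/rods/foo3
*R1_BUF%*W2_LEN%*A
"""

body, inputs, outputs = parse_irule_input(SAMPLE)
```

The parsed `inputs` dictionary binds `*A`, `*B`, and `*C` to the three object paths, and `outputs` names the variables to report after the rule runs.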

Page 37: OGF21 Preservation Environments Research Group

iRODS Demonstration

• Add and query metadata
  imeta add -d foo1 speed 100 "mph"
  imeta add -d foo1 length 200 "ft"
  imeta add -d foo2 speed 300 "mph"
  imeta add -d foo3 length 400 "ft"
  imeta ls -d foo1
  imeta qu -d speed = 100
  imeta qu -d speed ">=" 100
  imeta qu -d length ">=" 100

• Copy Metadata
  imeta ls -d foo1
  imeta ls -d foo3
  imeta cp -d -d foo1 foo3
  imeta ls -d foo3
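The metadata in this demonstration is a set of attribute, value, unit triples attached to each object. The sketch below models that idea in plain Python so the expected query results can be traced by hand. It is not the imeta implementation: the `MetadataCatalog` class is hypothetical, and comparing values as numbers is an assumption made for this example.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


class MetadataCatalog:
    """Toy in-memory store of (attribute, value, unit) triples per object."""

    def __init__(self):
        self._avus = {}   # object name -> list of (attribute, value, unit)

    def add(self, obj, attribute, value, unit=""):
        self._avus.setdefault(obj, []).append((attribute, value, unit))

    def ls(self, obj):
        return list(self._avus.get(obj, []))

    def query(self, attribute, op, value):
        """Names of objects having `attribute` with a value satisfying `op value`."""
        compare = OPS[op]
        return sorted(
            obj
            for obj, avus in self._avus.items()
            if any(a == attribute and compare(float(v), float(value)) for a, v, _ in avus)
        )

    def cp(self, source, target):
        """Copy every triple from one object to another."""
        self._avus.setdefault(target, []).extend(self.ls(source))


catalog = MetadataCatalog()
catalog.add("foo1", "speed", "100", "mph")
catalog.add("foo1", "length", "200", "ft")
catalog.add("foo2", "speed", "300", "mph")
catalog.add("foo3", "length", "400", "ft")

speed_eq = catalog.query("speed", "=", "100")
speed_ge = catalog.query("speed", ">=", "100")
length_ge = catalog.query("length", ">=", "100")
catalog.cp("foo1", "foo3")
```

After the copy, `foo3` carries its own `length` triple plus the `speed` and `length` triples copied from `foo1`.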

Page 38: OGF21 Preservation Environments Research Group

iRODS Demonstration

• Copy metadata attributes of a file to a collection
  imeta ls -C /tempZone/home/rods
  imeta cp -d -C foo1 /tempZone/home/rods
  imeta ls -C /tempZone/home/rods

Page 39: OGF21 Preservation Environments Research Group

More Information

[email protected]

SRB: http://www.sdsc.edu/srb

iRODS: http://irods.sdsc.edu/