San Diego Supercomputer Center, National Partnership for Advanced Computational Infrastructure

Grid Based Solutions for Distributed Data Management

Reagan W. Moore
San Diego Supercomputer Center

http://www.npaci.edu/DICE

[email protected]

Topics

• Managing data residing in multiple storage systems

• Building collections of distributed data

• Supporting digital library services

• Federating collections

• Preserving collections

Storage Resource Broker

• Generic data management infrastructure that is used to support:
  – Data grids for data sharing
  – Digital libraries for data publication
  – Persistent archives for data preservation

• Manages distributed data on national and international scales
  – California Digital Library
  – NSF National Science Digital Library
  – Worldwide Universities Network data grid

Storage Resource Broker Collections at SDSC (9/27/2004)

Collection | GBs of data stored | Number of files | Number of users

Data Grid
NSF/ITR - National Virtual Observatory | 53,778 | 9,507,399 | 80
NSF - National Partnership for Advanced Computational Infrastructure | 22,165 | 5,156,765 | 380
Hayden Planetarium - Evolution of the Solar System visualizations | 7,201 | 113,600 | 178
NSF/NPACI - Joint Center for Structural Genomics | 5,228 | 652,031 | 50
NSF/NPACI - Biology and Environmental collections | 8,704 | 21,881 | 67
NSF - TeraGrid, ENZO Cosmology simulations | 104,370 | 908,600 | 3,247
NIH - Biomedical Informatics Research Network | 5,808 | 3,777,886 | 172

Digital Library
NLM - Digital Embryo image collection | 720 | 45,365 | 23
NSF/NPACI - Long Term Ecological Reserve | 251 | 8,381 | 36
NSF/NPACI - Grid Portal | 1,917 | 49,665 | 392
NIH - Alliance for Cell Signaling microarray data | 776 | 60,177 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,122 | 758,233 | 27
NSF/NPACI - Transana education research video collection | 92 | 2,387 | 26
NSF/ITR - Southern California Earthquake Center | 88,199 | 1,790,319 | 59

Persistent Archive
UCSD Libraries image collection | 128 | 203,930 | 29
NARA - Research Prototype Persistent Archive | 89 | 254,470 | 58
NSF - National Science Digital Library persistent archive | 3,571 | 26,908,350 | 122

TOTAL | 305 TB | 50 million | 4,967
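
The totals row can be cross-checked against the per-collection figures. The short Python sketch below simply sums the three columns; it assumes the BIRN file count printed as "3,777886" means 3,777,886 and that 1 TB is counted as 1000 GB. The shortened collection labels are for illustration only.

```python
# (collection, GBs stored, number of files, number of users), copied from the table
rows = [
    ("National Virtual Observatory", 53_778, 9_507_399, 80),
    ("NPACI", 22_165, 5_156_765, 380),
    ("Hayden Planetarium", 7_201, 113_600, 178),
    ("Joint Center for Structural Genomics", 5_228, 652_031, 50),
    ("Biology and Environmental collections", 8_704, 21_881, 67),
    ("TeraGrid, ENZO Cosmology simulations", 104_370, 908_600, 3_247),
    ("Biomedical Informatics Research Network", 5_808, 3_777_886, 172),  # assumed reading
    ("Digital Embryo image collection", 720, 45_365, 23),
    ("Long Term Ecological Reserve", 251, 8_381, 36),
    ("Grid Portal", 1_917, 49_665, 392),
    ("Alliance for Cell Signaling", 776, 60_177, 21),
    ("NSDL SIO Explorer collection", 2_122, 758_233, 27),
    ("Transana video collection", 92, 2_387, 26),
    ("Southern California Earthquake Center", 88_199, 1_790_319, 59),
    ("UCSD Libraries image collection", 128, 203_930, 29),
    ("NARA Research Prototype Persistent Archive", 89, 254_470, 58),
    ("NSDL persistent archive", 3_571, 26_908_350, 122),
]

total_gb = sum(row[1] for row in rows)
total_files = sum(row[2] for row in rows)
total_users = sum(row[3] for row in rows)

# Report in the same units as the TOTAL row (TB, millions of files, users).
print(f"{total_gb / 1000:.0f} TB, {total_files / 1e6:.0f} million files, {total_users} users")
```

Under those assumptions the column sums agree with the rounded totals shown on the slide.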

Managing Distributed Data

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Naming conventions provided by storage systems

Storage Resource Broker Data Grid

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Discovery

• Data grids associate metadata with each digital entity (file, SQL command, URL) that is registered
  – Administrative metadata (location of file, owner, access controls, size, audit trail)
  – Descriptive metadata (Dublin Core, annotations)
  – Curator-defined metadata (can define collection-level metadata, and metadata unique to a digital entity)

• Metadata query mechanisms include:
  – Web browsers, DSpace, OAI-PMH, WSDL, Perl, Python, Windows browser, Java class library, Unix shell commands, C library calls
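
As a rough illustration of the metadata model on this slide, the Python sketch below keeps the three kinds of metadata on each registered digital entity and answers a simple attribute query. The class, field names, and sample records are hypothetical; this is not the SRB catalog schema or its API.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalEntity:
    """Hypothetical catalog record; not the actual SRB schema."""
    logical_name: str
    kind: str  # "file", "sql", or "url"
    administrative: dict = field(default_factory=dict)   # owner, size, location, ...
    descriptive: dict = field(default_factory=dict)      # e.g. Dublin Core elements
    curator_defined: dict = field(default_factory=dict)  # collection-specific attributes

    def metadata(self) -> dict:
        """Merge the three metadata groups into one attribute view."""
        return {**self.administrative, **self.descriptive, **self.curator_defined}


def query(catalog, attribute, predicate):
    """Return logical names of entities whose attribute satisfies the predicate."""
    matches = []
    for entity in catalog:
        attributes = entity.metadata()
        if attribute in attributes and predicate(attributes[attribute]):
            matches.append(entity.logical_name)
    return matches


# Illustrative records only.
catalog = [
    DigitalEntity(
        "/demo/images/section_001.tif", "file",
        administrative={"owner": "curator1", "size": 2048},
        descriptive={"title": "Section 1"},
        curator_defined={"stage": 12},
    ),
    DigitalEntity(
        "/demo/links/project_page", "url",
        administrative={"owner": "curator2", "size": 0},
        descriptive={"title": "Project home page"},
    ),
]

large_entities = query(catalog, "size", lambda value: value > 1024)
stage_12 = query(catalog, "stage", lambda value: value == 12)
print(large_entities, stage_12)
```

The same query function works across administrative, descriptive, and curator-defined attributes because all three are exposed through one merged view.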

Search Capabilities

• Browse within collection hierarchy

• Search by attribute name and operations on attribute values across all types of metadata
  – Dublin Core attributes
  – Administrative attributes
  – Curator-defined attributes

• SRB manages access controls on metadata attributes and on digital entities
  – Metadata not displayed for digital entities that have restricted access
  – Metadata not displayed for attributes that have restricted access
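
The two access-control rules above can be sketched in a few lines of Python. This is only an illustration under assumed data structures (a dictionary catalog, a per-entity reader set, and a per-attribute access list, all hypothetical); it does not reflect how SRB stores or enforces access controls.

```python
def visible_metadata(user, entity, attribute_acl):
    """Return the metadata of one entity that `user` is allowed to see.

    entity: dict with "readers" (set of users) and "metadata" (dict).
    attribute_acl: attribute name -> set of users allowed to see it;
                   attributes not listed are treated as unrestricted.
    """
    if user not in entity["readers"]:
        return {}  # restricted entity: show no metadata at all
    shown = {}
    for name, value in entity["metadata"].items():
        allowed = attribute_acl.get(name)
        if allowed is None or user in allowed:
            shown[name] = value
    return shown


def search(user, catalog, attribute, predicate, attribute_acl):
    """Search by attribute, considering only what `user` may see."""
    hits = []
    for logical_name, entity in catalog.items():
        shown = visible_metadata(user, entity, attribute_acl)
        if attribute in shown and predicate(shown[attribute]):
            hits.append(logical_name)
    return sorted(hits)


# Illustrative data only.
catalog = {
    "/demo/a.dat": {"readers": {"alice", "bob"},
                    "metadata": {"title": "Survey A", "reviewer_note": "draft"}},
    "/demo/b.dat": {"readers": {"alice"},
                    "metadata": {"title": "Survey B", "reviewer_note": "final"}},
}
attribute_acl = {"reviewer_note": {"alice"}}


def is_survey(value):
    return value.startswith("Survey")


alice_titles = search("alice", catalog, "title", is_survey, attribute_acl)
bob_titles = search("bob", catalog, "title", is_survey, attribute_acl)
bob_notes = search("bob", catalog, "reviewer_note", lambda value: True, attribute_acl)
print(alice_titles, bob_titles, bob_notes)
```

In this sketch a user without access to an entity gets no hits for it, and a restricted attribute is invisible even on entities the user can otherwise read.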

Access Mechanisms

• Files, clicking on the record downloads the file
• URLs, clicking redirects to the web page
• SQL commands, clicking causes the SQL command (with input parameters) to be issued to the database and the result is returned as HTML or XML

• Additional operations that support
  – Replication / Caching / Staging / Pre-fetch (partial read) / Bulk unload / Parallel I/O streams / Remote procedures for filtering and subsetting

• Asynchronous interfaces:
  – DSpace mechanisms, Storage Resource Manager

Timeliness

• Data grids self-consistently manage all registered digital entities
  – All operations on digital entities automatically update the administrative metadata
  – Synchronization flags kept for replicas
  – Write locks kept for files aggregated into containers

• Federated digital libraries are synchronized under curator control
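
One way to picture synchronization flags on replicas is the following Python sketch. It assumes a simple model in which a write to one replica marks every other replica as stale and each operation is appended to an audit trail standing in for the administrative metadata. The class and resource names are hypothetical, and write locks on containers are not modeled.

```python
class ReplicatedEntity:
    """Hypothetical model of one digital entity with several replicas."""

    def __init__(self, logical_name, content, resources):
        self.logical_name = logical_name
        self.replicas = {resource: content for resource in resources}
        # True means the replica is out of date relative to the latest write.
        self.stale = {resource: False for resource in resources}
        self.audit_trail = []  # stand-in for administrative metadata

    def write(self, resource, content):
        """Update one replica, flag the others as stale, record the operation."""
        self.replicas[resource] = content
        for name in self.stale:
            self.stale[name] = name != resource
        self.audit_trail.append(("write", resource))

    def synchronize(self):
        """Copy the current content to every stale replica."""
        current = [name for name, is_stale in self.stale.items() if not is_stale]
        source = current[0]  # at least one replica is always current
        for name, is_stale in self.stale.items():
            if is_stale:
                self.replicas[name] = self.replicas[source]
                self.stale[name] = False
        self.audit_trail.append(("synchronize", source))


entity = ReplicatedEntity("/demo/report.txt", "v1", ["disk-sdsc", "tape-sdsc"])
entity.write("disk-sdsc", "v2")
stale_before = dict(entity.stale)  # snapshot of the flags before synchronization
entity.synchronize()
print(stale_before, entity.replicas)
```

Calling `synchronize` explicitly, rather than on every write, loosely mirrors the idea that synchronization happens under curator or administrator control.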

Federation

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection B

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection A

Access controls and consistency constraints on cross registration of digital entities

Consistency Constraints

• Master-slave data grids
  – The entries in the slave data grid are registered under control of the master data grid

• Peer-to-peer data grids
  – Curators register selected material into another data grid. Access controls are kept by the original data grid.

• Central repository
  – Remote data grids push material, user names, metadata into a central repository

• Deep archive
  – Digital entities and metadata are replicated into a data grid under curator control, but no other users are allowed access

Software Costs

• Storage Resource Broker clients are open source - distributed for free

• Storage Resource Broker server source code is distributed to academic institutions for free
  – Commercial companies should talk to the University of California, San Diego Technology Transfer Office for server source code

• SRB data grid uses commercially available systems for storing:
  – Metadata - Oracle, DB2, Sybase, Informix, PostgreSQL, mySQL
  – Files - Unix file systems, Linux, Mac OS X, Windows, binary large objects in databases, object ring buffers, HPSS, UniTree, ADSM, DMF, archival storage systems

• If you use Postgres or mySQL for your database, the cost is zero. However, large collections (millions of files) should use a commercial database.

Hardware Costs

• SRB software can be installed on laptops (Windows, Linux, Mac OS X), servers (Sun, Linux, Irix, AIX, HP), and supercomputers (clusters)
  – Installation on a Mac laptop takes 15 minutes, including a Postgres database, metadata catalog, server, and clients

• Grid Bricks - commodity-based disk systems
  – Provide 2.5 GHz CPU, 1 Gbyte of memory, Gig-E network connection, 5 terabytes of disk, RAID controller, Linux operating system
  – Effective cost is $2000 per terabyte
  – Modular system that can be expanded by adding grid bricks. The SRB data grid manages global name spaces.

• If you use your own storage system, the cost is zero

Processing and Administrative Costs

• SRB data grid supports digital entities
  – Any type of file can be stored
  – Files can be registered from an existing storage system, preserving both the organization and names

• Administration costs
  – Data grid administrator - manage the data grid servers, track problems with access to storage systems, installation of additional servers, registration of users
  – Database administrator - manage the database in which the metadata is stored, perform backups, track software upgrades
  – Security, network, and storage system administrators - standard administrative support for storage systems and networks

Summary

• SRB provides collection management of data distributed across multiple storage systems
  – Support technology evolution - migration to new storage systems and new databases
  – Support federation - controlled sharing and publication of data between data grids
  – Support preservation - tracking of audit trails, checksums for validating integrity
  – Support all sizes of collections - thousands to hundreds of millions of records

[Figure: Storage Resource Broker architecture diagram. Labels recovered from the slide:]

• Application
• Access interfaces: C, C++, Java libraries; Unix shell; Linux I/O; DLL / Python, Perl; Java, NT browsers; OAI, WSDL, WSRF; HTTP, DSpace; OpenDAP
• Federation management: Data Grid Federation - zoneSRB
• Consistency & metadata management / authorization, authentication, audit
• Logical name space, latency management, data transport, metadata transport
• Catalog abstraction: databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix)
• Storage repository virtualization: archives (tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS); ORB; file systems (Unix, NT, Mac OS X); databases (DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix)

For More Information

Reagan W. Moore
San Diego Supercomputer Center

[email protected]

http://www.npaci.edu/DICE

http://www.npaci.edu/DICE/SRB

http://www.npaci.edu/dice/srb/mySRB/mySRB.html

PRDLA Collection at SDSC (2003)

• Collection size
  – 800 Gbytes
  – 14 million files

• Server capacity
  – Windows NT with 2 Tbytes disk
  – AIT2 tape library for backup, 1 Tbyte of tape
  – 3 web servers

• Access rate
  – Average 1 million web page accesses per month
  – Does not count Siku server

Data Grid Opportunities

• Provide uniform interface to data collections that reside at member sites
  – Provides way to extend PRDLA published holdings by incorporating new material

• Replicate collections between sites
  – Provides way to protect against natural disasters

• Integrate file access with archive access
  – Provides way to preserve collections

Data Grids

• Software systems that manage distributed data
  – Organize distributed data into a logical collection

• Provide global naming conventions
  – Location independent identifiers

• Support curation processes
  – Access controls for adding files
  – Browsing and discovery services

Accessing Data at Multiple Sites

Archive at SDSC

File System in Australia

File System in Taiwan

User Application

Each site has their own naming convention for files

A data grid provides a uniform way to name and access the files across the sites

Building Distributed Collection

Archive at SDSC

File System in Australia

File System in Taiwan

Data Grid

Common naming convention and set of attributes for describing digital entities

User Application

• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system

Collection Metadata Catalog

Logical file name space

(associate metadata attributes with the logical file name)

• Physical location of the file
• Name of the file on the storage system
• Size of the file
• Owner of the file
• Access controls on the file

(associate digital library attributes with the logical file name)

• Descriptive metadata about the file
• Dublin Core provenance information about the file
• Annotations on the file

Storage Systems Provide

• File name - naming convention for files

• Storage location - IP address of the storage system

• User name - persons who have access to the storage system

• File context (creation date,…) - state information about each file

• Access constraints - controls on access

Each storage repository uses a different set of naming conventions

Managing Distributed Data

(Replace naming conventions used by a storage repository with naming conventions managed by the data grid)

Storage Repository

• File name

• Storage location

• User name

• File context (creation date,…)

• Access constraints

Data Grid

• Logical file name space

• Logical resource name space

• Logical user name space

• Logical metadata context

• Control/consistency constraints
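
The replacement of repository naming conventions by data grid name spaces can be sketched as a small lookup catalog. The Python below is a minimal illustration, not SRB code: the class, the resource names, the host names (under example.org), and the paths are all hypothetical, and only the logical file and logical resource name spaces are modeled.

```python
class DataGridCatalog:
    """Hypothetical catalog mapping logical names to physical storage details."""

    def __init__(self):
        self.resources = {}  # logical resource name -> storage host
        self.files = {}      # logical file name -> (logical resource, physical path)

    def add_resource(self, logical_resource, host):
        self.resources[logical_resource] = host

    def register(self, logical_file, logical_resource, physical_path):
        """Register (or re-register) where a logical file physically lives."""
        if logical_resource not in self.resources:
            raise KeyError(f"unknown resource: {logical_resource}")
        self.files[logical_file] = (logical_resource, physical_path)

    def resolve(self, logical_file):
        """Translate a logical file name into (host, physical path)."""
        logical_resource, physical_path = self.files[logical_file]
        return self.resources[logical_resource], physical_path


grid = DataGridCatalog()
grid.add_resource("archive-sdsc", "archive.example.org")
grid.add_resource("disk-taiwan", "fs-taiwan.example.org")

grid.register("/prdla/maps/map001.tif", "archive-sdsc", "/hpss/u42/0001.dat")
before = grid.resolve("/prdla/maps/map001.tif")

# Moving the file changes only the catalog entry; the logical name is unchanged.
grid.register("/prdla/maps/map001.tif", "disk-taiwan", "/export/data/map001.tif")
after = grid.resolve("/prdla/maps/map001.tif")
print(before, after)
```

Applications that refer to the logical name keep working after the move, which is the point of managing the naming convention in the data grid rather than in each storage repository.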

Accessing Multiple Types of Storage Systems

User Application

Archive at SDSC

Database in Australia

File System in Taiwan

Standard Data Access Operations

Common set of operations for interacting with every type of storage repository

User Application

• Remote operations
• Unix file system
• Latency management
• Procedures
• Transformations
• Third party transfer
• Filtering
• Queries

Archive at SDSC

Database in Australia

File System in Taiwan

Data Grid Applications

• Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information

• Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata

• Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

Provide uniform interface to data collections that reside at member sites

• Install a Storage Resource Broker application level server on each storage system that holds data

• Register the data into the PRDLA data grid
  – Establishes a logical file name for each file

• Create a collection hierarchy to support browsing and discovery
  – Register PRDLA metadata for each file
  – The SRB data grid manages the metadata for the data grid; automatically updates information on the location of the file

• Provide web-based access to the collections
  – Other access mechanisms support bulk load operations

Replicate collections between sites

• Use data grid commands to replicate a collection onto a remote storage system
  – Information about the replicated files is kept in the metadata catalog

• Provides a way to support load balancing
  – Sites access data that is closer to them

• Provides a way to protect against a local natural disaster
  – Files can be retrieved from the remote site

Integrate file access with archive access

• Can also replicate metadata catalog between sites
  – Provides way to manage long-term preservation, a deep archive

• Data grid provides synchronization mechanisms to update the metadata catalog
  – Can control execution of the synchronization mechanisms

• Data grid provides file validation mechanisms to verify file integrity (checksums)
  – Can verify a local copy against the checksums stored in the metadata catalog
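
Verifying a local copy against a stored checksum can be sketched with the Python standard library. The slides do not say which checksum algorithm the data grid uses, so SHA-256 here is an assumption, and the dictionary standing in for the metadata catalog, the function names, and the logical file name are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_local_copy(path, logical_name, catalog_checksums):
    """Compare a local file against the checksum recorded for its logical name."""
    return file_checksum(path) == catalog_checksums[logical_name]


# Stand-in for the metadata catalog: logical file name -> recorded checksum.
payload = b"example collection data"
logical_name = "/demo/collection/item.dat"
catalog_checksums = {logical_name: hashlib.sha256(payload).hexdigest()}

with tempfile.TemporaryDirectory() as workdir:
    local_copy = os.path.join(workdir, "item.dat")
    with open(local_copy, "wb") as handle:
        handle.write(payload)
    intact = verify_local_copy(local_copy, logical_name, catalog_checksums)

    with open(local_copy, "ab") as handle:
        handle.write(b"!")  # simulate corruption of the local copy
    damaged = not verify_local_copy(local_copy, logical_name, catalog_checksums)

print(intact, damaged)
```

Reading in chunks keeps memory use flat for large archive files; a copy that fails the check could then be replaced from a replica at another site.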

PRDLA Data Grid

• Propose the formation of a data grid linking PRDLA sites
  – Support data sharing
  – Support integration of digital libraries
  – Support preservation environments

• Storage Resource Broker data grid is in production use in international projects

Data Grid Installations

• Australia - University of Queensland, APAC
• Japan - KEK (Tsukuba)
• Korea - KISTI, Korea Institute of Science and Technology Information
• Singapore - National University of Singapore
• Taiwan - National Taiwan University
• University of California - California Digital Library, UCSD