San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure1
Grid Based Solutions for Distributed Data Management
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/[email protected]
Topics
• Managing data residing in multiple storage systems
• Building collections of distributed data
• Supporting digital library services
• Federating collections
• Preserving collections
Storage Resource Broker
• Generic data management infrastructure that is used to support:
  – Data grids for data sharing
  – Digital libraries for data publication
  – Persistent archives for data preservation
• Manages distributed data on national and international scales
  – California Digital Library
  – NSF National Science Digital Library
  – Worldwide Universities Network data grid
Storage Resource Broker Collections at SDSC (9/27/2004)
Collection                                                            GB stored        Files   Users
Data Grid
NSF/ITR - National Virtual Observatory                                   53,778    9,507,399      80
NSF - National Partnership for Advanced Computational Infrastructure     22,165    5,156,765     380
Hayden Planetarium - Evolution of the Solar System visualizations         7,201      113,600     178
NSF/NPACI - Joint Center for Structural Genomics                          5,228      652,031      50
NSF/NPACI - Biology and Environmental collections                         8,704       21,881      67
NSF - TeraGrid, ENZO Cosmology simulations                              104,370      908,600   3,247
NIH - Biomedical Informatics Research Network                             5,808    3,777,886     172
Digital Library
NLM - Digital Embryo image collection                                       720       45,365      23
NSF/NPACI - Long Term Ecological Reserve                                    251        8,381      36
NSF/NPACI - Grid Portal                                                   1,917       49,665     392
NIH - Alliance for Cell Signaling microarray data                           776       60,177      21
NSF - National Science Digital Library SIO Explorer collection            2,122      758,233      27
NSF/NPACI - Transana education research video collection                     92        2,387      26
NSF/ITR - Southern California Earthquake Center                          88,199    1,790,319      59
Persistent Archive
UCSD Libraries image collection                                             128      203,930      29
NARA - Research Prototype Persistent Archive                                 89      254,470      58
NSF - National Science Digital Library persistent archive                 3,571   26,908,350     122
TOTAL                                                                    305 TB   50 million   4,967
Managing Distributed Data
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Naming conventions provided by storage systems
Storage Resource Broker Data Grid
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection
Data Access Methods (Web Browser, DSpace, OAI-PMH)
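The mapping from storage-system names to logical names can be illustrated with a minimal sketch. This is not the SRB implementation or its API; the class names, resource names, and paths below are all hypothetical, chosen only to show how a logical file name can stay fixed while the physical location changes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str        # logical resource name (hypothetical, e.g. "archive-a")
    physical_path: str   # file name as the storage system knows it


@dataclass
class LogicalNamespace:
    """Toy catalog: logical file name -> physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path):
        """Record an existing physical file under a logical name."""
        self.entries.setdefault(logical_name, []).append(
            Replica(resource, physical_path))

    def resolve(self, logical_name):
        """Return (resource, physical_path) pairs for a logical name."""
        return [(r.resource, r.physical_path)
                for r in self.entries.get(logical_name, [])]

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        """Point a replica at new storage; the logical name is unchanged."""
        for replica in self.entries[logical_name]:
            if replica.resource == old_resource:
                replica.resource = new_resource
                replica.physical_path = new_path


namespace = LogicalNamespace()
namespace.register("/demo/maps/pacific.tif", "archive-a", "/hpss/u1234/f0001")
namespace.migrate("/demo/maps/pacific.tif", "archive-a", "disk-b", "/data/f0001")
print(namespace.resolve("/demo/maps/pacific.tif"))
```

An application that only ever uses the logical name "/demo/maps/pacific.tif" is unaffected by the migration, which is the point of replacing storage-system naming conventions with logical name spaces.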
Discovery
• Data grids associate metadata with each digital entity (file, SQL command, URL) that is registered
  – Administrative metadata (location of file, owner, access controls, size, audit trail)
  – Descriptive metadata (Dublin core, annotations)
  – Curator-defined metadata (can define collection level metadata, and metadata unique to a digital entity)
• Metadata query mechanisms include:
  – Web browsers, DSpace, OAI-PMH, WSDL, Perl, Python, Windows browser, Java class library, Unix shell commands, C library calls
Search Capabilities
• Browse within collection hierarchy
• Search by attribute name and operations on attribute values across all types of metadata
  – Dublin core attributes
  – Administrative attributes
  – Curator-defined attributes
• SRB manages access controls on metadata attributes and on digital entities
  – Metadata not displayed for digital entities that have restricted access
  – Metadata not displayed for attributes that have restricted access
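The two access-control rules above can be sketched as a filter applied during an attribute search. This is an illustrative toy, not SRB code: the catalog layout, the user names, and the `search` function are assumptions made for the example.

```python
import operator

# Comparison operators a query may apply to an attribute value.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

# Hypothetical catalog: each digital entity lists who may read it.
catalog = [
    {"name": "a.tif", "readers": {"alice", "bob"},
     "metadata": {"creator": "Smith", "size": 1200}},
    {"name": "b.tif", "readers": {"alice"},
     "metadata": {"creator": "Smith", "size": 90}},
]

# Hypothetical attribute-level restriction: attribute -> users allowed to see it.
restricted_attributes = {"size": {"alice"}}


def visible_metadata(entity, user):
    """Drop attributes the user is not allowed to see."""
    return {key: value for key, value in entity["metadata"].items()
            if key not in restricted_attributes
            or user in restricted_attributes[key]}


def search(user, attribute, op, value):
    """Return (name, visible metadata) for entities matching the query."""
    hits = []
    for entity in catalog:
        if user not in entity["readers"]:
            continue  # restricted entity: nothing about it is displayed
        metadata = visible_metadata(entity, user)
        if attribute in metadata and OPS[op](metadata[attribute], value):
            hits.append((entity["name"], metadata))
    return hits


print(search("alice", "size", ">", 100))
print(search("bob", "creator", "=", "Smith"))
```

In this sketch a restricted attribute is also unusable as a search key, so a user cannot infer hidden values by probing with queries; whether SRB behaves the same way is not stated in the slides.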
Access Mechanisms
• Files: clicking on the record downloads the file
• URLs: clicking redirects to the web page
• SQL commands: clicking causes the SQL command (with input parameters) to be issued to the database, and the result is returned as HTML or XML
• Additional operations that support
  – Replication / Caching / Staging / Pre-fetch (partial read) / Bulk unload / Parallel I/O streams / Remote procedures for filtering and subsetting
• Asynchronous interfaces:
  – DSpace mechanisms, Storage Resource Manager
Timeliness
• Data grids self-consistently manage all registered digital entities
  – All operations on digital entities automatically update the administrative metadata
  – Synchronization flags kept for replicas
  – Write locks kept for files aggregated into containers
• Federated digital libraries are synchronized under curator control
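One way to picture synchronization flags and automatic metadata updates is a per-replica version counter plus an audit trail. The sketch below is an assumption about how such bookkeeping could work, not a description of SRB internals; the class and site names are hypothetical.

```python
class ReplicatedEntity:
    """Toy digital entity that tracks which replicas are out of date."""

    def __init__(self, logical_name, resources):
        self.logical_name = logical_name
        self.version = 0
        # Version of the content currently held at each resource.
        self.replica_versions = {resource: 0 for resource in resources}
        self.audit_trail = []

    def write(self, resource, user):
        """Update one replica; the other replicas become stale."""
        self.version += 1
        self.replica_versions[resource] = self.version
        self.audit_trail.append(("write", user, resource))

    def stale_replicas(self):
        """Resources whose copy is older than the latest write."""
        return sorted(resource
                      for resource, version in self.replica_versions.items()
                      if version < self.version)

    def synchronize(self, user):
        """Bring every stale replica up to date and record the action."""
        for resource in self.stale_replicas():
            self.replica_versions[resource] = self.version
            self.audit_trail.append(("sync", user, resource))


entity = ReplicatedEntity("/demo/report.pdf", ["sdsc", "taiwan"])
entity.write("sdsc", "author")
print(entity.stale_replicas())
entity.synchronize("curator")
print(entity.stale_replicas())
```

Every operation appends to the audit trail as a side effect, which mirrors the idea that administrative metadata is updated automatically rather than by a separate manual step.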
Federation
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection B
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection A
Access controls and consistency constraints on cross registration of digital entities
Consistency Constraints
• Master-slave data grids
  – The entries in the slave data grid are registered under control of the master data grid
• Peer-to-peer data grids
  – Curators register selected material into another data grid. Access controls are kept by the original data grid.
• Central repository
  – Remote data grids push material, user names, metadata into a central repository
• Deep archive
  – Digital entities and metadata are replicated into a data grid under curator control, but no other users are allowed access
Software Costs
• Storage Resource Broker clients are open source, distributed for free
• Storage Resource Broker server source code is distributed to academic institutions for free
  – Commercial companies should talk to the University of California, San Diego Technology Transfer Office for server source code
• SRB data grid uses commercially available systems for storing:
  – Metadata - Oracle, DB2, Sybase, Informix, PostgreSQL, mySQL
  – Files - Unix file systems, Linux, Mac OS X, Windows, binary large objects in databases, object ring buffers, HPSS, UniTree, ADSM, DMF, archival storage systems
• If you use Postgres or mySQL for your database, the cost is zero. However, large collections (millions of files) should use a commercial database
Hardware Costs
• SRB software can be installed on laptops (Windows, Linux, Mac OS X), servers (Sun, Linux, Irix, AIX, HP), and supercomputers (clusters)
  – Installation on a Mac laptop takes 15 minutes, including a Postgres database, metadata catalog, server, and clients
• Grid Bricks - commodity-based disk systems
  – Provide 2.5 GHz CPU, 1 Gbyte of memory, Gig-E network connection, 5 terabytes of disk, RAID controller, Linux operating system
  – Effective cost is $2000 per terabyte
  – Modular system that can be expanded by adding grid bricks. The SRB data grid manages global name spaces.
• If you use your own storage system, the cost is zero
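The grid brick figures lend themselves to a quick sizing estimate. The sketch below uses only the two numbers quoted above (5 terabytes per brick, $2000 per terabyte); the assumptions that bricks are bought whole and that the full 5 terabytes is usable are mine, and the function names are hypothetical.

```python
import math

BRICK_TB = 5          # disk per grid brick, from the slide
COST_PER_TB = 2000    # effective cost in dollars per terabyte, from the slide


def bricks_needed(collection_tb, copies=1):
    """Whole bricks required to hold `copies` copies of a collection."""
    return math.ceil(collection_tb * copies / BRICK_TB)


def hardware_cost(collection_tb, copies=1):
    """Dollar cost, assuming bricks can only be purchased whole."""
    return bricks_needed(collection_tb, copies) * BRICK_TB * COST_PER_TB


# The 305 TB total reported for the SDSC collections, single copy.
print(bricks_needed(305), hardware_cost(305))
```

Under these assumptions a single copy of 305 TB needs 61 bricks at about $610,000, and adding a second replicated copy simply doubles the brick count.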
Processing and Administrative Costs
• SRB data grid supports digital entities
  – Any type of file can be stored
  – Files can be registered from an existing storage system, preserving both the organization and names
• Administration costs
  – Data grid administrator - manage the data grid servers, track problems with access to storage systems, installation of additional servers, registration of users
  – Database administrator - manage the database in which the metadata is stored, perform backups, track software upgrades
  – Security, network, and storage system administrators - standard administrative support for storage systems and networks
Summary
• SRB provides collection management of data distributed across multiple storage systems
  – Support technology evolution - migration to new storage systems and new databases
  – Support federation - controlled sharing and publication of data between data grids
  – Support preservation - tracking of audit trails, checksums for validating integrity
  – Support all sizes of collections - thousands to hundreds of millions of records
[Figure: Data Grid Federation - zoneSRB (SRB architecture diagram)]
Application
Access interfaces: C, C++, Java libraries; Unix shell; Linux I/O; DLL / Python, Perl; Java, NT browsers; OAI, WSDL, WSRF; HTTP, DSpace, OpenDAP
Federation Management
Consistency & Metadata Management / Authorization, Authentication, Audit
Logical Name Space; Latency Management; Data Transport; Metadata Transport
Storage Repository Virtualization
  Archives: Tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS
  Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  File Systems: Unix, NT, Mac OSX
  ORB
Catalog Abstraction
  Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
For More Information
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html
PRDLA Collection at SDSC (2003)
• Collection size
  – 800 Gbytes
  – 14 million files
• Server capacity
  – Windows NT with 2 Tbytes disk
  – AIT2 tape library for backup, 1 Tbyte of tape
  – 3 web servers
• Access rate
  – Average 1 million web page accesses per month
  – Does not count Siku server
Data Grid Opportunities
• Provide a uniform interface to data collections that reside at member sites
  – Provides a way to extend PRDLA published holdings by incorporating new material
• Replicate collections between sites
  – Provides a way to protect against natural disasters
• Integrate file access with archive access
  – Provides a way to preserve collections
Data Grids
• Software systems that manage distributed data
  – Organize distributed data into a logical collection
• Provide global naming conventions
  – Location independent identifiers
• Support curation processes
  – Access controls for adding files
  – Browsing and discovery services
Accessing Data at Multiple Sites
Archive at SDSC
File System in Australia
File System in Taiwan
User Application
Each site has its own naming convention for files
A data grid provides a uniform way to name and access the files across the sites
Building Distributed Collection
Archive at SDSC
Data Grid
Common naming convention and set of attributes for describing digital entities
User Application
• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system
File System in Australia
File System in Taiwan
Collection Metadata Catalog
Logical file name space
(associate metadata attributes with the logical file name)
• Physical location of the file
• Name of the file on the storage system
• Size of the file
• Owner of the file
• Access controls on the file
(associate digital library attributes with the logical file name)
• Descriptive metadata about the file
• Dublin Core provenance information about the file
• Annotations on the file
Storage Systems Provide
• File name - naming convention for files
• Storage location - IP address of the storage system
• User name - persons who have access to the storage system
• File context (creation date,…) - state information about each file
• Access constraints - controls on access
Each storage repository uses a different set of naming conventions
Managing Distributed Data
(Replace naming conventions used by a storage repository with naming conventions managed by the data grid)
Storage Repository
• File name
• Storage location
• User name
• File context (creation date,…)
• Access constraints
Data Grid
• Logical file name space
• Logical resource name space
• Logical user name space
• Logical metadata context
• Control/consistency constraints
Accessing Multiple Types of Storage Systems
User Application
Archive at SDSC
Database in Australia
File System in Taiwan
Standard Data Access Operations
Common set of operations for interacting with every type of storage repository
User Application
• Remote operations
• Unix file system
• Latency management
• Procedures
• Transformations
• Third party transfer
• Filtering
• Queries
Archive at SDSC
Database in Australia
File System in Taiwan
Data Grid Applications
• Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information
• Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata
• Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms
Provide uniform interface to data collections that reside at member sites
• Install a Storage Resource Broker application level server on each storage system that holds data
• Register the data into the PRDLA data grid
  – Establishes a logical file name for each file
• Create a collection hierarchy to support browsing and discovery
  – Register PRDLA metadata for each file
  – The SRB data grid manages the metadata for the data grid and automatically updates information on the location of the file
• Provide web-based access to the collections
  – Other access mechanisms support bulk load operations
Replicate collections between sites
• Use data grid commands to replicate a collection onto a remote storage system
  – Information about the replicated files is kept in the metadata catalog
• Provides a way to support load balancing
  – Sites access data that is closer to them
• Provides a way to protect against a local natural disaster
  – Files can be retrieved from the remote site
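Replication for load balancing and disaster protection can be sketched as a replica catalog plus a selection rule. This is an illustration under stated assumptions, not the SRB replication command set: the function names, site names, and paths are hypothetical, and "closer" is modeled simply as the caller's ordered list of preferred sites.

```python
# Hypothetical replica catalog: logical name -> {site: physical path}.
replica_catalog = {}


def replicate(logical_name, site, physical_path):
    """Record that a copy of the file now exists at `site`."""
    replica_catalog.setdefault(logical_name, {})[site] = physical_path


def choose_replica(logical_name, preferred_sites, unavailable=()):
    """Pick the first preferred site that holds a copy and is reachable."""
    replicas = replica_catalog[logical_name]
    for site in preferred_sites:
        if site in replicas and site not in unavailable:
            return site, replicas[site]
    raise LookupError(f"no reachable replica of {logical_name}")


replicate("/demo/report.pdf", "sdsc", "/archive/0001")
replicate("/demo/report.pdf", "taiwan", "/fs/report.pdf")

# A user near Taiwan reads the local copy; if that site is lost, the
# same logical name resolves to the remote copy instead.
print(choose_replica("/demo/report.pdf", ["taiwan", "sdsc"]))
print(choose_replica("/demo/report.pdf", ["taiwan", "sdsc"], unavailable={"taiwan"}))
```

The same lookup serves both purposes: ordering by proximity gives load balancing, and skipping unreachable sites gives recovery after a local disaster.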
Integrate file access with archive access
• Can also replicate the metadata catalog between sites
  – Provides a way to manage long-term preservation, a deep archive
• Data grid provides synchronization mechanisms to update the metadata catalog
  – Can control execution of the synchronization mechanisms
• Data grid provides file validation mechanisms to verify file integrity (checksums)
  – Can verify a local copy against the checksums stored in the metadata catalog
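Verifying a local copy against a stored checksum can be sketched with the Python standard library. The slides do not say which checksum algorithm SRB uses, so SHA-256 here is an assumption, and the dictionary standing in for the metadata catalog and the logical name are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, catalog, logical_name):
    """Compare the local copy with the checksum recorded in the catalog."""
    return file_checksum(path) == catalog[logical_name]


with tempfile.TemporaryDirectory() as workdir:
    local_copy = os.path.join(workdir, "local.dat")
    with open(local_copy, "wb") as handle:
        handle.write(b"archived bytes")

    # Checksum recorded when the file was registered (toy catalog).
    catalog = {"/demo/local.dat": file_checksum(local_copy)}
    intact = verify(local_copy, catalog, "/demo/local.dat")

    # Simulate silent corruption of the local copy.
    with open(local_copy, "ab") as handle:
        handle.write(b"!")
    damaged = verify(local_copy, catalog, "/demo/local.dat")

print(intact, damaged)
```

When a check like this fails, a replica at another site (whose checksum still matches the catalog) is the natural source for repairing the damaged copy.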
PRDLA Data Grid
• Propose the formation of a data grid linking PRDLA sites
  – Support data sharing
  – Support integration of digital libraries
  – Support preservation environments
• Storage Resource Broker data grid is in production use in international projects
Data Grid Installations
• Australia - University of Queensland, APAC
• Japan - KEK (Tsukuba)
• Korea - KISTI, Korea Institute of Science and Technology Information
• Singapore - National University of Singapore
• Taiwan - National Taiwan University
• University of California - California Digital Library, UCSD