The GRID Adventures: SDSC's Storage Resource Broker
and Web Services in Digital Library Applications
Arcot Rajasekar, Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky
San Diego Supercomputer Center
University of California, San Diego
2 RCDL’02, Dubna, October 15-17 2002
Data and Knowledge Systems
Staff
• Reagan Moore
• Chaitan Baru
• Data Mining Lab (Tony Fountain)
• Advanced Query Processing Lab (Amarnath Gupta)
• Knowledge-Based Integration Lab (Bertram Ludäscher)
• Data Grid Lab (Arcot Rajasekar)
• Spatial Information Systems Lab (Ilya Zaslavsky)
+ 2-3 programmers in each lab, + graduate and undergraduate students
Now: connecting research with production databases and data grid solutions
Overview
• Intro
– SDSC and NPACI
• Part I: technologies
– What is Data Grid
– Data, Information, and Knowledge Infrastructures at SDSC/DICE
– SDSC Storage Resource Broker, with examples
– MIX (Mediation of Information Using XML), and Knowledge-Based Mediation
• Part II: case studies
– BIRN: the First Operational Data Grid
– Web Services Demos
– Persistent Archives at SDSC
• Summary
A Distributed National Laboratory for Computational Science and Engineering
1st Teraflops System for US Academia
• 1 TFLOPs IBM SP
– 144 8-processor compute nodes
– 12 2-processor service nodes
– 1,176 Power3 processors at 222 MHz
– Initially > 640 GB memory (4 GB/node), upgrade to > 1 TB later
– 6.8 TB switch-attached disk storage
• Largest SP with 8-way nodes
• High-performance access to HPSS
Bioinformatics Infrastructure for Large-Scale Analyses
• Next-generation tools for accessing, manipulating, and analyzing biological data
– Biology, Stanford University
– DICE, SDSC
• Analysis of Protein Data Bank, GenBank and other databases
• Accelerate key discoveries for health and medicine
• Supporting and leveraging new data grid projects, such as BIRN in biology
Part I: technologies
What is Data Grid
Data, Information, and Knowledge Infrastructures at SDSC/DICE
SDSC Storage Resource Broker
MIX (Mediation of Information Using XML), and Knowledge-Based Mediation
SRB
What are Data Grids?
• Power Grid Analogy
– Multiple power generators
– Complex transmission networks with switching
– Simple Usage Interface – plug and play
– Guaranteed Supply - Meeting of demands (peak and lull)
– Complex cost function
• More than one data provider
• Best movement of data across computer networks
• Seamless Access to Data with good ‘Finding Aids’
• Guarantee of Data Access
• Access Control, Quotas & Complex Usage Costing
Data Grids
Data Grid - linking multiple data collections
• Separate name spaces
• Separate schema
• Separate administration domains
• Heterogeneous database instances
Diagram labels: Database A, Database B, Data grid
The data grid is itself a collection that provides mechanisms to hide latency and manage semantics
Federated Digital Libraries
Virtual Data Grid - linking multiple data collections
• Ability to execute processes to recreate derived data
Diagram labels: Database A (Services), Database B (Services), Virtual Data Grid
The virtual data grid integrates data grid and digital library technology to manage processes
Why Data Grids: Data Handling Problems
• Large Datasets; Large Number of Datasets; Scaling
• Distributed, Heterogeneous Storage
• Virtualization & Transparency
• Collaboration, Access Control, Authentication, Security
• Replication, Coherency, Synchronization
• Fault Tolerance and Load Distribution
• Scheduling, Caching & Data Placements
• Data Migration over Time & Space
• Data/Collection Curation
• Uniform Name Space
• Handling Legacy Data and Data/Resource Evolution
• User-friendly Interfaces – foster collaborations
Why Data Grids: Metadata Problems
• Types of Metadata – Relational to XML to unstructured
• Standardized to User-defined Metadata
• Large Number of Attributes
• Large Size; Scaling
• Federation - integration over space
• Evolution - integration over time
• Evolution - integration over contexts
• Discovery and Search
• Presentation – user friendly
• Extraction and Maintenance
DAKS Data Management Hierarchy
• Model-Based Information Management
– Rule-based ontology mapping, conceptual-level mediation - CMIX
• Information Mediation
– Data federation across multiple libraries - MIX
• Digital Library
– Interoperable services for information discovery and presentation - SDLIP
• Data Collection
– Tools for managing data set collections on databases - MCAT
• Data Handling
– Systems for data retrieval from remote storage - SRB
• Persistent Archives
– Storage of data collections for 30+ years
SRB as a Solution
Diagram labels: Application; SRB Server; MCAT; HRM; Distributed Storage Resources (database systems, archival storage systems, file systems, ftp, http, …); DB2, Oracle, Illustra, ObjectStore; HPSS, ADSM, UniTree; UNIX, NTFS, HTTP, FTP
• The Storage Resource Broker is middleware
• It virtualizes resource access
• It mediates access to distributed heterogeneous resources
• It uses a MetaCATalog (MCAT) to facilitate the brokering
• It integrates data and metadata
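The brokering idea in the bullets above can be illustrated with a toy model. This is a sketch only and not the SRB or MCAT API: the class names `MetaCatalog` and `Broker`, the driver functions, and the logical path are all hypothetical. It shows how a catalog lookup lets a client use one logical name regardless of which storage system holds the bytes.

```python
# Toy model of resource brokering; NOT the real SRB/MCAT API.
# All names here (MetaCatalog, Broker, driver keys, paths) are hypothetical.

class MetaCatalog:
    """Maps a logical name to (resource type, physical location) replicas."""
    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, resource_type, physical_location):
        self._replicas.setdefault(logical_name, []).append(
            (resource_type, physical_location))

    def lookup(self, logical_name):
        return list(self._replicas.get(logical_name, []))


class Broker:
    """Hides storage heterogeneity behind a single read() call."""
    def __init__(self, catalog, drivers):
        self.catalog = catalog
        self.drivers = drivers  # resource type -> function(location) -> bytes

    def read(self, logical_name):
        for resource_type, location in self.catalog.lookup(logical_name):
            driver = self.drivers.get(resource_type)
            if driver is None:
                continue  # no driver for this resource type; try next replica
            try:
                return driver(location)
            except OSError:
                continue  # replica unavailable; fall through to the next one
        raise FileNotFoundError(logical_name)


# In-memory stand-ins for an archive and a file system.
archive = {"tape42/obj7": b"image bytes"}
filesystem = {}  # empty, so the file-system replica below is "offline"

def read_archive(location):
    return archive[location]

def read_filesystem(location):
    if location not in filesystem:
        raise OSError("replica offline")
    return filesystem[location]

catalog = MetaCatalog()
catalog.register("/home/demo/sky.fits", "unix", "/disk1/sky.fits")
catalog.register("/home/demo/sky.fits", "hpss", "tape42/obj7")

broker = Broker(catalog, {"unix": read_filesystem, "hpss": read_archive})
data = broker.read("/home/demo/sky.fits")  # falls back to the archive replica
```

The client never names a storage system; the catalog decides which replicas exist and the broker picks the first one that answers.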
SDSC Storage Resource Broker & Meta-data Catalog
Diagram labels:
• Application
• Access interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Web; Prolog Predicate
• SRB; MCAT; HRM
• Dublin Core; Resource, Mthd, User; User Defined; Application Meta-data
• Remote Proxies; DataCutter; Metadata Extraction
• Archives: HPSS, ADSM, UniTree, DMF
• Databases: DB2, Oracle, Sybase
• File Systems: Unix, NT, Mac OSX
SRB Space
Diagram: several Clients, a mesh of SRB servers, Data Repositories (DR), Digital Libraries (DL), and a Meta Catalog (MC)
DR - Data Repository; DL - Digital Library; MC - Meta Catalog
MySRB: Web-based Access to the SRB
• Browse in Hierarchical Collections
• Registration of (remote) Legacy Files & Directories
• Registration of SQL Objects
• Registration of URLs
• Data Movement Operations
– Ingest & Re-Ingest, Delete, Unlink
– Replicate, Copy, Move, S-Link
• Access Control Operations
– Read, Write, Own, Curate, Annotate, …
– Ticket-based Access
• Version Control Operations
– Read Lock, Write Lock, Unlock
– Check In, Check Out
Metadata Management in MySRB
• Types of Metadata
– System-level Metadata
  • Size, resource, owner, date, access control, …
– User-defined Metadata
  • for data & collections
  • <name,value,unit> triples
  • No limit on the number of metadata items
  • Support for Collection-level schemas
    – Comments, default values, drop-down lists
  • Support for Standardized Schemas
    – (e.g. Dublin Core)
– Annotations
  • Supports textual annotations
  • Annotator, date, context also registered
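A minimal sketch of how user-defined <name,value,unit> triples might be stored and searched. This is not MySRB code: the `Triple` type, the `find` helper, the object paths, and the attribute values are hypothetical and serve only to illustrate the data structure.

```python
# Hypothetical sketch of <name, value, unit> metadata triples; not MySRB code.
from collections import namedtuple

Triple = namedtuple("Triple", ["name", "value", "unit"])

# Each data object (keyed by a made-up logical path) carries any number of triples.
metadata = {
    "/demo/img001": [Triple("exposure", "7.8", "s"), Triple("band", "J", "")],
    "/demo/img002": [Triple("exposure", "1.3", "s"), Triple("band", "H", "")],
}

def find(metadata, name, predicate):
    """Return the objects that have a triple `name` whose value satisfies predicate."""
    return sorted(
        path for path, triples in metadata.items()
        if any(t.name == name and predicate(t.value) for t in triples)
    )

long_exposures = find(metadata, "exposure", lambda v: float(v) > 5.0)
```

Values are kept as strings and interpreted by the query, which mirrors the schema-free nature of triples: no attribute has to be declared before it is used.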
SRB Projects
• Digital Libraries
– UCB, Umich, UCSB, Stanford, CDL
– NSF NSDL - UCAR / DLESE
• NASA Information Power Grid
• DOE ASCI Data Visualization Corridor
• Astronomy
– National Virtual Observatory
– 2MASS Project (2 Micron All Sky Survey)
• Particle Physics
– Particle Physics Data Grid (DOE)
– GriPhyN
– SLAC Synchrotron Data Repository
• Medicine
– Visible Embryo (NLM)
• Earth Systems Sciences
– ESIPS
– LTER
• Persistent Archives
– NARA
– LOC
• Neuro Science & Molecular Science
– TeleScience, Brain Images, BIRN
– JCSG (SSRL/SLAC), AfCS, …
Large Data Project Examples
• Astronomy:
– National Virtual Observatory
  • Integrate 18 sky surveys (ITR prop)
– 2MASS Project (2 Micron All Sky Survey)
  • 10 TB; 5 million files
  • Co-locate Images for Spatial Access
  • Data Mining across entire collection
  • Replicate to CalTech HPSS
• Particle Physics:
– Particle Physics Data Grid (DOE)
– GriPhyN (NSF ITR proj)
  • CERN LHC 1 PB/yr (1 billion obj)
  • Multi-Lab integration
– SLAC Synchrotron Data Repository
National Virtual Observatory Data Grid
Diagram labels:
• Numbered layers: 1. Portals and Workbenches; 2. Knowledge & Resource Management; 4. Grid (Security, Caching, Replication, Backup, Scheduling); 7. Derived Collections
• Compute Resources; Catalogs; Data Archives
• Information Discovery; Metadata delivery; Data Discovery; Data Delivery
• Catalog Mediator; Data mediator
• Bulk Data Analysis; Catalog Analysis; Metadata View; Data View
• Standard Metadata format, Data model, Wire format; Catalog/Image Specific Access; Standard APIs and Protocols; Concept space
Digital Sky Data Ingestion
Diagram labels: input tapes from telescopes; IPAC CALTECH: star catalog, Informix, SUN; SDSC: SRB, SUN E10K, Data Cache (800 GB), HPSS (10 TB)
Digital Sky Data Ingestion
• The input data was on tapes in a random (temporal…) order.
• Ingestion took nearly 1.5 years, running almost continuously (24×7×365) in 4 parallel streams (4 MB/sec per stream).
• Total: 10+ TB, 5 million 2 MB images in 147,000 containers.
• SRB performed a spatial sort on data insertion (scientists view/analyze data by neighborhood). The disk cache (800 GB) for the HPSS containers was utilized.
• Ingestion speed was limited by input tape reads
– Only two tapes per day could be read
• The workflow incorporated persistence features to deal with network outages and other failures.
• The C API was used for fine-grained control and to manipulate and insert metadata into the Informix catalog at IPAC Caltech.
– http://www.ipac.caltech.edu/2mass
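The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; decimal units (1 TB = 10**6 MB) and the reading of 4 MB/sec as a sustained per-stream rate are assumptions.

```python
# Back-of-the-envelope check of the ingestion figures quoted on the slide.
# Assumptions: decimal units (1 TB = 10**6 MB) and a sustained stream rate.

images = 5_000_000
image_mb = 2
containers = 147_000
streams = 4
mb_per_sec_per_stream = 4

total_mb = images * image_mb                              # 10,000,000 MB
total_tb = total_mb / 10**6                               # 10.0 TB
images_per_container = images / containers                # about 34
aggregate_mb_per_sec = streams * mb_per_sec_per_stream    # 16 MB/sec
transfer_days = total_mb / aggregate_mb_per_sec / 86_400  # about 7.2 days
```

Under these assumptions the transfer alone would need roughly a week, so the 1.5-year duration is consistent with the statement that tape reads, not the streams, limited ingestion.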
DigSky Conclusion
• SRB can handle a large number of files
• Metadata access delay is still less than ½ sec
• Replication of large collections
• Single command for geographical replication
• On-the-fly sorting (out-of-tape sorting)
• Availability of data otherwise not possible
• Near-line access to 5 million files (10 TB)
• Successfully used in web access & large-scale analysis (daily)
Demonstration
• Go to MySRB
• For Additional Information:
http://www.npaci.edu/dice/
[email protected]
MIX: Mediation of Information using XML
Mediation of Information using XML (MIX)
Diagram labels:
• Data Source (e.g. home ads); Legacy Source; Native XML Database
• Wrapper (two instances); XML Query; XML
• XML View Document(s) (three instances)
• Export: Schema & Metadata (DTD, RDF, …); Capabilities
A Typical Mediation Scenario
Diagram labels:
• User Interface; Query; Results
• Mediator (integrated views over heterogeneous sources); Query “fragment” (sent to the wrappers)
• Wrapper (three instances): convert incoming query and outgoing data
• Sources: SQL Database, GIS, HTML
The Home Buyer Scenario
Diagram labels:
• Web Client; XMAS Query; Results (XML)
• MIXm Mediator
• “Homes” mediator: Home info (real estate)
• “Neighborhood” mediator: N’hood info (demographics); Community info (name, ZIP); Crime info (ZIP, stats)
• “Schools” mediator: Schools info (address, size); School district info (scores, spending, ZIP); National test scores
• Data sources: www.realtor.com, www.homeadvisor.msn.com, www.sandag.cog.ca.us, www.sannet.gov, www.asd.com
Home Buyer GUI
An XML Query (XMAS)
Query:
<folder> $C $S for $S</folder> for $C
$C:<*.condo> <address zip=$Z/> </condo> AT www.condo.com AND
$S:<*.school type=elementary> <address zip=$Z/> </school> AT schools.org
Source fragment:
... <RealEstateAgent> <name>J. Smith</name> <condos> <condo> <address ... zip=92037> <price>$170k OBO</price> <bedrooms>2</bedrooms> </condo> </condos> </RealEstateAgent>
Answer fragment:
<condosAndSchools> <folder> <condo> <address ... zip=92037> <price>$170k OBO</price> <bedrooms>2</bedrooms> </condo> <school> <name>La Jolla High</name> <address … zip=92037> </school> <school>…</school> </folder>
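What the query above computes can be sketched in plain Python: for every condo, build one folder holding that condo and all schools that share its ZIP code. This is an analogue for illustration, not XMAS syntax and not the MIX engine. The records are made up except for the values shown on the slide, and the join reading (a condo appears only if at least one school matches) is an assumption.

```python
# Plain-Python analogue of the XMAS query; not XMAS syntax or the MIX engine.
# The records are made up apart from the values shown on the slide.
from collections import defaultdict

condos = [
    {"zip": "92037", "price": "$170k OBO", "bedrooms": 2},
    {"zip": "92093", "price": "$210k", "bedrooms": 3},      # hypothetical
]
schools = [
    {"name": "La Jolla High", "zip": "92037"},
    {"name": "Hypothetical Elementary", "zip": "92037"},    # hypothetical
]

# Index schools by ZIP so each condo finds its matches in one lookup.
schools_by_zip = defaultdict(list)
for school in schools:
    schools_by_zip[school["zip"]].append(school)

# One folder per condo ("for $C"), holding every matching school ("for $S").
# Assumed join semantics: a condo with no school in its ZIP yields no folder.
folders = [
    {"condo": condo, "schools": schools_by_zip[condo["zip"]]}
    for condo in condos
    if condo["zip"] in schools_by_zip
]
```

The shared variable `$Z` in the query plays the role of the dictionary key here: it is what ties a condo at one source to schools at another.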
Home Buyer GUI (Answers)
Screenshot labels: Generated XMAS Query; XML Answer Document
Our Research
• In what query language does the user pose a query?
• How does the query engine of the mediator rewrite the query?
• How does the mediator combine/restructure/post-process partial results?
• What data model and query transformation scheme should the wrappers use for different source types?
For details: http://www.npaci.edu/DICE/MIX
Diagram labels: User Query (XMAS); XML; Mediator; wrappers W1, W2, W3; sources S1, S2, S3
New MIX Challenges from Scientific Applications
• Complex Data
– SDSC’s Scientific Data Applications (current/planned, e.g. Neurosciences: NCMIR, NIH BIRN; Earth sciences: GEON, GeoGrid, ...) show that syntactic/structural integration is insufficient for ...
• Complex Multiple-World Mediation Problems:
– complex, disjoint, seemingly unrelated data
– “hidden semantics” in complex, indirect relationships
=> Semantic (aka Model/Knowledge-Based) Mediation
– lift mediation to the level of conceptual models (CMs)
– use domain experts’ knowledge formalized as rules over CMs
=> Specialized Extensions
– temporal, geospatial, statistical, DQ/accuracy... operations
=> Extend Mediation Scope and Power via Deductive Rules
INFORMATION MEDIATION WITH DOMAIN MAPS
An Unresolved Challenge
How do nerve cells change as we learn and remember?
A multi-resolution study of the rat hippocampus at Boston University
Dendritic spine morphology and its variations
Reconstructions from the Synapse Lab, Boston University
density = #spines/length
Hypothesis
• Distribution of spines changes with learning
• Each spine type performs a different task in information transmission
Observations
• Spine density, size, shape and PSD vary with maturity
• Spine neck geometry controls peak Calcium amount
• Calcium flow parameters depend on the different subclasses of spines
Next Questions
• Does anyone else have corroborative evidence for these observations?
• Are these observations true in other comparable parts of the brain?
• Is this consistent with the distribution of Calcium-binding proteins?
Example for Formalizing Domain Knowledge: Domain Map for SYNAPSE and NCMIR
A domain map comprises
• Description Logic facts ...
– concepts ("classes")
– roles ("associations")
• derived properties ...
• ... expressed as logic rules
– (e.g. F-logic)
Domain expert knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
Diagram labels: domain map; equivalent Description Logic facts
Extended Mediator Architecture for Semantic Mediation
Diagram labels:
• USER/Client; CM Queries & Results (exchanged in XML)
• Mediator Engine: FL rule proc.; LP rule proc.; Graph proc.; XSB Engine
• CM (Integrated View); Integrated View Definition (IVD); Domain Map (DM)
• GCM CM S1; GCM CM S2; GCM CM S3
• CM Plug-In; Logic API (capabilities)
• Sources S1, S2, S3, each with a CM-Wrapper and an XML-Wrapper
Comparison & Summary: Semantic Mediation
 | (Complex) Single World / Simple Multiple World | Complex Multiple World
Integration target | global schema (common / shared) | 1..n shared domain maps
Example scenario | suppliers’ catalogs / home buyer | complex scientific data (neuroscience, geoscience, …)
Schema level overlap | large / small | none … small
Instance level overlap | large / none | none
Source correlation | direct, instance / schema level | indirect, conceptual (knowledge) level
Techniques | schema transformations, schema integration; “structural” integration | domain maps, formalized domain knowledge (“semantic bridges”) => model-based (“semantic”) mediation
Integration languages, expressiveness | relational, semistructured, queries & transformations (e.g., SQL, XQuery, XSLT) | conceptual (description logics), object-oriented, deductive features (e.g., GCM, F-logic)
Integrators | DB expert | domain expert + KRDB expert
Part II: case studies
BIRN
Web Services
Persistent Archives
NIH is Funding a Brain Imaging Federated Repository
Biomedical Informatics Research Network (BIRN)
National Partnership for Advanced Computational Infrastructure
Part of the UCSD CRBS Center for Research on Biological Structure
NIH Plans to Expand to Other Organs and Many Laboratories
Infrastructure for Sharing Neuroscience Data
Image labels: CCB, Montana SU; Surface atlas, Van Essen Lab; NCMIR, UCSD; stereotaxic atlas, LONI; MCell, CNL, Salk
SOURCES:
• NCMIR, U.C. San Diego
• Caltech Neuroimaging
• Center for Imaging Science, Johns Hopkins
• Center for Computational Biology, Montana State
• Laboratory of Neuro Imaging (LONI), UCLA
• Computational Neurobiology Laboratory, Salk Inst.
• Van Essen Laboratory, Washington University
• …
Data Management Infrastructure (DAKS/NPACI)
• MIX Mediation in XML
• MCAT information discovery
• SRB data handling
• HPSS storage
• ...
Knowledge-based GRID infrastructure: ? ? ? ?
Data Management Infrastructure (“Data Grid”): GTOMO, Telemicroscopy, Globus, SRB/MCAT, HPSS
The Need for Semantic Integration
Example query: What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
Diagram labels:
• Wrapped sources: protein localization; morphometry; neurotransmission; Web (CaBP, Expasy)
• ??? Mediator ???; ??? Integrated View ???; ??? Integrated View Definition ???
• Data, relationships, constraints are modeled (CMs)
• Cross-source relationships are modeled
• Semantic (knowledge-based) mediation services
• Cross-source queries
Hidden Semantics: Protein Localization
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red"><name>RyR</>….</protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>
Image labels: Molecular layer of Cerebellar Cortex; Purkinje Cell layer of Cerebellar Cortex; Fragment of dendrite
Mediation Services: Source Registration (System Issues)
• Source Data Type: table, tree, file
• Access Protocol: SRB, HTTP, JDBC
• Query Capability: SQL, XMLQL, DOOD, ARC, Selections, SPJ
• Result Delivery: Tuple-at-a-time, Set-at-a-time, Stream, Binary for Viewer
Mediation Services: Source Registration (Semantics Issues)
• Domain Map Registration
– provide concept space/ontology
  • … as a private object (“myANATOM”)
  • … merge with others (give “semantic bridges”)
  • … and check for conflicts
• Conceptual Model Registration
– schema: classes, associations, attributes
– domain constraints
– “put data into context” (linking data to the domain map)
Mediation Services: Integrated View Definition
DERIVE
  protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
FROM
  I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
      anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}],   % from PROLAB
  NAE:neuro_anatomic_entity[name->Anatom;                                   % from ANATOM
      located_in->>{Brain_region}],
  AS..segments..features[name->Feature_name; value->Value].
• provided by the domain expert and mediation engineer
• declarative language (here: Frame-logic)
Mediation Services: Semantic Annotation Tools
line drawing annotation (spatial) database for mediation
Part II: case studies
Web Services
Web Services Demo 1
Query: Find school districts in San Diego where computer ownership rates among residents are over 80%
Diagram labels:
• Clients: AxioMap, Polexis
• Spatial Mediator (Java Servlet)
• XML Mediator (Enosys); XML query (XCQL); XML
• Two source stacks, each: Oracle DBMS, Java Servlets, Web Server, SOAP, WSDL
• Sociology Workbench
• Data: San Diego Digital Divide Survey; Boundaries of municipalities and school districts
Web Services Demo 2
Web spatial source, EPA data
ArcObjects spatial service
Diagram labels:
• Spatial Mediator (Java Servlet)
• Coordinate Conversion Service: ESRI ArcObjects, Web Server, SOAP, WSDL, XML
• EPA Envirofacts Website (XML Wrapper)
• Local Pollution Data (XML Wrapper)
Web Services Demo 3
GIS source, UCR/FBI data
WSDL: for spatial analysis, survey data analysis, DBMS query
Process flow across Web services
Diagram labels:
• Counties crossed by an interstate (Spatial Query, ArcIMS/ArcObjects)
• Counties with decrease in victims of firearms over … %, 1993-99 (Victim data, SWB)
• Counties with decrease in homicide rates over … %, 1993-99 (UCR summaries, Oracle)
• WSDL (two instances)
Part II: case studies
Persistent Archives
Persistent Archives
• NARA project
• Store & Recover Data after 400 years
• 5 million emails
• 33 million web pages
• 90 million personnel records
Persistent Archives
• Challenges: each of the software and hardware systems may become obsolete
– the storage media may degrade
– the storage system may become obsolete
– the database backups may become obsolete, with no way to recover the collection (structure)
– the digital object formats may become obsolete, with no helper application that can read them
• Persistent archive is a migration mechanism
– support for automatic migration to new technology; automatic ingestion, management, access, catalog discovery
• Infrastructure independence
– Non-proprietary formatting -- Collection management -- Data set access -- Authentication -- Presentation
• Persistent archive is an interoperability system
– XML as a (meta-) information markup language
Persistent Archive
Persistent archive
• Describe archived data as collections
• Describe processes used to create collections
• Manage evolution of technology
Diagram labels: Database A (today), Database A (tomorrow), Virtual Data Grid
The persistent archive is itself a virtual data grid that provides mechanisms to manage migration to new technology
Information Hierarchy (Simplest Definitions)
• Data
– digital object, i.e., the object representation as a bit stream
• Information
– any tagged data, where tags are treated as information attributes
– attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
• Knowledge
– higher-order concepts and relationships between attributes
– relationships can be procedural, temporal, structural, spatial, functional, ... and described in a Logic formalism (semantic networks, description logics, conceptual graphs, ...) which is often rule-based (e.g. Datalog, Frame-Logic)
What Types of Interoperability are Needed?
• Data management (digital objects)
– ability to work with multiple types of storage systems, across separate administration domains
• Information management (attributes)
– ability to define a collection independent of database choice
– ability to migrate collection onto new databases
• Knowledge management (relationships)
– ability to manage relationships and high-level domain concepts
– ability to map concepts to collection attributes
From XML-Based to Knowledge-Based Archives
• Collection-based archival with XML: save data "as is" plus...
– ... separate content from presentation
– ... tag your data (take a lift in the info hierarchy)
– ... use a self-describing, semistructured data format (XML)
• Knowledge-based archival: now add ...
– ... conceptual level information
– ... integrity constraints
– ... explanations/derivation rules:
• archiving only results y=f(x) vs. archiving the rules/function "f" (e.g. f = “the Florida procedure”...)
=> employ knowledge representation languages
Knowledge-Based Persistent Archive
Diagram (rows: Knowledge, Information, Data; columns: Ingest Services, Management, Access Services):
• Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse
• Information: Attributes, Semantics; Information Repository; Attribute-based Query
• Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query
• Interface labels: MCAT/HDF; Grids; XML DTD; SDLIP; XTM DTD; Rules - KQL
• (Topic Maps / Model-based Access)
• (Data Handling System - SRB / FTP / HTTP)
Knowledge-Based Archival: Senate Example
Data provider says: “Please archive all records of legislative activities of the 106th senate!”
Integrity constraints, e.g.:
(1) {senators_with_file} = UNION (sponsor, cosponsors, submitted_by)
(2) {senators} = {sponsors} = {co-sponsors}
Violation:
– the rhs is a SUPERSET of the lhs!
Exceptions:
– (Chafee, John), (Gramm, Phil), (Miller, Zell)
(Possible) Explanations:
– senators who joined (Zell), passed away (Chafee), were forgotten (Gramm)!?
Checking ICs:
IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))
IF condition THEN action
Action = LOG, WARN, ABORT, ...
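The rule "IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))" can be sketched in Python as follows. The record layout, the roster, and the sample bills are hypothetical. Checking co-sponsors as well as sponsors is an extension beyond the rule as written, and only the LOG and ABORT actions are modelled.

```python
# Sketch of the IF-condition-THEN-action rule from the slide, in Python.
# The record layout and the sample rows are hypothetical.

senators = {"Smith", "Jones"}                 # hypothetical roster
bills = [
    {"id": "S.1", "sponsor": "Smith", "cosponsors": ["Jones"]},
    {"id": "S.2", "sponsor": "Chafee", "cosponsors": ["Smith", "Miller"]},
]

def check_sponsors(bills, senators, action="LOG"):
    """Apply: IF sponsor(X), not senator(X) THEN ADD(exception_log, ...)."""
    exception_log = []
    for bill in bills:
        # Extension over the slide's rule: co-sponsors are checked too.
        for name in [bill["sponsor"], *bill["cosponsors"]]:
            if name not in senators:
                if action == "ABORT":
                    raise ValueError(f"missing senator info: {name}")
                exception_log.append(("missing_senator_info", name, bill["id"]))
    return exception_log

log = check_sponsors(bills, senators)
```

Archiving the rule alongside the data means a later reader can see not only that three names were exceptional but also which constraint made them so.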
NARA Herbicides Collection: Introduction
The Herbicides Collection - input
6507213207565 260404040 040000{0000D0000000{048{ {0000000{0000000{0000000{0000000{6507243207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{6507253207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{6507263207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{6507273207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{6507283207565 260505050 050000{0000D0000000{060{ {0000000{0000000{0000000{0000000{6507293207565 260404040 040000{0000D0000000{048{ {0000000{0000000{0000000{0000000{6508022022365 060202020 010000{0000C0000000{012{ {0000000{0000000{0000000{0000000{1A
AS890255 000{000{6508022022365 1B
AS940140 000{000{6508042022365 060202020 006000{0000C0000000{007B {0000000{0000000{0000000{0000000{1A
AS925205 000{000{6508042022365 1B
AS970065 000{000{6508062022365 060202020 004000{0000C0000000{004H {0000000{0000000{0000000{0000000{1A
BS290320 000{000{6508062022365 1B
BS275298 000{000{6508073207565 260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{1A
YT080110 000{000{6508073207565 1B
YT110060 000{000{6508113207565 260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{6508123207565 260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{6508151022465 020202020 008000{0000C0000000{009F {0000000{0000000{0000000{0000000{1A
YD350155 000{000{6508151022465 1B
YD450150
From EBCDIC tapes:
The Herbicides Collection - preservation
Converted to XML:
<YEAR><yearnum>66</yearnum><MONTH><monthnum>01</monthnum><DATE><datenum>01</datenum><MISSION><num>206866</num>
<RUN><code>A</code><ctz>3</ctz><multi></multi><prov>27</prov>
<aircrafts><scheduled>02</scheduled><airborne>02</airborne><productive>02</productive>
</aircrafts><agent>O</agent><gal>02000</gal><hits>0</hits><aborts><maintenance>0</maintenance><weather>0</weather><battle_damage>0</battle_damage><other>0</other></aborts><type>D</type><area>024</area><rsult></rsult><UTM>
<utmid>1A</utmid><utm_coor>YS240780</utm_coor>
</UTM><UTM>
<utmid>1B</utmid><utm_coor>YS290630</utm_coor>
</UTM></RUN><RUN><code>B</code><ctz>3</ctz><multi></multi><prov>27</prov>
<aircrafts><scheduled>02</scheduled><airborne>02</airborne><productive>02</productive>
</aircrafts><agent>O</agent><gal>02000</gal><hits>0A</hits><aborts><maintenance>0</maintenance><weather>0</weather><battle_damage>0</battle_damage><other>0</other></aborts><type>D</type><area>024</area><rsult></rsult>
MAPPING
From Geography Markup to Rendering
<?xml version="1.0" encoding="iso-8859-1"?>
<rs>
<r><name>Horton Plaza</name><URL></URL><labelpos>41.46,77.51</labelpos><c>5076,1540 4986,1540 4895,1539 4803,1539 4715,1539 4622,1539 4534,1538 4534,1641 4534,1745 4534,1856 4622,1856 4711,1856 4800,1856 4893,1855 4984,1855 5075,1854 5075,1749 5076,1646 </c></r>
<r><name>Gaslamp</name><URL></URL><labelpos>44.60,83.00</labelpos><c>5162,1013 5084,1057 5083,1116 5081,1222 5079,1326 5079,1433 5076,1540 5076,1646 5075,1749 5075,1854 5167,1854 5257,1855 5257,1750 5259,1647 5260,1541 5262,1434 5262,1328 5263,1222 5263,1013 </c></r>
. . .
XML encoding of geographic features (such as GML)
<?xml version="1.0"?><!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3c.org/2000/svg10-20000303-stylable" [<!ENTITY base "fill:#ff0000;stroke:#000000;stroke-width:1;">]><svg width="100%" height="100%" viewBox="0 0 11590 7547" style="shape-rendering:geometricPrecision; text-rendering:optimizeLegibility"><g id="karta" transform="scale(1, -1) translate(0, -7547)"><g id="base" style="&base;"><path id="a1" title="Horton Plaza" style="fill:#00ff00;" d="M5076,1540L 4986,1540 4895,1539 4803,1539 4715,1539 4622,1539 4534,1538 4534,1641 4534,1745 4534,1856 4622,1856 4711,1856 4800,1856 4893,1855 4984,1855 5075,1854 5075,1749 5076,1646 5076,1540z"/><path id="a2" title="Gaslamp" style="fill:#ffff00;" d="M5162,1013L 5084,1057 5083,1116 5081,1222 5079,1326 5079,1433 5076,1540 5076,1646 5075,1749 5075,1854 5167,1854 5257,1855 5257,1750 5259,1647 5260,1541 5262,1434 5262,1328 5263,1222 5263,1013 5162,1013z"/></g></g></svg>
VML or SVG or…
SVG
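The mapping from the region markup to the SVG paths shown above can be sketched in Python. The element names (`rs`, `r`, `name`, `c`) and the path format (`M…L … z`, closing on the first point) are taken from the slide. The helper names, the single fill colour, and the small test polygon are assumptions, and the sketch expects every region to carry at least one coordinate pair.

```python
# Sketch: turn the region markup shown above into SVG <path> elements.
# Element names (rs, r, name, c) come from the slide; the fill colour, the
# helper names and the sample polygon are assumptions.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

regions_xml = """<rs>
  <r><name>Demo Block</name><URL></URL><labelpos>1,1</labelpos>
     <c>0,0 10,0 10,10 0,10 </c></r>
</rs>"""

def path_data(coords_text):
    """'x1,y1 x2,y2 ...' -> 'Mx1,y1L x2,y2 ... x1,y1z' (closed polygon)."""
    points = coords_text.split()
    return "M" + points[0] + "L " + " ".join(points[1:] + [points[0]]) + "z"

def to_svg_paths(xml_text, fill="#00ff00"):
    """Return one <path> string per <r> region, numbered a1, a2, ..."""
    paths = []
    for i, region in enumerate(ET.fromstring(xml_text).iter("r"), start=1):
        name = region.findtext("name", default="")
        d = path_data(region.findtext("c", default=""))
        paths.append(
            f'<path id="a{i}" title={quoteattr(name)} '
            f'style="fill:{fill};" d="{d}"/>'
        )
    return paths

svg_paths = to_svg_paths(regions_xml)
```

Because the archived form keeps only names and coordinates, the same markup could equally be rendered to VML or another format by swapping `to_svg_paths` for a different writer.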
XML Map Viewer for the Herbicides Collection
Conclusion
• Necessity & Requirements of a Virtual Data Grid
• SRB – a proven solution
– It is existing middleware
– Field-tested in multiple projects
– Proven Scalability: users, data & resources
• New element of data grid: knowledge management
• Working solutions
– BIRN: the first real data grid complete with knowledge management and cross-ontology bridges
– Web services, to expose grid functionality in a uniform way
– Archiving data, information and knowledge as a grid activity
• www.npaci.edu/DICE/