A Software Architecture for Highly Data-Intensive Systems Chris A. Mattmann [email protected] USC Center for Software Engineering Annual Research Review March 2004 Special thanks to Dan Crichton, Steve Hughes, and Sean Kelly for some of the slides!


Page 1: A Software Architecture for Highly Data-Intensive Systems

A Software Architecture for Highly Data-Intensive Systems

Chris A. Mattmann [email protected]
USC Center for Software Engineering
Annual Research Review
March 2004

Special thanks to Dan Crichton, Steve Hughes, and Sean Kelly for some of the slides!

Page 2: A Software Architecture for Highly Data-Intensive Systems

Overview

- Motivation
- Problem Statement
- OODT: A Software Architecture and Middleware for Data-Intensive Systems
- Evaluation: Science Problems
  - Planetary Science
  - Cancer Research
- Conclusion

Page 3: A Software Architecture for Highly Data-Intensive Systems

Motivation

Page 4: A Software Architecture for Highly Data-Intensive Systems

Problem Statement

- Information Integration in Data-Intensive Systems
  - Needed to support data access, distribution, processing and retrieval across existing heterogeneous data sources
    - NASA’s Planetary Data System
    - NCI’s Early Detection Research Network
- Software and techniques exist to perform information integration, but...
  - No software re-use
  - No design methods to start from
  - No mapping of integration techniques to software components, interaction mechanisms, or arrangements of components
- Lack of re-use and software standards for information integration in data-intensive systems has forced systems to be “built from scratch”
  - Little or no interoperability with other software systems
  - Programmer almost always “in the loop”
  - A new GDS (ground data system) proposal accompanies most new NASA mission proposals

Page 5: A Software Architecture for Highly Data-Intensive Systems

Our Approach

- A Software Architecture for Data-Intensive Systems
  - Data Architecture
    - Data Dictionary
    - Resource Profiles
  - Software Architecture
    - Components: Product Servers, Profile Servers, Query Servers
    - Connector: Messaging Layer
    - Configurations of Product/Profile/Query Servers
- ...and a middleware implementation based on the software architecture
  - Middleware leverages existing distributed object middleware frameworks such as CORBA and RMI
    - We’re currently working on a SOAP version
  - Built and maintained at the Jet Propulsion Laboratory
    - Yes, the Mars folks
- Architecture + middleware = OODT (Object Oriented Data Technology)
  - Middleware being developed at JPL
  - Architecture being formalized at USC-CSE

Page 6: A Software Architecture for Highly Data-Intensive Systems

Data Dictionary

- Common Data Model containing
  - Data elements that the user is interested in querying on
  - Data elements that the user would like to retrieve
- Challenge: integrate the linked-in data sources by exploiting the Data Dictionary structure
  - Map the common data model to the data source models across the data-intensive system
  - Use a common data element structure
    - ISO-11179 Specification and Standardization of Data Elements
- Handles the integration of data models across the system, but the software interfaces still need to be integrated
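
As a rough illustration of mapping a common data model onto data source models, the sketch below rewrites a query expressed in common element names into the field names of one source. Only TARGET_NAME and MISSION_NAME appear in this talk; the source names, the local field names, and the `translate_query` helper are hypothetical and are not taken from OODT.

```python
# Hypothetical data dictionary: common element name -> local field name per source.
# The source names and local field names are made up for illustration.
DATA_DICTIONARY = {
    "TARGET_NAME": {"imaging_node": "target", "atmospheres_node": "TGT_NM"},
    "MISSION_NAME": {"imaging_node": "mission", "atmospheres_node": "MSN_NM"},
}


def translate_query(query, source):
    """Rewrite a {common_element: value} query using one source's field names."""
    translated = {}
    for element, value in query.items():
        local_names = DATA_DICTIONARY.get(element, {})
        if source not in local_names:
            raise KeyError(f"{element!r} is not mapped for source {source!r}")
        translated[local_names[source]] = value
    return translated


# One query in the common model, translated for two differently modelled sources.
common_query = {"TARGET_NAME": "MARS", "MISSION_NAME": "VIKING"}
imaging_query = translate_query(common_query, "imaging_node")
atmospheres_query = translate_query(common_query, "atmospheres_node")
```

The user only ever sees the common element names; each source keeps its own model, and the dictionary is the single place where the two are reconciled.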

Page 7: A Software Architecture for Highly Data-Intensive Systems

Resource Profiles

- Provide mechanisms for describing data systems, data products, etc., including
  - Common data attributes using Dublin Core data elements (e.g. Title, Creator, Subject) to describe electronic resources
  - Mechanisms for describing where the data is located and how to access it
  - Domain data elements that are useful for describing the product (e.g. TARGET_NAME, MISSION_NAME, INSTRUMENT_NAME)
- Enable “search and retrieval” of distributed data products
  - A search sent to a Profile Server yields information about the characteristics of distributed resources (e.g. descriptive information about the product, access information)
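
The three kinds of information a profile carries can be pictured as a small record. This is only a sketch: the `ResourceProfile` class, its field names, and the example URN are assumptions made for illustration and do not reflect the actual OODT profile schema.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceProfile:
    """Hypothetical in-memory shape of a resource profile (not the OODT schema)."""

    # Common attributes, keyed by Dublin Core element names.
    resource_attributes: dict
    # Where the data lives and how to reach it.
    location: str
    # Domain elements describing the product: element name -> list of values.
    profile_elements: dict = field(default_factory=dict)

    def describes(self, element, value):
        """Return True if the profile lists `value` for the domain `element`."""
        return value in self.profile_elements.get(element, [])


profile = ResourceProfile(
    resource_attributes={"Title": "Example image catalog", "Subject": "imaging"},
    location="urn:example:product-server",  # made-up URN for illustration
    profile_elements={"TARGET_NAME": ["MARS"], "MISSION_NAME": ["VIKING"]},
)
```

A Profile Server holding records like this can answer "who might have data about MARS?" without touching the data itself.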

Page 8: A Software Architecture for Highly Data-Intensive Systems

Resource Profiles Example

Query: “country = US and windspeed > 120”

<profile>…
  <resAttributes>…
    <resLocation>urn:eda:rmi:Western…
  <profileElement>
    <elemName>country</elemName>…
    <elemValue>US</elemValue>…
  <profileElement>
    <elemName>state</elemName>…
    <elemValue>WA</elemValue>
    <elemValue>CA</elemValue>…
  <profileElement>
    <elemName>windspeed</elemName>…
    <elemMinValue>3</elemMinValue>
    <elemMaxValue>146</elemMaxValue>…

<profile>…
  <resAttributes>…
    <resLocation>urn:eda:rmi:Southern…
  <profileElement>
    <elemName>country</elemName>…
    <elemValue>US</elemValue>…
  <profileElement>
    <elemName>state</elemName>…
    <elemValue>LA</elemValue>
    <elemValue>TX</elemValue>…
  <profileElement>
    <elemName>windspeed</elemName>…
    <elemMinValue>1</elemMinValue>
    <elemMaxValue>89</elemMaxValue>…

Matches! Only the Western profile can satisfy the query: both profiles list country US, but the Western windspeed range (3 to 146) extends above 120, while the Southern range (1 to 89) does not.
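
The matching in this example can be sketched as a range check over profile elements. The dictionary layout and the `could_match` and `matching_profiles` helpers are assumptions for illustration, not OODT code, and since the resource locations are elided on the slide the profiles are identified here only by their region label.

```python
# The two profiles from the slide, reduced to plain dictionaries (assumed layout).
PROFILES = [
    {
        "name": "Western",
        "country": {"values": ["US"]},
        "state": {"values": ["WA", "CA"]},
        "windspeed": {"min": 3, "max": 146},
    },
    {
        "name": "Southern",
        "country": {"values": ["US"]},
        "state": {"values": ["LA", "TX"]},
        "windspeed": {"min": 1, "max": 89},
    },
]


def could_match(profile, element, op, value):
    """Return True if the described resource might hold data satisfying the predicate."""
    spec = profile.get(element)
    if spec is None:
        return False
    if "values" in spec:  # element described by an enumerated value list
        if op == "=":
            return value in spec["values"]
        if op == ">":
            return any(v > value for v in spec["values"])
    else:  # element described by a min/max range
        if op == "=":
            return spec["min"] <= value <= spec["max"]
        if op == ">":
            return spec["max"] > value
    raise ValueError(f"unsupported operator: {op}")


def matching_profiles(profiles, predicates):
    """Names of profiles for which every predicate could be satisfied (logical AND)."""
    return [
        p["name"]
        for p in profiles
        if all(could_match(p, *predicate) for predicate in predicates)
    ]


# "country = US and windspeed > 120"
query = [("country", "=", "US"), ("windspeed", ">", 120)]
matches = matching_profiles(PROFILES, query)
```

A profile match only says that a resource could hold matching data; the resource itself still has to be queried to get the actual products.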

Page 9: A Software Architecture for Highly Data-Intensive Systems

Components

- Product Server
  - Responsible for abstracting heterogeneous data source interfaces
  - Attach a Product Server to each data source that is integrated
  - Provides a common query interface across heterogeneous data sources
- Profile Server
  - Describes data resources using resource profiles
  - Allows data resources to be discovered and located at query time
- Query Server
  - Ties it all together
  - Uses Profile Servers to discover data resources which could potentially satisfy a query
  - Queries discovered data resources (such as Product Servers) and collects obtained data products to return to the user
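
The division of labour between the three components can be sketched in a few lines. This is a toy, in-process model under simplifying assumptions (single keyword = value queries, in-memory records); the class and method names are hypothetical and do not correspond to the OODT middleware API.

```python
class ProductServer:
    """Wraps one data source behind a common query(keyword, value) interface."""

    def __init__(self, records):
        self._records = records  # stand-in for a real database or file archive

    def query(self, keyword, value):
        return [r for r in self._records if r.get(keyword) == value]


class ProfileServer:
    """Knows which values each registered resource claims to hold."""

    def __init__(self):
        self._profiles = {}  # resource name -> {keyword: set of values}

    def register(self, name, profile):
        self._profiles[name] = profile

    def discover(self, keyword, value):
        """Names of resources whose profile says they may satisfy the query."""
        return [n for n, p in self._profiles.items() if value in p.get(keyword, ())]


class QueryServer:
    """Discovers candidate resources, queries them, and collects the results."""

    def __init__(self, profile_server, product_servers):
        self._profile_server = profile_server
        self._product_servers = product_servers  # resource name -> ProductServer

    def query(self, keyword, value):
        results = []
        for name in self._profile_server.discover(keyword, value):
            results.extend(self._product_servers[name].query(keyword, value))
        return results


western = ProductServer([{"country": "US", "state": "WA"}, {"country": "US", "state": "CA"}])
southern = ProductServer([{"country": "US", "state": "TX"}])

profiles = ProfileServer()
profiles.register("western", {"state": {"WA", "CA"}})
profiles.register("southern", {"state": {"TX", "LA"}})

server = QueryServer(profiles, {"western": western, "southern": southern})
results = server.query("state", "TX")
```

Only the Product Servers that the profiles point to are contacted, so the user never needs to know which source holds the data.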

Page 10: A Software Architecture for Highly Data-Intensive Systems

Connectors

- Messaging Layer
  - Each OODT component registers itself with a Component Registry
    - Allows Components to define and provide services
    - Components defined by unique URNs
  - Transfers OODT Query Object containing
    - OODT Style Query: (Keyword = Value) predicates joined by logical operators (AND, OR, etc.)
    - The result list to be populated
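
A minimal sketch of the connector idea follows: a registry that looks components up by URN, and a query object that carries both the predicates and the result list to be filled in. All names here (`Query`, `ComponentRegistry`, the example URN, the sample records) are hypothetical; the real messaging layer sits on distributed object middleware rather than on in-process function calls.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical query object: predicates, their joining operator, and results."""

    predicates: list  # (keyword, value) pairs, each meaning keyword = value
    operator: str = "AND"  # logical operator joining the predicates
    results: list = field(default_factory=list)  # populated by the component

    def accepts(self, record):
        checks = [record.get(keyword) == value for keyword, value in self.predicates]
        if self.operator == "AND":
            return all(checks)
        if self.operator == "OR":
            return any(checks)
        raise ValueError(f"unsupported operator: {self.operator}")


class ComponentRegistry:
    """Maps unique URNs to the components (here, plain callables) behind them."""

    def __init__(self):
        self._components = {}

    def register(self, urn, component):
        self._components[urn] = component

    def send(self, urn, query):
        """Deliver the query to the named component and return it with results."""
        self._components[urn](query)
        return query


RECORDS = [
    {"country": "US", "state": "WA"},
    {"country": "US", "state": "TX"},
    {"country": "CA", "state": "BC"},
]


def example_product_service(query):
    """Toy component: fills the query's result list from in-memory records."""
    query.results.extend(r for r in RECORDS if query.accepts(r))


registry = ComponentRegistry()
urn = "urn:example:product-server"  # made-up URN for illustration
registry.register(urn, example_product_service)

and_query = registry.send(urn, Query([("country", "US"), ("state", "WA")]))
or_query = registry.send(urn, Query([("state", "WA"), ("state", "TX")], operator="OR"))
```

Because the caller addresses components by URN and gets the same query object back, it does not need to know where a component runs or how it stores its data.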

Page 11: A Software Architecture for Highly Data-Intensive Systems

Configurations: Example

Page 12: A Software Architecture for Highly Data-Intensive Systems

Configurations: Example (2)

Page 13: A Software Architecture for Highly Data-Intensive Systems

Configurations: Example (3)

Page 14: A Software Architecture for Highly Data-Intensive Systems

Planetary Science

- Planetary Data System
  - Official NASA “Active” Archive for all Planetary Data
    - Data ingestion required as part of Announcement of Opportunity (AO) for a mission
  - 9 Nodes with data located at discipline sites
  - Common Data Architecture
  - Different data systems located at the sites
  - Prior to October 2002, no ability to find and share data between PDS nodes
    - Data distribution via CD-ROM
    - Limited electronic distribution

Page 15: A Software Architecture for Highly Data-Intensive Systems

OODT PDS Deployment

Page 16: A Software Architecture for Highly Data-Intensive Systems

Early Detection Research Network

- OODT’s success has led to interagency agreements with both NIH and NCI
- OODT has provided the NCI with a bioinformatics infrastructure for sharing data across the nation
  - Currently deployed at 10 of 31 NCI Research Institutions for the Early Detection Research Network (EDRN)
  - Providing real-time access to distributed, heterogeneous databases
  - Created a national virtual repository for biospecimens (now an NCI Director Initiative)
  - Now integrating new datasets: validation studies, images, biomarkers, etc.
  - Meets Federal security regulations
  - Operational September 2002
- Same core software framework as deployed in planetary, earth and engineering

Page 17: A Software Architecture for Highly Data-Intensive Systems

OODT EDRN Deployment

Page 18: A Software Architecture for Highly Data-Intensive Systems

Conclusion

- OODT is...
  - A novel software architecture to describe data-intensive systems
    - Integration, search, retrieval and discovery of heterogeneous data stored in heterogeneous domain data sources
  - A reference implementation of the above software architecture
    - Java-based middleware
    - C++, Perl, Python, PHP Client APIs
  - A process for annotating and creating standard metadata models to describe heterogeneous data based on data standards
    - Dublin Core
    - ISO-11179

Page 19: A Software Architecture for Highly Data-Intensive Systems

Refereed Papers

Mattmann C, Ramirez P, Crichton D, Hughes JS. Packaging Data Products using Data Grid Middleware for Deep Space Mission Systems. Accepted for publication at the 8th International Conference on Space Operations, Montreal, Canada, 2004.

Mattmann C, Freeborn D, Crichton D. Towards a Distributed Information Architecture for Avionics Data. In Proceedings of the 2nd International IADIS Conference on the World-Wide-Web and Internet, Volume II, pp 829-832. Algarve, Portugal, 2003.

Crichton D, Hughes JS, Kelly S. A Science Data System Architecture for Information Retrieval. Clustering and Information Retrieval. Kluwer Academic Publishers. December 2003. (Book chapter on OODT.)

Crichton D, Hughes JS, Kelly S, Ramirez P. A Component Framework Supporting Peer Services for Space Data Management. 2002 IEEE Aerospace Conference. Big Sky, Montana. March 2002.

Crichton D, Downing G, Hughes JS, Kincaid H, Srivastava S. An Interoperable Data Architecture for Data Exchange in a Biomedical Research Network. 14th IEEE Symposium on Computer-Based Medical Systems. July 2001.

Crichton D, Hughes JS, Hardman S, Kelly S. A Distributed Component Framework for Data Product Interoperability. 17th CODATA International Conference, Baveno, Italy. October 2000.

Crichton D, Hughes JS, Kelly S, Hyon J. Science Search and Retrieval using XML. Second National Conference on Scientific and Technical Data, Washington D.C., National Academy of Sciences. March 2000.

Page 20: A Software Architecture for Highly Data-Intensive Systems

Questions?

Contacts
- OODT Website: http://oodt.jpl.nasa.gov
- Principal Investigator: Dan Crichton ([email protected])
- Co-Investigator: Steve Hughes ([email protected])
- Programmer/Research Grunt: Me ([email protected])

Thanks for your attention!

Page 21: A Software Architecture for Highly Data-Intensive Systems

Backup Slides

Page 23: A Software Architecture for Highly Data-Intensive Systems

Object Oriented Data Technology

- Object-Oriented Data Technology (OODT)
  - Funded in 1998 by NASA’s Office of Space Science to develop a national software framework for sharing data across heterogeneous, distributed data repositories
- Develop...
  - A common data and software framework to enable data sharing across multiple science and engineering disciplines
  - A reusable software architecture across data management projects
    - Reusable software components with common interfaces
    - Interfaces to enable new components to be plugged in
    - Mechanism to wrap legacy data system components with minimal impact
- OODT should provide...
  - Science domain independence (use in engineering, science and biomedicine)
  - Data location independence (describe what you want, not how/where to get it