TRANSCRIPT
Thomas Huang
PO.DAAC Software System Engineer
Jet Propulsion Laboratory
California Institute of Technology
These activities were carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space
Administration. © 2010 California Institute of Technology. Government sponsorship acknowledged.
About me…
PO.DAAC Software System Engineer and Architect of its Data Management and Archive System
Background in planetary data management, secure near real-time distribution systems
Huang - 01062010
Outline
Pattern for data ingestion to distribution
Our legacy data system
The new PO.DAAC Data Management and Archive System
Conclusion
Q&A
Simple Pattern
Can All These Broken Pieces Fit?
Legacy Data Systems
… It Works!?
3 different data systems according to the simple pattern
Deployed in multiple instances
Mostly consists of one-off scripts
Limited reusability
Limited portability
Scalability?
Reliability?
stovepipe
Legacy Data Systems
Our New Data Management and Archive System
Software Development Process
Technologies and Standards
Documents
Architecture
A system of RESTful services
Standardized message exchange between services
Unified data model
Distributed data ingestion services
Standardized event tracking and notification service
Manager Webservice
• Transaction-Oriented
• Load-Balanced job assignment
• On-The-Fly Deployment of Engines
• Dynamic support of new data products
• State-Driven Product Management
• Resource Management
RESTful
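To make "load-balanced job assignment" concrete, here is a minimal sketch in which a manager hands each job to the engine with the most free slots. The names `Engine`, `Manager`, `max_jobs` and the selection rule are assumptions made for illustration; the slides do not describe how DMAS actually balances load.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Engine:
    """A hypothetical engine known to the manager (not a DMAS class)."""
    name: str
    max_jobs: int          # upper bound on concurrent jobs
    active_jobs: int = 0   # jobs currently assigned

    @property
    def free_slots(self) -> int:
        return self.max_jobs - self.active_jobs


class Manager:
    """Toy manager that assigns each job to the least-loaded engine."""

    def __init__(self) -> None:
        self.engines: Dict[str, Engine] = {}
        self.assignments: Dict[str, str] = {}   # job id -> engine name

    def register(self, engine: Engine) -> None:
        # Engines can be added while the manager is running.
        self.engines[engine.name] = engine

    def assign(self, job_id: str) -> Optional[str]:
        # Pick the engine with the most free slots; None if all are full.
        candidates = [e for e in self.engines.values() if e.free_slots > 0]
        if not candidates:
            return None
        best = max(candidates, key=lambda e: e.free_slots)
        best.active_jobs += 1
        self.assignments[job_id] = best.name
        return best.name

    def complete(self, job_id: str) -> None:
        # Release the slot held by a finished job.
        engine = self.engines[self.assignments.pop(job_id)]
        engine.active_jobs -= 1
```

A caller would register engines as they come up, call `assign` for each incoming job, and call `complete` when an engine reports the job finished.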
File Management Engines
RESTful
• Lightweight RESTful file service
• Supports typical file operations (add, move, delete, etc.)
• A single instance can carry out multiple granule operations in parallel
• Supports various file protocols (FTP, SFTP, FILE, HTTP… etc.)
• Tracks and limits the number of jobs it can handle
• Tracks and limits the number of outbound communications
• Typical instances: ingest, archive, and purge
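The idea of an engine that runs operations in parallel while tracking and limiting its job count can be sketched as follows. This is an illustration only: `FileEngine`, `submit`, and the choice to reject rather than queue excess jobs are assumptions, not details from the slides.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class FileEngine:
    """Toy engine that runs file operations in parallel up to a fixed limit."""

    def __init__(self, max_jobs: int) -> None:
        self._slots = threading.BoundedSemaphore(max_jobs)
        self._pool = ThreadPoolExecutor(max_workers=max_jobs)

    def submit(self, operation: Callable[[], object]) -> Optional[Future]:
        # Refuse the job instead of queueing it when every slot is taken.
        if not self._slots.acquire(blocking=False):
            return None
        future = self._pool.submit(operation)
        # Free the slot once the operation finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self) -> None:
        # Wait for running operations to finish, then stop the worker threads.
        self._pool.shutdown(wait=True)
```

Rejecting work at the engine keeps the decision about where to retry with whoever assigns the jobs, which is one plausible reading of "tracks and limits".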
Product Inventory
• Unified Metadata Data Model
• References applicable models (e.g. ISO 19115, DIF, ECHO, GCMD…)
• Extensible to support capturing of collection/dataset/granule-specific data attributes
• Supports geospatial data
• Supports project-specific data archive and distribution policies
Data Handlers
An application framework
Plugin interface for product-specific metadata handling and validation
Transforming product metadata into internal Submission Information Package (SIP)
Data discovery
Local caching of data products
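A plugin interface of this kind might look roughly like the sketch below: each product supplies a handler that validates its metadata and translates it into a SIP-like structure. The class names, the required fields, and the dictionary layout are all hypothetical; the slides do not show the real DMAS handler API or SIP schema.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DataHandler(ABC):
    """Hypothetical plugin interface for product-specific metadata handling."""

    @abstractmethod
    def validate(self, metadata: Dict[str, str]) -> List[str]:
        """Return a list of problems; an empty list means the metadata is valid."""

    @abstractmethod
    def to_sip(self, metadata: Dict[str, str]) -> Dict[str, object]:
        """Translate product metadata into a SIP-like dictionary."""


class ExampleHandler(DataHandler):
    """Illustrative handler; the required fields are made up for this sketch."""

    REQUIRED = ("dataset", "granule_name", "start_time")

    def validate(self, metadata: Dict[str, str]) -> List[str]:
        return [f"missing field: {key}" for key in self.REQUIRED if key not in metadata]

    def to_sip(self, metadata: Dict[str, str]) -> Dict[str, object]:
        problems = self.validate(metadata)
        if problems:
            raise ValueError("; ".join(problems))
        return {
            "dataset": metadata["dataset"],
            "granule": {"name": metadata["granule_name"], "start": metadata["start_time"]},
        }


# Registry mapping a product name to its handler, so a new product plugs in
# by adding an entry rather than changing the framework.
HANDLERS: Dict[str, DataHandler] = {"example": ExampleHandler()}
```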
Data Handlers - GHRSST
• Adaptation
– MMR validation and translation
– Data file validation
– Scans local/remote locations for new data
– Integration with back-end RDAC cluster
• Inventory
– Full migration from existing MySQL database
• Port to use the new data model
– FGDC and Index generators
– Website
The Group for High-Resolution Sea Surface Temperature (GHRSST)
Ingest and maintain interfaces to 52 GHRSST L2P/L3P/L4 datastreams from 10 Regional Data Assembly Centers (RDACs)
~25 GB/day, >5000 granules/day
Realtime quality checking for data and metadata granules
Create Federal Geographic Data Committee metadata for daily collection granules
Distribution via FTP/OPeNDAP/POET
Maintain interfaces to the LTSRF for 30-day old data and metadata exchange
Data Handlers - ASCAT
• Adaptation
– Metadata validation and translation
– Data file validation
– Scans remote locations for new data
• Dataset definition and policies
The Advanced SCATterometer (ASCAT)
Ingest and maintain interfaces to 2 L2 datastreams from KNMI
~57 MB/day, ~21 GB/year
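The two ASCAT volume figures are consistent with each other, as a quick calculation shows. The sketch assumes a 365-day year and decimal units (1 GB = 1000 MB); the slide states neither.

```python
MB_PER_DAY = 57          # approximate daily volume from the slide
DAYS_PER_YEAR = 365      # assumption: non-leap year

mb_per_year = MB_PER_DAY * DAYS_PER_YEAR
gb_per_year = mb_per_year / 1000   # decimal gigabytes (1 GB = 1000 MB)

print(f"{mb_per_year} MB/year = {gb_per_year:.1f} GB/year")
# prints "20805 MB/year = 20.8 GB/year", which rounds to the quoted ~21 GB/year
```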
Significant Event WS
Significant Event Web
DAAC in a Box?
“premature optimization is the root of all evil.”
Donald Knuth, “The Art of Computer Programming”
Sample Performance
Ingest: 3 (36 parallel jobs)
Archive: 3 (36 parallel jobs)
Purge: 2 (20 parallel jobs)
21,254 granules/day
4 seconds/granule
Implementation Optimization
Database Performance Tuning
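The two throughput numbers can be related with simple arithmetic. The sketch assumes that "4 seconds/granule" is the average interval over a full 24-hour day rather than the processing time of a single job; the slide does not say which.

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
GRANULES_PER_DAY = 21_254           # figure from the slide

# Average wall-clock time available per granule over one day.
seconds_per_granule = SECONDS_PER_DAY / GRANULES_PER_DAY

# The same rate expressed per hour.
granules_per_hour = GRANULES_PER_DAY / 24

print(f"{seconds_per_granule:.2f} s/granule, {granules_per_hour:.0f} granules/hour")
```

Under that assumption the result is just over 4 seconds per granule, which matches the slide's rounded figure.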
Conclusion
PO.DAAC DMAS
• A system of RESTful webservices
• Scalable
• Portable
• Extensible
• Operationally supports GHRSST and ASCAT
Future work
• New products: Aquarius
• GHRSST GDS 2.0 metadata model
• Migration
• Data subscription
• Administration tools
BACKUP SLIDES
FY ‘09 Highlights
• Webservice Architecture
• Data Ingestion and Archive WS
• Distributed Ingestion/Archive Engines
• Load Balancing
• Service Monitoring
• Significant Event WS
• Suite of reusable components
• ECHO publication Dataset and Granule metadata
• GHRSST
• ASCAT L2
ASCAT
Huang - 09022009
Product Subscription
Enable implementation of value-added services
Archive Tools
Metadata Distribution
… can we build a data system with all these characteristics?
Scalable
Simple
Speed
Standardize
Our Challenge
• Load-Balance
• Transaction-Oriented
• On-The-Fly Deployment of Engines
• Dynamic support of new data products
• Scalable
• State-Driven Job Management
DMAS – Ingestion and Archive Service
DMAS – Significant Event Service
Swath Tiler
Metadata Submission
• Dataset subscriber
• Triggered by newly archived granules
• Dispatch swath tiling program
• Submit tiling metadata to NAIAD WS
DMAS – Data Subscriber Integration with NAIAD
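The subscriber flow on this slide (subscribe to a dataset, get triggered when a granule is archived, run a program, submit the resulting metadata) can be sketched with a small callback registry. Everything here is hypothetical: `ArchiveEvents`, `tile_and_submit`, the granule dictionary, and the in-memory `submitted` list that stands in for a call to the NAIAD web service.

```python
from typing import Callable, Dict, List

Granule = Dict[str, str]
Callback = Callable[[Granule], None]


class ArchiveEvents:
    """Toy event hub: subscribers register per dataset and are called on archive."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = {}

    def subscribe(self, dataset: str, callback: Callback) -> None:
        self._subscribers.setdefault(dataset, []).append(callback)

    def granule_archived(self, granule: Granule) -> int:
        # Notify every subscriber of the granule's dataset; return how many ran.
        callbacks = self._subscribers.get(granule["dataset"], [])
        for callback in callbacks:
            callback(granule)
        return len(callbacks)


submitted: List[Granule] = []   # stands in for metadata posted to a web service


def tile_and_submit(granule: Granule) -> None:
    # Placeholder for dispatching a tiling program and submitting its metadata.
    tiles = {"granule": granule["name"], "tiles": "not computed in this sketch"}
    submitted.append(tiles)


events = ArchiveEvents()
events.subscribe("example-dataset", tile_and_submit)
events.granule_archived({"dataset": "example-dataset", "name": "g001"})
```

Granules from datasets with no subscriber are simply ignored, so adding a new value-added service only requires another `subscribe` call.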
DMAS Goals
• Service tools
– administration
– product rollout
– contact management
• New data subscription capability
– Making DMAS the data hub - RSS feed, automatic delivery of new granules, thumbnail generation… etc.
• New dataset search capability
– evaluating VODC – ACCESS program
• New data products
• Legacy migration support
• Planning 4 DMAS releases
FY ’10
2 System Releases (DMAS + T&S)
Configuration Management
• How to manage
– versions of third-party software
– dependency matrix
– upgrade to one or more third-party software
• Standard development process between development teams
– change management
– software packaging
– dependency management
• Standard build and deployment process
FY ’10
CM?