
Emerging Provenance/Context Content Standard

Discussion at Data Preservation and Stewardship Cluster Session at ESIP Federation Meeting

July 14, 2011

H. K. “Rama” Ramapriyan and John Moses, ESDIS Project

NASA Goddard Space Flight Center
[email protected]; [email protected]

Outline
• Why do we need a standard? (reaffirm)
• Background
• Status
• Content Matrix
• Next steps
  – Draft introductory text material
  – Send matrix for broader review – NASA mission teams – satellite and aircraft investigations, MEaSUREs teams, DAAC UWGs, USGS, NOAA, EPA(?), ESA(?)
  – Decide on IEEE or ISO route to proceed further

Why a Standard?
• Need to understand and document what content is essential for long-term preservation of information from Earth observations
• Common approach and consistency across organizations (national and international) will be beneficial to ensure future long-term archives preserve necessary content
  – Data needed for long-term science studies come from multiple organizations
• Standard will enable new missions to plan for preservation of required content
  – Less expensive to plan ahead than to retrofit (retrofitting may be impossible)
• Assessing compliance with the standard helps determine the utility of datasets for long-term science studies
  – Standard is worth having even if compliance is not 100% (depends on capabilities, requirements, budgets of organizations)
• Standard should provide priorities for content – e.g., critical, essential, desirable
• Other reasons??

Background
• Proposed initiating standard activity at January 2011 ESIP Meeting
  – Attendees were very supportive of the idea
  – Agreed we should develop a matrix – candidate columns defined
  – Don’t know of an existing standard that covers enumeration of all content that must be preserved
    • May be some overlaps with ISO 19115, but need to identify them item by item
• We are focused on “what” not “how”
• Current title of standard: “Provenance and Context Content Standard for Data Supporting Global Change Research”
  – We propose simplifying to: “Preservation Content Standard for Data Supporting Global Change Research”

Status (1 of 2)
• Received inputs from Ted Habermann et al. (NOAA)
• Merged with NASA’s inputs based on the USGCRP Workshop (1998) Report, and discussions (2010–2011) with EOS instrument teams (GLAS, HIRDLS) and the TOMS instrument PI
  – Note – the USGCRP workshop, jointly sponsored by NASA and NOAA, identified a number of scenarios from which content recommendations were derived
• Content matrix was developed and posted on the ESIP Data Stewardship and Preservation Cluster wiki
  – Initial version – March 1, 2011
  – Latest version (incorporates comments from cluster members) – June 8, 2011
  – Focused mostly on satellite remote sensing data; need to ensure we cover other types of data (aircraft, in situ)

Status (2 of 2)
• ESA/NASA Earth Science Framework for Cooperation
  – NASA and ESA are discussing collaboration and/or coordination in various areas, including data systems
  – Subgroup 3 – Ground Segments & Data
    • Goals: Collaborate between ESA and NASA ground segments and data systems to enhance mission return and to enable efficient development and generation of multi-mission/multi-agency data products
  – ESA is very interested in coordinating the Content Standard with us
  – ESA has provided us with their documents related to Long-Term Data Preservation (LTDP)
  – Provides an opportunity to compare notes, be more comprehensive and avoid duplication – could ease the way to an international standard
  – We have compared our matrix with the ESA documents and identified only a few differences in content
  – We will share the matrix with ESA after this meeting

Summary: Earth Science Content Information

• OAIS Reference Model terms: the Content Information Object includes a Content Data Object with Representation Information for the Designated Community – Earth Science researchers
  – Preservation Description Information types: Provenance (source of info), Context (relation to other info), Reference (identifiers), Fixity (e.g., checksum) – see the sketch below
  – PCCS is not covering Reference and Fixity explicitly – should it?
• PCCS categories are defined to highlight the important information needed for Earth Science research
  – To identify the flow of information through an instrument’s life cycle
  – To develop a user’s sense of what should be included as provenance and context information about a particular instrument’s dataset
  – Extendable to classes of instruments and measurement systems
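As a loose illustration (not part of the original slides), the sketch below shows one way the four OAIS Preservation Description Information types could be carried alongside an archived product, including the Fixity checksum the slide asks about. All field names are hypothetical.

```python
# Hypothetical sketch of OAIS Preservation Description Information (PDI)
# attached to an archived Earth Science product. Field names are illustrative.
import hashlib
from dataclasses import dataclass, field


@dataclass
class PreservationDescription:
    provenance: list[str]                   # source of the information (e.g., processing history)
    context: list[str]                      # relation to other information (e.g., related datasets)
    reference: list[str]                    # identifiers (e.g., DOIs, granule IDs)
    fixity: dict[str, str] = field(default_factory=dict)  # file name -> checksum


def add_fixity(pdi: PreservationDescription, path: str) -> None:
    """Record a SHA-256 checksum so later integrity checks can detect corruption."""
    with open(path, "rb") as f:
        pdi.fixity[path] = hashlib.sha256(f.read()).hexdigest()
```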

Content Matrix – Introduction (1 of 3)
• Using column headings discussed at January 2011 ESIP meeting (mostly)
• Each row corresponds to a content item and provides details
• Content items are mapped into 8 categories (see later chart)
• One or more content items are defined in each of the categories
• Column headings
  – Item Number (C.N – category and number within category)
  – Category
  – Content Item Name
  – Definition / Description
  – Rationale (why content is needed)
  – Criteria (how good content should be)
  – Priority (H, M, L or critical, essential, desirable)

Content Matrix – Introduction (2 of 3)
• Column headings (cont.)
  – Source (who should provide content item)
  – Project phase for capture
  – User community (who would be most likely to need the content item – this column is mostly blank in this version; needs group inputs)
  – Representation (while the focus is on “what”, brief comments are included here on whether items are word files, numeric files, pointers, etc.)
  – Distribution restrictions (potential proprietary or ITAR concerns associated with content item)
  – Source identifying item (where content item came from – NASA, NOAA or both)

Content Matrix – Introduction (3 of 3)
• Further level of detail is possible – components within content items
  – Different subsets of components are needed under different circumstances
  – Discriminators that determine such subsets (see the sketch below):
    • Platform type (satellite, aircraft, flux tower, buoy, etc.)
    • Instrument type
    • Measurement type
    • Product level
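To make the column structure and the discriminator idea concrete, here is a rough sketch of how a single matrix row with discriminator-dependent components might be represented. This is purely illustrative, not part of the proposed standard; the example values for rationale, criteria, source, phase and restrictions are hypothetical.

```python
# Illustrative sketch of one PCCS matrix row; field names mirror the column
# headings on the previous slides, and the example values are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    item_number: str                 # C.N - category and number within category
    category: str                    # one of the 8 categories
    name: str
    definition: str
    rationale: str
    criteria: str
    priority: str                    # H / M / L (or critical / essential / desirable)
    source: str                      # who should provide the content item
    project_phase: str               # project phase for capture
    representation: str              # word files, numeric files, pointers, ...
    distribution_restrictions: str   # proprietary / ITAR concerns, if any
    # Components, keyed by a discriminator such as platform or instrument type
    components: dict[str, list[str]] = field(default_factory=dict)


instrument_description = ContentItem(
    item_number="1.1",
    category="Preflight/Pre-Operations",
    name="Instrument Description",
    definition="Documentation of instrument/sensor characteristics, including "
               "pre-flight or pre-operational performance measurements",
    rationale="Needed to interpret the long-term data record",       # hypothetical
    criteria="Complete enough for an independent researcher to use",  # hypothetical
    priority="H",
    source="Instrument team",                                         # hypothetical
    project_phase="Pre-flight / pre-operations",                      # hypothetical
    representation="Documents; pointers to reports",                  # hypothetical
    distribution_restrictions="Possible ITAR concerns",               # hypothetical
    components={
        "Remote sensor": ["Specifications", "Platform geometry", "Spectral response",
                          "Radiometric response", "Noise characteristics"],
        "In situ": [],
    },
)
```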

Categories

1. Preflight/Pre-Operations: Instrument/sensor characteristics including pre-flight/pre-operations performance measurements; calibration method; radiometric and spectral response; noise characteristics; detector offsets
2. Products (Data): Raw instrument data, Level 0 through Level 4 data products and associated metadata
3. Product Documentation: Structure and format with definitions of all parameters and metadata fields; algorithm theoretical basis; processing history and product version history; quality assessment information
4. Mission Calibration: Instrument/sensor calibration method (in operation) and data; calibration software used to generate lookup tables; instrument and platform events and maneuvers
5. Product Software: Product generation software and software documentation
6. Algorithm Input: Any ancillary data or other data sets used in generation or calibration of the data or derived product; ancillary data description and documentation
7. Validation: Record and data sets
8. Software Tools: Product access (reader) tools

1. Pre-Operational

Content Name: Instrument Description
Description: Documentation of instrument/sensor characteristics, including pre-flight or pre-operational performance measurements.
Discriminator/Components: Instrument Type
  • Remote sensors: specifications; platform geometry; spectral response; radiometric response; noise characteristics
  • In situ
Priority: H

Content Name: Calibration Data
Description: Numeric (digital) files of instrument/sensor characteristics, including pre-flight or pre-operational performance measurements.
Discriminator/Components: Instrument Type
  • Measured spectral response data
  • Measured radiometric data
  • Measured ambient noise
Priority: L-M

2. Earth Science Data Products (1)

Content Name: Raw data or Level 0 data products
Description: Raw data as measured by a spaceborne, airborne or in situ instrument; Level 0 data is the reconstructed, unprocessed instrument and payload data at full resolution, with any and all communications artifacts removed.
Discriminator/Components: Instrument Type
  • Observations, platform ephemeris and metadata in multiple files
Priority: L-M*

Content Name: Level 1A data products
Description: Reconstructed, unprocessed instrument data at full resolution, time-referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters computed and appended but not applied to the Level 0 data.
Discriminator/Components: Instrument Type
Priority: L-H*

Content Name: Level 1B data
Description: Level 1A data that have been processed to sensor units.
Discriminator/Components: Instrument Type
Priority: H

* Priorities depend on choice to keep L0 or L1A

2. Earth Science Data Products (2)

Content Name: Level 2 data
Description: Derived geophysical variables at the same resolution and location as the Level 1 source data.
Discriminator/Components: Instrument Type
  • Remote sensor
  • In situ
Priority: H

Content Name: Level 3 data
Description: Variables mapped on uniform space-time grid scales, usually with some completeness and consistency.
Discriminator/Components: Assimilation Type
  • Single Level 2 input
  • Multiple Level 2 inputs
Priority: H

Content Name: Level 4 data
Description: Model output or results from analyses of lower-level data.
Discriminator/Components: Model Input Type
  • Single instrument
  • Multiple instruments
  • Remote sensor + in situ
Priority: H

Content Name: Metadata
Description: Information about data to facilitate discovery, search, access, understanding and usage, associated with each of the data products. Links the product to the algorithm version (see the sketch below).
Discriminator/Components: Product Data Level; Instrument Type; Assimilation Type; Model Input Type
Priority: H
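As a rough, hypothetical illustration of the kind of discovery metadata record this item describes – linking a product to the algorithm version that produced it – the field names below are illustrative and do not follow any specific NASA or NOAA metadata model.

```python
# Hypothetical discovery metadata record for one data product; field names
# and values are illustrative only.
product_metadata = {
    "product_name": "EXAMPLE_L2_AEROSOL",        # hypothetical product
    "data_level": "L2",
    "instrument_type": "Remote sensor",
    "temporal_coverage": ("2011-01-01", "2011-12-31"),
    "spatial_coverage": {"lat": (-90.0, 90.0), "lon": (-180.0, 180.0)},
    "format": "HDF5",
    "algorithm_version": "v2.3",                  # links product to algorithm version
    "algorithm_theoretical_basis_doc": "pointer/to/ATBD",
    "contact": "data center help desk",
}
```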

3. Product Documentation (1)

Content Name: Product Team
Description: State the product team members, roles, contact information and period of responsibility.
Discriminator/Components: Team Type
  • Investigator
  • Development
  • Help Desk
  • Operations
Priority: H

Content Name: Product Requirements
Description: The project's requirements for each product, either explicitly or by reference to the project's requirements document, if available.
Discriminator/Components: Requirements Type
  • Content
  • Format
  • Latency
  • Accuracy
  • Quality
Priority: M

Content Name: Product Development History
Description: Major product development steps and milestones, with links to other relevant items that are part of the preserved provenance and context contents.
Discriminator/Components: Instrument Type; Assimilation Type; Model Input Type
Priority: M

3. Product Documentation (2)

Content Name: Processing History
Description: Documentation of processing history and production version history, indicating which versions were used when, why different versions came about, and what the improvements were from version to version.
Discriminator/Components: Instrument Type; Product Level
Priority: H

Content Name: Algorithm Version History
Description: Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive. Granule-level metadata should indicate which version of software was used for producing a given granule (see the sketch below). Links the product to the algorithm version.
Discriminator/Components: Instrument Type; Product Level
Priority: H
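A minimal sketch of what granule-level version traceability could look like in practice (hypothetical field names; not an implementation prescribed by the standard): each granule carries the software version used to produce it, and an archive can flag granules where that link is missing.

```python
# Hypothetical granule-level provenance check: every granule should record the
# version of the production software that generated it.
granules = [
    {"granule_id": "G0001", "product_version": "003", "software_version": "v2.3"},
    {"granule_id": "G0002", "product_version": "003", "software_version": None},
]

missing = [g["granule_id"] for g in granules if not g.get("software_version")]
if missing:
    print("Granules missing an algorithm/software version link:", missing)
```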

3. Product Documentation (3)

Content Name: Maintenance History
Description: Excerpts and/or references to maintenance documentation deemed of value to product users. Maintenance reports.
Priority: M

Content Name: Operations History
Description: Excerpts and/or references to operations documentation deemed of value to product users. Operations event logs.
Priority: H

Content Name: Product Generation Algorithms
Description: Processing algorithms and their scientific and mathematical basis, including a complete description of any sampling or mapping algorithms used in creation of the product – geo-location, radiometric calibration, geophysical parameters – plus algorithm software documentation, the ATBD and high-level data flow diagrams.
Discriminator/Components: Product Level
  • Algorithm Output
  • Algorithm Performance Assumptions
  • Error Budget
  • Numerical Computation Considerations
Priority: H

3. Product Documentation (4)

Content Name: Product Quality
Description: Documentation of product quality assessment (methods used, assessment summaries for each version of the datasets). Description of embedded data at the granule level including quality flags, product data uncertainty fields, data issues logs, etc.
Discriminator/Components: Instrument Type; Product Level
  • Product Accuracy
  • Sensor Effects
Priority: H

Content Name: Quality Assessment and Potential Algorithm Improvements
Description: Describe potential future enhancements to the algorithm, the limitations they will mitigate, and provide all possible and useful related information and links.
Discriminator/Components: Instrument Type
Priority: H

Content Name: References
Description: A bibliography of pertinent technical notes and articles, including refereed publications reporting on research using the data set.
Priority: H

Content Name: User Feedback
Description: Information received back from users of the data set or product.
Priority: L

Product Generation Algorithms – Components

Component: Algorithm Output
Description: Describe the output data products – not format – at a level of detail sufficient to determine whether the product meets user requirements.
Priority: H

Component: Algorithm Performance Assumptions
Description: Describe all assumptions that have been made concerning the algorithm performance estimates. Note any limitations that apply to the algorithms (e.g., conditions where retrievals cannot be made or where performance may be significantly degraded). To the extent possible, the potential for degraded performance should be explored, along with mitigating strategies.
Priority: H

Component: Error Budget
Description: Organize the various error estimates into an error budget. Error budget limitations should be explained. Describe prospects for overcoming error budget limitations with future maturation of the algorithm, test data, and error analysis methodology.
Priority: H

Component: Numerical Computation Considerations
Description: Describe how the algorithm is numerically implemented, including possible issues with computationally intensive operations (e.g., large matrix inversions, truncation and rounding).
Priority: M

Product Quality – Components

Component: Product Accuracy
Description: Accuracy of products, as measured by validation testing, and compared to accuracy requirements. References to relevant test reports.
Priority: H

Component: Sensor Effects
Description: Flowed-through effects of sensor noise, calibration errors, spatial and spectral errors, and/or un-modeled or neglected geophysical phenomena on the quality of products.
Priority: H

4. Mission Calibration (1)

Content Name: Instrument/Sensor Calibration during Mission
Description: Instrument/sensor calibration method (in operation) – radiometric calibration; spectral response/calibration; noise characteristics; geo-location.
Discriminator/Components: Instrument Type
Priority: H

Content Name: In-Situ Measurement Environment
Description: In the case of Earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record.
Discriminator/Components: Platform Type
  • Aircraft
  • Balloon
  • Station/Tower/Buoy
Priority: H

Content Name: Mission Platform History
Description: Instrument events and maneuvers; attitude and ephemeris; aircraft position; event logs.
Discriminator/Components: Platform Type
  • Satellite
  • Aircraft
Priority: H

4. Mission Calibration (2)

Content Name: Mission Calibration Data
Description: Instrument/sensor calibration data – radiometric calibration; spectral response/calibration; noise characteristics; geo-location.
Discriminator/Components: Instrument Type
Priority: M

Content Name: Calibration Software
Description: Source code used in applying calibration to generate look-up tables and/or parameters needed for producing calibrated products.
Discriminator/Components: Instrument Type
Priority: M

5. Product Software (1)

Content Name: Product Generation Algorithms
Description: Source code used to generate products at all levels.
Discriminator/Components: Instrument Type
Priority: H

Content Name: Output Dataset Description
Description: For each output data file, details on the data product's structure, format/type, range of values and special error values. Include data volume and file size. All information needed to verify that the required output data are created by a run – verify that all expected datasets are produced in the expected format (see the sketch below).
Priority: H

Content Name: Programming & Procedural
Description: Describe any important programming and procedural aspects related to implementing the algorithm into operating code.
Priority: M-H

Content Name: Exception Handling
Description: List the complete set of expected exceptions, and describe how they are identified, trapped, and handled.
Priority: H
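A loose sketch of the kind of check the Output Dataset Description item enables – verifying that a production run created all expected datasets in the expected format. File names and expectations are hypothetical, not part of the standard.

```python
# Hypothetical post-run check: confirm that a production run created every
# expected output file and that each file uses the expected format (by suffix).
from pathlib import Path

# Expected outputs and formats would come from the preserved output dataset
# description; these values are purely illustrative.
expected_outputs = {
    "EXAMPLE_L2_AEROSOL_20110714.h5": ".h5",
    "EXAMPLE_L2_AEROSOL_20110714.met": ".met",
}

run_dir = Path("output/run_001")   # hypothetical run directory
problems = []
for name, suffix in expected_outputs.items():
    path = run_dir / name
    if not path.exists():
        problems.append(f"missing: {name}")
    elif path.suffix != suffix:
        problems.append(f"unexpected format: {name}")

print("OK" if not problems else problems)
```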

5. Product Software (2)

Content Name: Test Data Description
Description: Description of data sets used for software verification and validation, including unit tests and system tests, either explicitly or by reference to the developer's test plans, if available. This will be updated during operations to describe test data for maintenance.
Priority: L-M

Content Name: Unit Test Plans
Description: Description of all test plans that were produced during development, including links or references to the artifacts.
Priority: L-M

Content Name: Test Results
Description: Description of testing performed during development and the results, either explicitly or by reference to test reports. If test reports are not available to external users, provide a summary of the test results in sufficient detail to give external users a good sense of how the test results indicate that the products meet requirements.
Priority: L-M

6. Algorithm Input

Content Name: Algorithm Input Documentation
Description: Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product, either explicitly or by reference to appropriate documents. Information should include a full description of the input data and their attributes, covering all input data used by the algorithm, including primary sensor data, ancillary data, forward models (e.g., radiative transfer models, optical models, or other models that relate sensor observables to geophysical phenomena) and look-up tables.
Priority: H

Content Name: Algorithm Input Data
Description: At the granule level, include information on all inputs (including ancillary or other data granules, calibration files, look-up tables, etc.) that were used to generate the product (see the sketch below). At the appropriate level (granule or dataset), include calibration parameters; precision orbit and attitude data; climatological norms, geophysical masks or first-guess fields; spectrum and transmittance information; and numerical weather or climate model inputs.
Priority: H
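A minimal, hypothetical sketch of recording the inputs used to generate a single granule (all names are illustrative; the standard specifies what to capture, not how):

```python
# Hypothetical record of everything used to generate one granule, so the
# product can later be traced back to its exact inputs.
granule_inputs = {
    "granule_id": "EXAMPLE_L2_AEROSOL_20110714_0300",    # hypothetical granule
    "primary_sensor_data": ["EXAMPLE_L1B_20110714_0300.h5"],
    "ancillary_data": ["NWP_forecast_20110714_00z.grb"],
    "calibration_files": ["cal_lut_v2.3.h5"],
    "look_up_tables": ["aerosol_phase_function_lut_v5.nc"],
    "orbit_attitude": "precision_orbit_20110714.eph",
    "software_version": "v2.3",
}
```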

7. Validation

Content Name: Validation Datasets
Description: Description of the validation process, including identification of validation data sets; cal/val plans and status; detailed history of validation activities.
Discriminator/Components: Product Level
Priority: L-H

Content Name: Validation Record
Description: Validation data sets along with metadata.
Discriminator/Components: Product Level
Priority: H

8. Software Tools

Content Name: Tools for Users
Description: Readers and data analysis tools (a reader sketch appears below).
Discriminator/Components: Product Level
Priority: H
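As an illustration of what a preserved "reader" tool can look like, here is a minimal sketch using the h5py library. The file and dataset names are hypothetical, and products are not necessarily HDF5.

```python
# Minimal hypothetical product reader: opens an HDF5 granule and returns one
# science dataset as an array. File and dataset names are illustrative only.
import h5py


def read_variable(path: str, dataset: str):
    """Read a named dataset from an HDF5 product file."""
    with h5py.File(path, "r") as f:
        return f[dataset][...]


# Example (hypothetical paths):
# aerosol = read_variable("EXAMPLE_L2_AEROSOL_20110714.h5", "/Geophysical/AOD")
```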

Next Steps (1 of 2)
• Review content matrix in this meeting – consider appropriateness of level of detail
• Form small team (4 or 5 volunteers willing to review drafts in detail) to carry forward; incorporate results from this meeting on use cases
• Look at ISO 19115, SensorML, policy changes needed (?)
• Draft introductory text material to include:
  – Need for content standard
  – ESIP Data Stewardship and Preservation Cluster – definition/mission/charter
  – Cluster’s motivation to promote the standard
  – Scope of standard
  – Description of matrix
  – Request for review
    • Include how review inputs should be provided
    • Reviewers should look at the list from the points of view of providing data as well as using someone else’s data

Next Steps (2 of 2)
• “User test” with a small group of data providers
• Send matrix with introductory text for broader review – NASA mission teams – satellite and aircraft investigations, MEaSUREs teams, DAAC UWGs, USGS, NOAA Data Centers, NOAA CDR Program PIs, EPA(?), ESA(?)
• Provide same material to NASA ESDSWG for consideration with its “best practices” submission
• Decide on IEEE or ISO route to proceed further – may need to augment small team at this point

Back-up Charts

Proposal from NASA ESDSWG Technology Infusion Working Group (TIWG)

• Concept of a provenance and context standard was discussed at the ESDSWG meeting in October
• Tentatively agreed to develop a “Best Practice” Technical Note for NASA Standards Process Group (SPG) consideration
  – Contents to include:
    • Items from the USGCRP 1998 Workshop Report
    • 1-2 paragraph justification and best practice for each item
  – Process:
    • Develop Technical Note
    • Submit RFC to SPG
    • SPG initial screening
    • Public evaluation
  – Process expected to complete in about 1 year

IEEE Standard Development Process (~2 years)*
• Form nucleus of a working group (core group of people interested in developing the standard)
• Submit Project Authorization Request (PAR) to the New Standards Committee (NesCom)
  – Relatively short form specifying the scope, purpose, and contact points for the new project
  – NesCom meets 4 times a year, but can start the review of the PAR shortly after it is submitted
• Mobilize working group
  – Individual-based IEEE working groups (one individual, one vote) are open to anyone to participate – participants don’t have to be IEEE or IEEE-SA members
  – Entity-based IEEE working groups (one entity, one vote) have specific membership requirements for an entity to observe or attain membership and voting rights. There are legal issues related to patents and such that must be dealt with, but the IEEE Standards Association (SA) provides staff support to help with this
• Draft the standard
  – IEEE-SA staff provide many tools and direct support in drafting standards
  – Involvement of the Quantities, Units and Letter Symbols Committee and the IEEE-SA Staff Editor is mandatory
• Ballot the standard
  – When the working group has determined that a draft is mature enough, it submits the draft to the IEEE sponsor (SCC40, chaired by SJS Khalsa)
  – 75% affirmative ballots are needed to proceed to the approval stage
• Approve the standard
  – Final approval of an IEEE standard is achieved by submitting the document and supporting material to the IEEE-SA Standards Board Standards Review Committee (RevCom), which issues a recommendation to the IEEE-SA Standards Board
• Maintain the standard
  – A standard has a validity period of five years from the date of approval by the IEEE-SA Standards Board. At the end of the five-year validity period, one of three things has to happen: revision, reaffirmation, or withdrawal
  – During this five-year validity period, amendments and corrigenda may need to be developed that offer minor additions/revisions to the standard

*Thanks to Siri Jodha Singh Khalsa

ISO Standard Development Process (~4 years)*
• An INCITS/L1** member or members prepares a new work item proposal (NWIP)
• The NWIP is voted on by INCITS/L1 members during a one-month voting period, during which clarification questions need to be answered
• If the NWIP is approved by INCITS/L1, it is voted on by the INCITS Executive Board; as a U.S. contribution it is then sent to the ISO TC 211 secretariat by INCITS/L1. TC 211 sends the proposal to member countries for a one-month informal review
• If more than half of the voting member countries support the project and at least 5 members agree to participate in the project by nominating experts to the project team, the project is approved
• The project chair, editor, and the nominated experts form the project team to draft the working draft (WD) of the standard. There are normally multiple versions of the WD
• When the project team is satisfied with the WD, the final version of the WD is sent out by TC 211 to member countries for Committee Draft (CD) voting
• If the CD vote is approved, the project team is dissolved and an editing committee (EC) is formed. The project chair will be the chair of the EC, and the ISO TC 211 Working Group 6 convener will become the EC chair
• The edited CD is put up for vote again to advance to the Draft International Standard (DIS) stage. If approved, the EC holds a final meeting to edit the document to DIS based on voting comments. If not approved, a new version of the CD needs to be produced
• After the document becomes a DIS, the next step is the Final Draft International Standard (FDIS)
• FDIS to International Standard (IS): the document automatically advances from FDIS to IS after 6 months without major objections. Only minor editorial changes are allowed from FDIS to IS

**International Committee for Information Technology Standards – Geographic Information Systems

*Thanks to Liping Di