Mapping Physical Formats to Logical Models to Extract Data and Metadata
Tara Talbott, IPAW '06


TRANSCRIPT

Page 1: Mapping Physical Formats to Logical Models to Extract Data and Metadata

Tara Talbott, IPAW '06

Page 2: The problem & solutions

Wide range of files and formats
  Standard formats
  Prescriptive parsers
  Arbitrary formats

Machines need to merge, parse, and generally comprehend these various formats

Potential solutions:
  Data must adhere to a pre-specified format
  Customized programs are written for each format and version
  Users describe the format of their data and use tools to convert the data to a widely used and machine-understandable format (e.g. XML)

Page 3: Descriptive Parser solution: DFDL

Data Format Description Language (DFDL)

Uses XML Schema with DFDL-specific annotations to describe the underlying data and how to transform it into a logical model.

Example input: "5, 9.35091E+02, 2.63227E+02, -6.20633E+07"

Logical model produced from it:

  <step id="5">
    <density unit="kg/m**3">935.091</density>
    <temp>263.227</temp>
    <pressure>-6.20633E7</pressure>
  </step>
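For contrast with the descriptive approach, the mapping in this example can be sketched as a hand-written (prescriptive) converter. This is only an illustration: the function name `record_to_step` is hypothetical, the field order is hard-coded from the example record, and the numeric text Python emits may be formatted differently from the slide (the values are the same).

```python
import xml.etree.ElementTree as ET


def record_to_step(line: str) -> ET.Element:
    """Convert one 'id, density, temp, pressure' text record into a <step> element.

    The field order and the fixed unit are hard-coded here, which is exactly
    what a descriptive parser tries to avoid.
    """
    step_id, density, temp, pressure = [field.strip() for field in line.split(",")]
    step = ET.Element("step", id=str(int(step_id)))
    density_el = ET.SubElement(step, "density", unit="kg/m**3")
    density_el.text = repr(float(density))
    ET.SubElement(step, "temp").text = repr(float(temp))
    ET.SubElement(step, "pressure").text = repr(float(pressure))
    return step


step = record_to_step("5, 9.35091E+02, 2.63227E+02, -6.20633E+07")
print(ET.tostring(step, encoding="unicode"))
```

Every new format or version would need another function like this one, which is the maintenance cost the format-description approach is meant to remove.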

Page 4: Example DFDL Schema

  <xs:element name="step">
    <xs:annotation>
      <xs:appinfo>
        <dfdl:repType>text</dfdl:repType>
        <dfdl:charset>UTF-8</dfdl:charset>
        <dfdl:separator>,</dfdl:separator>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name="density">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:float">
                <xs:attribute name="unit" type="xs:string" fixed="kg/m**3"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
        <xs:element name="temp" type="xs:float"/>
        <xs:element name="pressure" type="xs:float"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
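To make the idea of a schema-driven parser concrete, here is a minimal sketch that reads the separator and element names out of an annotated schema and uses them to split a record. It is not Defuddle's implementation. The `dfdl` namespace URI, the simplified schema, the function name `parse_with_schema`, and the rule that the first field fills the `id` attribute are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DFDL = "urn:example:dfdl"  # hypothetical namespace URI, not taken from the slides

# Simplified version of the slide's schema (unit attribute and charset omitted).
SCHEMA = f"""
<xs:schema xmlns:xs="{XS}" xmlns:dfdl="{DFDL}">
  <xs:element name="step">
    <xs:annotation>
      <xs:appinfo>
        <dfdl:repType>text</dfdl:repType>
        <dfdl:separator>,</dfdl:separator>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name="density" type="xs:float"/>
        <xs:element name="temp" type="xs:float"/>
        <xs:element name="pressure" type="xs:float"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def parse_with_schema(schema_text: str, record: str) -> ET.Element:
    """Split a text record using the separator and names declared in the schema."""
    ns = {"xs": XS, "dfdl": DFDL}
    decl = ET.fromstring(schema_text).find("xs:element", ns)
    separator = decl.find("xs:annotation/xs:appinfo/dfdl:separator", ns).text
    attr_name = decl.find("xs:complexType/xs:attribute", ns).get("name")
    child_names = [
        el.get("name")
        for el in decl.findall("xs:complexType/xs:sequence/xs:element", ns)
    ]
    fields = [field.strip() for field in record.split(separator)]
    # Assumption of this sketch: the first field maps to the attribute,
    # the remaining fields map to the sequence elements in order.
    out = ET.Element(decl.get("name"), {attr_name: fields[0]})
    for name, value in zip(child_names, fields[1:]):
        ET.SubElement(out, name).text = value
    return out


step = parse_with_schema(SCHEMA, "5, 9.35091E+02, 2.63227E+02, -6.20633E+07")
print(ET.tostring(step, encoding="unicode"))
```

Changing the separator or the element names now means editing the schema text only; the parsing function stays the same.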

Page 5: Defuddle Parser Design

An implementation of the DFDL specification

Page 6: Capabilities

Basic
  Binary/text parsing of simple types
  Basic math operations
  Looping
  Conditional logic
  Use of regular expressions for separators and terminators
  Input from multiple data sources

Advanced
  External translators
  Specify intermediate layers in the data which can be used for processing, but are not reflected in the output
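The "regular expressions for separators and terminators" capability can be illustrated with a small sketch. The two patterns and the function name `split_records` are assumptions chosen for the example, not patterns taken from Defuddle or the DFDL specification.

```python
import re

# Hypothetical patterns: fields are separated by a comma or semicolon with
# optional surrounding whitespace, or by a run of whitespace; records end
# at a line break.
SEPARATOR = re.compile(r"\s*[,;]\s*|\s+")
TERMINATOR = re.compile(r"\r?\n")


def split_records(text: str) -> list[list[str]]:
    """Split text into records by TERMINATOR, then into fields by SEPARATOR."""
    records = [rec for rec in TERMINATOR.split(text) if rec.strip()]
    return [SEPARATOR.split(rec.strip()) for rec in records]


rows = split_records("1, 2.5;3.0\r\n4   5.5 , 6\n")
print(rows)
```

A regular expression lets one description cover files whose delimiters vary between lines or between producers, which a single fixed separator character cannot do.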

Page 7: Parsing Complex Formats

Scientific formats that Defuddle capabilities have been demonstrated on:
  CHEMKIN solution file
  NWChem molecular dynamics property file
  NWChem electronic structure output file
  Microarray and protein-protein interaction spreadsheets
  Transformations within scientific workflows to avoid custom programming

Other formats that we would like to see handled in the future:
  HDF, JPEG, etc.

Page 8: What problems does Defuddle address?

Integrating different data formats, for collaboration of data generated before/without standardization.

Naming/identification of arbitrary file sub/super-structures

Long-term preservation and reading of data when the applications used to create it are no longer available.

Efficient, general data access capabilities
  Random access

Data virtualization
  Multiple descriptions of the same data
  Using DFDL and DFDL-1 as a general subsetting/transformation mechanism

Metadata extraction
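As an illustration of the random-access item, the sketch below shows why a declared record layout makes it possible to jump straight to one record instead of scanning the whole file. The binary layout (a 32-bit integer id followed by three 64-bit floats, little-endian) and the sample values are invented for the example and are not a format described in the talk.

```python
import struct

# Hypothetical fixed-size record: int32 id + three float64 values, little-endian.
RECORD = struct.Struct("<i3d")


def read_record(buf: bytes, index: int) -> tuple:
    """Return the record at a zero-based index without decoding earlier records."""
    return RECORD.unpack_from(buf, index * RECORD.size)


# Build five sample records (values are made up).
data = b"".join(
    RECORD.pack(i, 900.0 + i, 260.0 + i, -6.0e7 - i) for i in range(1, 6)
)
rec = read_record(data, 4)  # fifth record
print(rec)
```

Because the description fixes the record size, the byte offset of any record is a simple multiplication. Variable-length text records do not have this property, which is one of the challenges raised in the discussion slide.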

Page 9: Extracting metadata

SAM DFDL+XSLT

Benefits of automatic provenance/annotation capture
Example use: Microarray data – extracting header information
Application to provenance
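The header-extraction use case can be sketched as follows. The slide pairs DFDL with XSLT; because Python's standard library has no XSLT processor, this sketch substitutes ElementTree path queries for the XSLT step. The logical-model document, its element names, and its values are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical model of a microarray file, as a DFDL-style parser
# might produce it. None of these names or values come from the slides.
LOGICAL_MODEL = """
<microarray>
  <header>
    <field name="Scanner">ExampleScan</field>
    <field name="Date">2006-05-03</field>
  </header>
  <spots>
    <spot row="1" col="1">0.42</spot>
  </spots>
</microarray>
"""


def extract_metadata(xml_text: str) -> dict[str, str]:
    """Collect header fields from the logical model as name/value metadata."""
    root = ET.fromstring(xml_text)
    return {
        field.get("name"): (field.text or "").strip()
        for field in root.findall("./header/field")
    }


metadata = extract_metadata(LOGICAL_MODEL)
print(metadata)
```

Once the file has a logical model, metadata capture is a query over that model rather than a format-specific parser, so the same extraction step can be reused across formats.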

Page 10: Discussion

Challenges
  Efficient and generic: is it possible?
  Size
  Variable length text

Data virtualization: providing an abstract view of the data, independent of the underlying storage system
  Naming of data subsets: map a name to a reference into the logical model, not the physical one.

  E.g.: //step[5]/pressure

  <step id="5"> …
    <pressure>-6.20633E7</pressure>
  </step>
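Naming a data subset through the logical model can be tried out with the limited XPath support in Python's ElementTree. The `<run>` root element and the pressure values for steps 1 to 4 are placeholders invented for the sketch; only the step 5 value comes from the slide. Note that `step[5]` selects by position, while `step[@id='5']` selects by the `id` attribute; the two agree here only because the steps are numbered consecutively from 1.

```python
import xml.etree.ElementTree as ET

# Build a small logical model: five <step> elements under a hypothetical <run> root.
run = ET.Element("run")
for i in range(1, 6):
    step = ET.SubElement(run, "step", id=str(i))
    # Only the step 5 value is from the slide; the others are placeholders.
    ET.SubElement(step, "pressure").text = "-6.20633E7" if i == 5 else "0.0"

by_position = run.find(".//step[5]/pressure")    # fifth <step> among its siblings
by_id = run.find(".//step[@id='5']/pressure")    # <step> whose id attribute is 5

print(by_position.text, by_id.text)
```

The name refers to the logical structure, so it stays valid if the physical layout of the file changes, as long as the format description is updated to match.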

Page 11: Questions?

http://sdg.pnl.gov
http://defuddle.pnl.gov
http://forge.gridforum.org/projects/dfdl-wg
[email protected]