Mapping Physical Formats to Logical Models to Extract Data and Metadata
Tara Talbott, IPAW ‘06
The problem & solutions
- Wide range of files and formats
  - Standard formats
  - Prescriptive parsers
  - Arbitrary formats
- Machines need to merge, parse, and generally comprehend these various formats
- Potential solutions:
  - Data must adhere to a pre-specified format
  - Customized programs are written for each format and version
  - Users describe the format of their data and use tools to convert the data to a widely used, machine-understandable format (e.g. XML)
Descriptive Parser Solution: DFDL
Data Format Description Language (DFDL)
Uses XML Schema with DFDL-specific annotations to describe the underlying data and how to transform it into a logical model.
Example: “5, 9.35091E+02, 2.63227E+02, -6.20633E+07”
<step id="5">
  <density unit="kg/m**3">935.091</density>
  <temp>263.227</temp>
  <pressure>-6.20633E7</pressure>
</step>
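For contrast with the descriptive approach, the mapping above can be written by hand as a small custom program. This is a minimal sketch and not part of Defuddle: the function name `line_to_step` is hypothetical, and the field order (id, density, temp, pressure) and the fixed unit are taken from the example record on this slide.

```python
import xml.etree.ElementTree as ET


def line_to_step(line: str) -> ET.Element:
    """Map one comma-separated record to the <step> logical model.

    Assumes exactly four fields in the order: id, density, temp, pressure.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    step_id, density, temp, pressure = fields

    # The first field becomes the required integer attribute.
    step = ET.Element("step", {"id": str(int(step_id))})

    # The unit is fixed by the format description, not read from the data.
    density_el = ET.SubElement(step, "density", {"unit": "kg/m**3"})
    density_el.text = repr(float(density))

    ET.SubElement(step, "temp").text = repr(float(temp))
    ET.SubElement(step, "pressure").text = repr(float(pressure))
    return step


if __name__ == "__main__":
    record = "5, 9.35091E+02, 2.63227E+02, -6.20633E+07"
    print(ET.tostring(line_to_step(record), encoding="unicode"))
```

A program like this has to be rewritten for every format and version, which is the cost the descriptive approach tries to avoid.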
Example DFDL Schema
<element name="step">
  <xs:annotation>
    <xs:appinfo>
      <dfdl:repType>text</dfdl:repType>
      <dfdl:charset>UTF-8</dfdl:charset>
      <dfdl:separator>,</dfdl:separator>
    </xs:appinfo>
  </xs:annotation>
  <complexType>
    <attribute name="id" type="xs:integer" use="required"/>
    <sequence>
      <element name="density" type="xs:float">
        <complexType>
          <attribute name="unit" type="xs:string" fixed="kg/m**3"/>
        </complexType>
      </element>
      <element name="temp" type="xs:float"/>
      <element name="pressure" type="xs:float"/>
    </sequence>
  </complexType>
</element>
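To illustrate how a parser could be driven by such a description rather than by format-specific code, here is a minimal sketch. It is not Defuddle's implementation. Assumptions: the `dfdl` namespace URI is a placeholder, the schema is a simplified variant of the one on this slide (every tag prefixed, the `unit` attribute omitted, and the `id` attribute declared after the sequence), and the `id` attribute is read from the first field of the record.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DFDL = "urn:example:dfdl"  # hypothetical namespace URI, for illustration only
NS = {"xs": XS, "dfdl": DFDL}

# Simplified variant of the slide's schema (see assumptions above).
SCHEMA = f"""
<xs:element name="step" xmlns:xs="{XS}" xmlns:dfdl="{DFDL}">
  <xs:annotation>
    <xs:appinfo>
      <dfdl:repType>text</dfdl:repType>
      <dfdl:charset>UTF-8</dfdl:charset>
      <dfdl:separator>,</dfdl:separator>
    </xs:appinfo>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:element name="density" type="xs:float"/>
      <xs:element name="temp" type="xs:float"/>
      <xs:element name="pressure" type="xs:float"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:integer" use="required"/>
  </xs:complexType>
</xs:element>
""".strip()

# Only the two simple types used in the example are supported here.
CASTS = {"xs:float": float, "xs:integer": int}


def parse_record(schema_xml: str, raw: bytes) -> dict:
    """Parse one text record using the charset, separator, and field
    list found in the schema instead of hard-coding them."""
    root = ET.fromstring(schema_xml)
    appinfo = root.find("xs:annotation/xs:appinfo", NS)
    charset = appinfo.findtext("dfdl:charset", namespaces=NS)
    separator = appinfo.findtext("dfdl:separator", namespaces=NS)

    ctype = root.find("xs:complexType", NS)
    # Assumption: attributes come first in the record, then the sequence.
    fields = [(a.get("name"), a.get("type"))
              for a in ctype.findall("xs:attribute", NS)]
    fields += [(e.get("name"), e.get("type"))
               for e in ctype.findall("xs:sequence/xs:element", NS)]

    tokens = [t.strip() for t in raw.decode(charset).split(separator)]
    if len(tokens) != len(fields):
        raise ValueError("record does not match the described format")
    return {name: CASTS[typ](tok) for (name, typ), tok in zip(fields, tokens)}


if __name__ == "__main__":
    print(parse_record(SCHEMA, b"5, 9.35091E+02, 2.63227E+02, -6.20633E+07"))
```

Changing the separator or adding a field then means editing the description, not the parsing code.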
Defuddle Parser Design
An implementation of the DFDL specification
Capabilities
- Basic
  - Binary/text parsing of simple types
  - Basic math operations
  - Looping
  - Conditional logic
  - Use of regular expressions for separators and terminators
  - Input from multiple data sources
- Advanced
  - External translators
  - Specify intermediate layers in the data which can be used for processing, but are not reflected in the output
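As an illustration of why regular expressions help for separators and terminators, the sketch below splits text whose fields are separated inconsistently (commas, semicolons, or whitespace) and whose records end in either Unix or Windows line endings. The two patterns and the function name `split_records` are assumptions made for this example and are not taken from Defuddle.

```python
import re

# Field separator: a comma or semicolon with optional surrounding
# whitespace, or a run of whitespace on its own.
SEPARATOR = re.compile(r"\s*[,;]\s*|\s+")

# Record terminator: LF or CRLF.
TERMINATOR = re.compile(r"\r?\n")


def split_records(text: str) -> list[list[str]]:
    """Split text into records, then each record into fields."""
    records = []
    for raw in TERMINATOR.split(text):
        raw = raw.strip()
        if raw:  # skip the empty piece after the final terminator
            records.append(SEPARATOR.split(raw))
    return records


if __name__ == "__main__":
    print(split_records("1, 2;3\r\n4 5\t6\n"))
```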
Parsing Complex Formats
- Scientific formats that Defuddle capabilities have been demonstrated on:
  - CHEMKIN solution file
  - NWChem molecular dynamics property file
  - NWChem electronic structure output file
  - Microarray and protein-protein interaction spreadsheets
  - Transformations within scientific workflows to avoid custom programming
- Other formats that we would like to see handled in the future: HDF, JPEG, etc.
What problems does Defuddle address?
- Integrating different data formats, for collaboration on data generated before/without standardization
- Naming/identification of arbitrary file sub/super-structures
- Long-term preservation and reading of data when the applications used to create it are no longer available
- Efficient, general data access capabilities
  - Random access
- Data virtualization
  - Multiple descriptions of the same data
  - Using DFDL and DFDL⁻¹ as a general subsetting/transformation mechanism
- Metadata extraction
Extracting metadata
- SAM DFDL+XSLT
- Benefits of automatic provenance/annotation capture
- Example use: microarray data (extracting header information)
- Application to provenance
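To make the header-extraction use case concrete, here is a minimal hand-written sketch. It does not use DFDL or XSLT, and the file layout is an assumption: tab-separated `key<TAB>value` header lines, followed by a blank line, followed by the data table. The header keys in the sample are invented for illustration.

```python
def extract_header_metadata(lines):
    """Collect key/value pairs from the header block of a spreadsheet-like file.

    Assumed layout: 'key<TAB>value' lines, terminated by the first
    blank line (or the first line without a tab).
    """
    metadata = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            break  # blank line ends the header in this assumed layout
        key, sep, value = line.partition("\t")
        if not sep:
            break  # not a key/value line, so the header is over
        metadata[key.strip()] = value.strip()
    return metadata


if __name__ == "__main__":
    sample = [
        "Instrument\tdemo-scanner\n",   # hypothetical header keys
        "ScanDate\t2006-05-03\n",
        "\n",
        "Block\tColumn\tRow\tIntensity\n",
        "1\t1\t1\t532.0\n",
    ]
    print(extract_header_metadata(sample))
```

The resulting dictionary is the kind of record that could be attached to the data as annotation or provenance without the user typing it in.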
Discussion
- Challenges
  - Efficient and generic: is it possible?
  - Size
  - Variable-length text
- Data virtualization: providing an abstract view of the data, independent of the underlying storage system
  - Naming of data subsets: map a name to a reference in the logical model, not the physical one
  - E.g.: //step[5]/pressure
<step id="5"> …
  <pressure>-6.20633E7</pressure>
</step>
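The naming idea can be tried out with the limited XPath support in Python's standard library. This is a sketch under assumptions: the `<run>` root element and the first four pressure values are invented, and only the fifth step matches the example on this slide. Note that `step[5]` selects by position, while `step[@id='5']` selects by the `id` attribute; the two agree here only because the ids are sequential from 1.

```python
import xml.etree.ElementTree as ET


def build_run(pressures):
    """Build a logical model with one <step> per pressure value."""
    run = ET.Element("run")  # hypothetical root element
    for i, value in enumerate(pressures, start=1):
        step = ET.SubElement(run, "step", {"id": str(i)})
        ET.SubElement(step, "pressure").text = value
    return run


# Only the last value comes from the slide; the others are placeholders.
run = build_run(["-1.0E7", "-2.0E7", "-3.0E7", "-4.0E7", "-6.20633E7"])

# The name refers to the logical model, not to byte offsets in the file.
by_position = run.find(".//step[5]/pressure")
by_id = run.find(".//step[@id='5']/pressure")

if __name__ == "__main__":
    print(by_position.text, by_id.text)
```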
Questions?
http://sdg.pnl.gov
http://defuddle.pnl.gov
http://forge.gridforum.org/projects/dfdl-wg
[email protected]