AUKEGGS
Canberra, 2006-11-29
Exposing legacy file-based data (interop-for-files)
Andrew Woolf, CCLRC Rutherford Appleton Laboratory
Outline
• Introduction
• The feature model as integration key
• An interoperability approach for files
• xlink review and proposed profile for legacy data
• Examples
• Issues
Introduction
• Much ‘earth-science’ data exists as large legacy file-stores
– e.g. ECMWF: 2 Pb of file-based data
– e.g. British Atmospheric Data Centre: 40 Tb of file-based data
• Interoperability demands common approaches
• BUT, multitude of formats masks commonality
– netCDF, HDF4, HDF5, GRIB, NASA Ames, PP, ...
Introduction
• File-centred data management focusses on the container rather than content
• File API is fundamental point of reference
– binary format details not always exposed or guaranteed
– public API may be only supported access mechanism
– often implemented as performant optimised native library
• Conclusion: can’t/shouldn’t migrate
Introduction
• Want to expose information, not format...
Introduction
• Information structures may be composed across files
The feature model
• Common pattern with file-data:
– need to integrate information structures across multiple files
– (relational tables provide this implicitly)
• Semantics provide an integration key
– e.g. an oceanographer and meteorologist can share a conversation about data despite format differences
The feature model
A model for file-based interoperability
• Retain file-based persistence format
• Supplement with feature-based conceptual model
• ‘Cast’ legacy data onto conceptual model
– interoperableData = (featureModel) legacyData
• Legacy file data + GML-encoded conceptual ‘metadata’ = ‘interoperable view’
– may be exposed through W*S
A model for file-based interoperability
• GML provides conceptual feature ‘skeleton’
• File provides ‘flesh’
• GML ‘by-reference’ pattern for property values
– uses simple xlink
– “The value of a GML property that carries an xlink:href attribute is the resource returned by traversing the link”
xlink review
• extended xlink: [role] [title]
– local resources A, D: [role] [title] [label]
– remote resources B, C: [href] [role] [title] [label]
– arcs 1, 2, 3: [arcrole] [title] [show] [actuate]
xlink review
• simple xlink: [role] [title]
– local resource: [role] [title] [label]
– remote resource: [href] [role] [title] [label]
– arc: [arcrole] [title] [show] [actuate]
xlink review
• ‘role’ (URI):
– indicates a property of the remote resource
– must be a URI reference that “identifies some resource that describes the intended property”
• ‘arcrole’ (URI):
– describes the “meaning of the arc’s ending resource relative to its starting resource”
– corresponds to RDF notion of a property
• starting-resource HAS arc-role ending-resource
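The ‘starting-resource HAS arc-role ending-resource’ reading can be sketched in a few lines of Python. This is only an illustration: the `link_as_triple` helper, the `#this` label for the linking element, and the `example.org` arcrole URI are assumptions made for the example, not part of the xlink specification or of the talk.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"


def link_as_triple(element, start="#this"):
    """Read a simple xlink as (starting resource, arcrole, ending resource).

    For a simple xlink the starting resource is the linking element itself,
    labelled here with the hypothetical marker '#this'.
    """
    href = element.get(f"{{{XLINK}}}href")
    arcrole = element.get(f"{{{XLINK}}}arcrole")
    if href is None or arcrole is None:
        return None  # not enough information to state a property
    return (start, arcrole, href)


# Hypothetical property element; the arcrole URI is invented for illustration.
prop = ET.fromstring(
    '<prop xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:arcrole="http://example.org/hasContent" '
    'xlink:href="myfile.nc#lon"/>'
)
triple = link_as_triple(prop)
```

Read as an RDF-style statement, `triple` says that the linking element has the property named by the arcrole, with the resource at `xlink:href` as its value.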
xlink patterns for files
• extended xlink, GML feature instance: aggregation semantics determined by xlink arc traversal rules
xlink patterns for files
• simple xlink, GML feature instance: aggregation semantics determined by storage descriptor
xlink proposal
• href examples:
– netCDF#variable
– RDBMS#SQLQuery
– GRIBFile#recordNumber
– CSMLStorageDescriptor#arrayID
<someGMLElement
xlink:arcrole="hasRemoteContentEmbeddedAt#localXpath"
xlink:href="storageDescriptor#portion"
xlink:role="storageSchemaIdentifier"
xlink:show="embed"
xlink:actuate="onRequest | onLoad"/>
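A minimal sketch of how a client might read the proposed attributes, assuming the parts of `xlink:href` and `xlink:arcrole` are separated at the first `#`. The `FileLink` record and `parse_file_link` function are hypothetical names introduced for this illustration only.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"


@dataclass
class FileLink:
    storage: str         # part of xlink:href before '#'
    portion: str         # part of xlink:href after '#', e.g. a variable name
    storage_schema: str  # xlink:role, identifying the storage format
    semantics: str       # part of xlink:arcrole before '#'
    insert_xpath: str    # part of xlink:arcrole after '#'
    show: str            # xlink:show, 'embed' in the proposal


def parse_file_link(element):
    """Split the proposed xlink attributes into their components."""
    def attr(name):
        return element.get(f"{{{XLINK}}}{name}", "")

    storage, _, portion = attr("href").partition("#")
    semantics, _, insert_xpath = attr("arcrole").partition("#")
    return FileLink(storage, portion, attr("role"),
                    semantics, insert_xpath, attr("show"))


element = ET.fromstring(
    '<someGMLElement xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:arcrole="hasRemoteContentEmbeddedAt#localXpath" '
    'xlink:href="storageDescriptor#portion" '
    'xlink:role="storageSchemaIdentifier" '
    'xlink:show="embed"/>'
)
link = parse_file_link(element)
```

How the `portion` is interpreted (variable name, SQL query, record number, array ID) would depend on the format named by `storage_schema`.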
Example
• GML CR 06-160
– ISO 19123 CV_ReferenceableGrid
<gml:ReferenceableGrid gml:id="ID001" srsName="urn:ogc:def:crs:EPSG:6.6:4326" dimension="2">
  <gml:limits>
    <gml:GridEnvelope>
      <gml:low>0 0</gml:low>
      <gml:high>7 4</gml:high>
    </gml:GridEnvelope>
  </gml:limits>
  <gml:axisLabels>x y</gml:axisLabels>
  <gml:coordTransformTable>
    <gml:GridCoordinatesTable>
      <gml:gridOrdinate>
        <gml:GridOrdinateDescription>
          <gml:coordAxisLabel>Geodetic longitude</gml:coordAxisLabel>
          <gml:coordAxisValues>
            <gml:SpatialOrTemporalPositionList>
              <gml:coordinateList>13.5 24.9 32.4 37.7 41.5 46.8 54.4 65.7</gml:coordinateList>
            </gml:SpatialOrTemporalPositionList>
          </gml:coordAxisValues>
          <gml:gridAxesSpanned>x</gml:gridAxesSpanned>
          <gml:sequenceRule axisOrder="+1">Linear</gml:sequenceRule>
        </gml:GridOrdinateDescription>
      </gml:gridOrdinate>
      <gml:gridOrdinate>
        <gml:GridOrdinateDescription>
          <gml:coordAxisLabel>Geodetic latitude</gml:coordAxisLabel>
          <gml:coordAxisValues>
            <gml:SpatialOrTemporalPositionList>
              <gml:coordinateList>
                53.1 48.7 46.2 44.7 43.9 43.3 43.1 44.0
                46.2 43.2 41.5 40.6 40.2 40.0 40.3 41.7
                37.1 36.1 35.6 35.5 35.7 36.0 37.1 39.5
                30.4 30.2 30.4 30.7 31.1 32.0 33.8 37.2
                24.3 24.8 25.3 26.0 26.6 27.7 29.7 33.4
              </gml:coordinateList>
            </gml:SpatialOrTemporalPositionList>
          </gml:coordAxisValues>
          <gml:gridAxesSpanned>x y</gml:gridAxesSpanned>
          <gml:sequenceRule axisOrder="+1 -2">Linear</gml:sequenceRule>
        </gml:GridOrdinateDescription>
      </gml:gridOrdinate>
    </gml:GridCoordinatesTable>
  </gml:coordTransformTable>
</gml:ReferenceableGrid>
Example
• netCDF ASCII dump:
netcdf myfile {
dimensions:
    x = 8 ;
    y = 5 ;
variables:
    float lon(x) ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
    float lat(x,y) ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
    float temp(x,y) ;
        temp:coordinates = "lon lat" ;
        temp:long_name = "temperature" ;
        temp:units = "degC" ;
data:
    lon = 13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7 ;
    lat = 53.1, 48.7, 46.2, 44.7, 43.9, 43.3, 43.1, 44.0, 46.2, 43.2, 41.5, ...
Example
<gml:gridOrdinate>
  <gml:GridOrdinateDescription>
    <gml:coordAxisLabel>Geodetic longitude</gml:coordAxisLabel>
    <gml:coordAxisValues>
      <gml:SpatialOrTemporalPositionList>
        <gml:coordinateList srsName="WGS84">13.5 24.9 32.4 37.7 41.5 46.8 54.4 65.7</gml:coordinateList>
      </gml:SpatialOrTemporalPositionList>
    </gml:coordAxisValues>
    <gml:gridAxesSpanned>x</gml:gridAxesSpanned>
    <gml:sequenceRule axisOrder="+1">Linear</gml:sequenceRule>
  </gml:GridOrdinateDescription>
</gml:gridOrdinate>

<gml:coordAxisValues
    xlink:arcrole="http://ndg.nerc.ac.uk/xlinkUsage/insert#SpatialOrTemporalPositionList/coordinateList"
    xlink:href="myfile.nc#lon"
    xlink:role="http://ndg.nerc.ac.uk/fileFormat/netcdf"
    xlink:show="embed">
  <gml:SpatialOrTemporalPositionList>
    <gml:coordinateList srsName="WGS84"/>
  </gml:SpatialOrTemporalPositionList>
</gml:coordAxisValues>
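The by-reference fragment above could be resolved roughly as follows. This is a sketch under stated assumptions, not the NDG implementation: `FAKE_FILES` and `read_portion` stand in for a real netCDF reader, the GML namespace URI is assumed to be `http://www.opengis.net/gml`, and every step of the relative path after `#` in `xlink:arcrole` is assumed to name an element in that namespace.

```python
import xml.etree.ElementTree as ET

NS = {
    "gml": "http://www.opengis.net/gml",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical stand-in for reading a variable from a netCDF file.
FAKE_FILES = {
    "myfile.nc": {"lon": [13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7]},
}


def read_portion(storage, portion):
    """Return the values of one portion (variable) of a storage resource."""
    return FAKE_FILES[storage][portion]


def embed_remote_content(prop):
    """Insert the linked file content at the relative path given in arcrole."""
    xl = "{%s}" % NS["xlink"]
    storage, _, portion = prop.get(xl + "href").partition("#")
    _, _, rel_path = prop.get(xl + "arcrole").partition("#")
    # Assumption: each path step is a GML element name without a prefix.
    path = "/".join("gml:" + step for step in rel_path.split("/"))
    target = prop.find(path, NS)
    if target is None:
        raise ValueError("no element at relative path " + rel_path)
    target.text = " ".join(str(v) for v in read_portion(storage, portion))
    return prop


FRAGMENT = """
<gml:coordAxisValues xmlns:gml="http://www.opengis.net/gml"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:arcrole="http://ndg.nerc.ac.uk/xlinkUsage/insert#SpatialOrTemporalPositionList/coordinateList"
    xlink:href="myfile.nc#lon"
    xlink:role="http://ndg.nerc.ac.uk/fileFormat/netcdf"
    xlink:show="embed">
  <gml:SpatialOrTemporalPositionList>
    <gml:coordinateList srsName="WGS84"/>
  </gml:SpatialOrTemporalPositionList>
</gml:coordAxisValues>
"""

prop = embed_remote_content(ET.fromstring(FRAGMENT))
embedded = prop.find(
    "gml:SpatialOrTemporalPositionList/gml:coordinateList", NS
).text
```

After the call, the empty `coordinateList` holds the longitude values from the file, so the element matches the inline version shown first, while attributes already present on the target (here `srsName`) are left untouched.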
Issues
• Need to ‘get as close as possible’ to target
– ‘merge’ semantics consistent with GML? (Opportunity: no best practice for GML yet!)
• “If both a link and content are present in an instance of a property element, then the object found by traversing the xlink:href link shall be the normative value of the property. The object included as content shall be used by the data recipient only if the remote instance cannot be resolved; this may be considered to be a "cached" version of the object.” [GML 7.2.3.4]
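The precedence rule quoted from GML 7.2.3.4 might be applied by a data recipient along these lines. The sketch simplifies the property content to plain text, and the `property_value` function, the `resolve` callback, and the `REMOTE` lookup table are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def property_value(prop, resolve):
    """Return the normative value of a property element.

    The linked object wins when the link resolves; inline content is used
    only as a cached copy when the remote instance cannot be resolved.
    """
    inline = (prop.text or "").strip() or None
    href = prop.get(XLINK_HREF)
    if href is None:
        return inline
    try:
        return resolve(href)
    except (LookupError, OSError):
        return inline


# Hypothetical resolver: a dictionary stands in for traversing the link.
REMOTE = {"myfile.nc#lon": "13.5 24.9 32.4"}


def resolve(href):
    return REMOTE[href]  # raises KeyError (a LookupError) when unresolvable


XMLNS = 'xmlns:xlink="http://www.w3.org/1999/xlink"'
reachable = ET.fromstring(
    f'<prop {XMLNS} xlink:href="myfile.nc#lon">0.0 0.0 0.0</prop>'
)
unreachable = ET.fromstring(
    f'<prop {XMLNS} xlink:href="missing.nc#lon">0.0 0.0 0.0</prop>'
)

value_from_link = property_value(reachable, resolve)
value_from_cache = property_value(unreachable, resolve)
```

Here `value_from_link` comes from the resolver even though inline content is present, and `value_from_cache` falls back to the inline text because the link cannot be resolved.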
Issues
• xlink:href (URI) for remote resource fragment (format-specific)
– e.g. RDBMS#SQLQuery, netCDF#variable, etc...
• xlink:role (URI) for resource format
– e.g. reference PRONOM-type format repository?
• implied conversion to GML target content type
• xlink:arcrole (URI) for ‘embed remote content’ semantics
– ‘insert at relative XPath’ essential
• simple xlink can’t handle multiple resources
– application-specific ‘storage descriptor’ schemas for file aggregation semantics
Conclusion
• Presented a profile for xlink with files in absence of current best practice
• Meets key practical requirements
– retain file-based persistence formats
– provide interoperability ‘wrapper’
– focus on logical content, not container (feature model)
• Semantic governance at appropriate points
• Enables powerful, scalable mechanism for real data
– e.g. large meteorological datasets