Garching, June 2008
SimDAP
Simulation Data Access Protocol
Claudio Gheller, CINECA ([email protected])
SimDAP in a nutshell
Simulation Data Access Protocol, hereafter SimDAP, defines a standard to access numerical simulation outputs (theoretical data), hereafter Snapshots.
The goal of the SimDAP protocol is to preview and retrieve data found in a previous search phase.
Since data could be huge, the SimDAP service can provide solutions to download ONLY the data of interest, reducing the communicated data volume.
The SimDAP protocol describes the interface to the “data shrink” services.
The result of any SimDAP operation is a reference to one or more data files.
SimDAP examples
Search for simulations with Lambda>0.7
I like this one
It’s too large !!!
Let’s select a sub-region!!!
Metadata VOTable
Binary data file
Data is too large!!!
Extract a sub region… it is still large
Perform the analysis on-site
Finally I have a jpeg… cannot be too large!!!
SimDAP target data
SimDAP deals with Snapshots
Generally speaking, Snapshots are RAW data produced by a numerical model
In principle any set of M physical parameters in an N-dimensional phase space can be the object of SimDAP.
For simplicity, we have started by considering data that represent a spatial distribution of physical quantities at different time steps. Support for space and time is therefore assumed by default.
E.g.:
o x, y and z coordinates of a set of particles at various evolutionary times
o The temperature over a computational mesh
o The X-ray luminosity derived from temperature and density (direct outcome of the simulations)
SimDAP data model
SimDAP adopts SimDB as the standard data model.
SimDB is essential in the discovery phase (not part of SimDAP), which provides the basic input parameters to SimDAP
o The experiment id (the simulation)
o The result id (the snapshots)
o The data provider id (as registered)
The result of the SimDAP operation is a reference to one or more files.
The result may be delivered as a VOTable describing, in terms of SimDB, the outcome of the SimDAP operation and containing the references to the data files.
There is presently no precise standard for the data files. Exploratory work is in progress on this topic.
SimDAP protocol
SimDAP does NOT specify anything about the implementation of the related services. This is up to the service provider.
SimDAP defines only the (standard) interface to the (web) service. This means that the following items will be standard:
o Service goal (what it does)
o Input parameters (what is needed to run the service)
o Results (what is returned by the service)
Custom services are supported, BUT they must be fully described, possibly via registry
SimDAP services
At present the following services are expected to be part of the protocol:
o Preview
o Download
o Cutout
o Custom
Each service MUST support a METADATA function which returns the input parameters supported by the service.
This information can be used either by client applications (in particular for custom services) or by the registry, for users searching for services according to their capabilities.
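As an illustration only, the METADATA function could be pictured as follows. This is a minimal sketch: the class names, the parameter names SIMULATION_ID and SNAPSHOT_ID, and the ordering rule are assumptions made for the example, not part of the protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    """One input parameter advertised by a service (hypothetical structure)."""
    name: str
    mandatory: bool
    description: str = ""


@dataclass
class ServiceDescription:
    """What a service publishes: its goal, its inputs and its results."""
    kind: str      # preview, download, cutout or custom
    goal: str
    result: str
    parameters: list = field(default_factory=list)

    def metadata(self):
        """Return the supported input parameters, mandatory ones first."""
        return sorted(self.parameters, key=lambda p: (not p.mandatory, p.name))


cutout_service = ServiceDescription(
    kind="cutout",
    goal="extract a rectangular sub-region of a snapshot",
    result="VOTable referencing one or more data files",
    parameters=[
        Parameter("FIELD", False, "physical quantities to return"),
        Parameter("SIMULATION_ID", True),  # assumed parameter name
        Parameter("SNAPSHOT_ID", True),    # assumed parameter name
    ],
)

names = [p.name for p in cutout_service.metadata()]
print(names)
```

A client or a registry harvester could call such a function once per service to learn which inputs it may send.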
Preview: goals
The result of the SimDB search is a list of simulations and/or snapshots.
There is NO easy and/or standard way to understand if the content of the snapshot is fine for you.
However, you cannot download all the hits to check them.
The PREVIEW service allows you to have a pre-defined view of one or more snapshots.
Possible preview services can be based on:
o Selection and download of a subset of the whole snapshot (randomized, decimated…)
o visualization of the data, by 3D interactive rendering of sampled data, or orthogonal projections,
o statistical analysis
o Object catalogues (e.g. cluster of galaxies identified in a cosmological simulation)
o …
All these functionalities could act on precomputed information or interactively.
Preview: input parameters, result
The only mandatory input parameters are:
o Simulation id
o Snapshot id
Further parameters can be specified and published by the service. They allow the user to customize the preview service.
If multiple preview functionalities are implemented, each is treated as a separate service.
The output is heterogeneous. If it is a file (decimated/reduced dataset), it must have the standard TVO format (VOTable+binary).
Download: goals
Once the snapshots of interest have been identified, the user can decide to download them.
Two possible solutions:
o Direct download: the user gets the data file as it is. This is part of the SimDB protocol. No further actions are required on the data.
o SimDAP download: the user gets back the snapshot in the standard TVO file format (VOTable+binary). A further operation may be supported and applied: field selection. This operation allows the user to download only the physical quantities of actual interest.
Download: input parameters, result
If only the direct download is available, the reference to the file is enough. However this is not strictly part of the SimDAP protocol.
The only mandatory input parameters are:
o Simulation id
o Snapshot id
The FIELD parameter has to be supported if field selection is available.
Further parameters can be specified and published by the service. They allow the user to customize the download service (e.g. automatic format or endianness conversions).
The output is always a file. The expected format is the TVO format (VOTable+binary), unless explicitly specified.
Cutout: goals
o Data could be too large to be moved from the server.
o The user could be interested only in a small fraction of the data
The cutout service lets the user focus on a region of interest, extracting the corresponding data and downloading the resulting file.
In principle the cutout could be of any shape. For simplicity, SimDAP only deals with 3D rectangular selections, identified by a 3D point (a vertex or the center of the selection region) and the size of the selection box in the 3 coordinate directions.
The cutout can be completely different according to the data: regular meshes, AMR, point-like/unstructured data.
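For point-like data, the rectangular selection described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names and the choice of a half-open box are assumptions, and mesh or AMR data would need a different treatment.

```python
def in_box(point, vertex, size):
    """True if the point lies in the half-open box [vertex, vertex + size)."""
    return all(v <= p < v + s for p, v, s in zip(point, vertex, size))


def vertex_from_center(center, size):
    """Convert a box given by its center into the equivalent lower vertex."""
    return tuple(c - s / 2.0 for c, s in zip(center, size))


def cutout(points, vertex, size):
    """Keep only the points that fall inside the selection box."""
    return [p for p in points if in_box(p, vertex, size)]


particles = [(0.5, 0.5, 0.5), (2.0, 2.0, 2.0), (9.0, 1.0, 1.0), (3.5, 2.5, 1.5)]
selected = cutout(particles, vertex=(1.0, 1.0, 1.0), size=(3.0, 2.0, 2.0))
print(selected)
```

The helper vertex_from_center covers the case where the 3D point identifying the region is its center rather than a vertex.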
Cutout: input parameters, result
The Cutout service requires different classes of inputs
Source parameters
o Simulation id
o Snapshot id
Physical quantities selection parameters:
o FIELD
Cutout fields parameters and corresponding units:
o COORD_X, COORD_Y, COORD_Z
o UNITS
Selection region parameters:
o VERT_X, VERT_Y, VERT_Z
o SIZE_X, SIZE_Y, SIZE_Z
Further parameters can be specified and published by the service. They allow the user to customize the cutout service.
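The parameter classes listed above could be assembled into a request as in the sketch below. Several points are assumptions made only for illustration: the HTTP query-string encoding, the base URL, the names SIMULATION_ID and SNAPSHOT_ID, and the comma-separated FIELD value. Only FIELD, COORD_*, UNITS, VERT_* and SIZE_* come from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_cutout_url(base_url, simulation_id, snapshot_id, fields,
                     coords, units, vertex, size):
    """Assemble a cutout request as a query string (hypothetical encoding)."""
    params = {
        # Source parameters (names are assumptions)
        "SIMULATION_ID": simulation_id,
        "SNAPSHOT_ID": snapshot_id,
        # Physical quantities to extract
        "FIELD": ",".join(fields),
        # Fields holding the coordinates used for the cutout, and their units
        "COORD_X": coords[0], "COORD_Y": coords[1], "COORD_Z": coords[2],
        "UNITS": units,
        # Selection region: one vertex plus the box size along each axis
        "VERT_X": vertex[0], "VERT_Y": vertex[1], "VERT_Z": vertex[2],
        "SIZE_X": size[0], "SIZE_Y": size[1], "SIZE_Z": size[2],
    }
    return base_url + "?" + urlencode(params)


url = build_cutout_url(
    "http://example.org/simdap/cutout",  # placeholder URL
    "sim42", "snap007", ["temperature", "density"],
    ("x", "y", "z"), "Mpc", (10.0, 20.0, 30.0), (5.0, 5.0, 5.0),
)
query = parse_qs(urlsplit(url).query)
print(url)
```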
The UNITS problem
The Cutout function requires the knowledge of the cutout units…
Example:
The user needs to extract all the data inside a simulated volume of a cosmological simulation. He wants to use “natural” units to identify the vertex position and the box size: Mpc
However data could be stored in different units (e.g. kpc or cm!!!).
In order to make the cutout possible two basic operations MUST be accomplished:
1. The server MUST “send” the units to the client (or conversion factors to some “natural” units);
2. The client, using the units, MUST convert the input parameters.
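The client-side conversion in step 2 might look like the sketch below. It assumes the server has reported the unit in which coordinates are stored, and that the user works in Mpc. The table of factors and the function name are hypothetical. The cm factor uses the approximate value 1 Mpc ≈ 3.0857e24 cm.

```python
# How many stored units correspond to one Mpc (the user's "natural" unit).
STORED_UNITS_PER_MPC = {
    "Mpc": 1.0,
    "kpc": 1.0e3,
    "cm": 3.0857e24,  # approximate value
}


def convert_box(vertex_mpc, size_mpc, stored_unit):
    """Convert a selection box given in Mpc into the units used by the data."""
    factor = STORED_UNITS_PER_MPC[stored_unit]
    vertex = tuple(v * factor for v in vertex_mpc)
    size = tuple(s * factor for s in size_mpc)
    return vertex, size


# The user asks for a 5 Mpc box; the server stores coordinates in kpc.
vertex, size = convert_box((10.0, 20.0, 30.0), (5.0, 5.0, 5.0), "kpc")
print(vertex, size)
```

The converted values are the ones the client would send as VERT_* and SIZE_*.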
Cutout tools
The Cutout function requires proper tools to select the region of interest.
The tools can be the same as (or derived from) those used for the preview.
An example using VisIVO…
Cutout results
The Cutout result is DATA.
The result data is characterized by raw data and metadata.
The latter are organized as a VOTable (in general, an XML file).
The VOTable describes the data and contains the acref parameter(s) to one (or more) file(s) containing the raw data.
The raw data may not be immediately available (access to secondary storage devices, CPU-demanding operations…). In this case DATA STAGING is necessary.
Custom Services and Service Registration
Custom services are supported. In this case the complete description of the service must be available as a registry entry.
In general the SimDAP service is to be registered. This means:
o Publish information about the service name and owner
o Publish the URL of the service
o Publish the available services (preview, download, cutout, custom…)
VOTables example: 1
VOTable for the velocity field of a fluid on a fixed 3D mesh
<RESOURCE name="myVectorField" type="results" >
<DESCRIPTION>Velocity Field from N-Body run</DESCRIPTION>
<INFO name="QUERY_STATUS" value="OK"/>
<TABLE name="VelocityField" ID="Vel" order="sequential" arraysize="41x41x41" >
<FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x" datatype="float" unit="km/s" />
<FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y" datatype="float" unit="km/s"/>
<FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z" datatype="float" unit="km/s"/>
<DATA>
<BINARY>
<STREAM acref="file:///scratch/myhome/test.bin"/>
</BINARY>
</DATA>
</TABLE>
</RESOURCE>
</VOTABLE>
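A client could read such a metadata file with a standard XML parser, as in the sketch below. It embeds a self-contained copy of the example above, with an opening VOTABLE tag added so that the text parses. The function name and the returned dictionary layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

VOTABLE_TEXT = """<VOTABLE>
<RESOURCE name="myVectorField" type="results">
<DESCRIPTION>Velocity Field from N-Body run</DESCRIPTION>
<INFO name="QUERY_STATUS" value="OK"/>
<TABLE name="VelocityField" ID="Vel" order="sequential" arraysize="41x41x41">
<FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x" datatype="float" unit="km/s"/>
<FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y" datatype="float" unit="km/s"/>
<FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z" datatype="float" unit="km/s"/>
<DATA><BINARY><STREAM acref="file:///scratch/myhome/test.bin"/></BINARY></DATA>
</TABLE>
</RESOURCE>
</VOTABLE>"""


def describe_tables(votable_text):
    """Return, for each TABLE, its layout, its fields and its data reference."""
    root = ET.fromstring(votable_text)
    tables = []
    for table in root.iter("TABLE"):
        fields = [(f.get("name"), f.get("datatype"), f.get("unit"))
                  for f in table.iter("FIELD")]
        stream = table.find("./DATA/BINARY/STREAM")
        reference = None
        if stream is not None:
            # The slides use acref in one example and href in the other.
            reference = stream.get("acref") or stream.get("href")
        tables.append({
            "name": table.get("name"),
            "order": table.get("order"),
            "arraysize": table.get("arraysize"),
            "fields": fields,
            "reference": reference,
        })
    return tables


tables = describe_tables(VOTABLE_TEXT)
print(tables[0]["reference"])
```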
VOTables example 2
VOTable for the temperature field of a mesh based quantity and the position of N-Body particles extracted from the same spatial region.
<RESOURCE name="myMixedData" type="results">
<INFO name="QUERY_STATUS" value="OK"/>
<TABLE name="Particles" ID="NBody" order="sequential" arraysize="100000">
<FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x" datatype="float" unit="Mpc"/>
<FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y" datatype="float" unit="Mpc"/>
<FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z" datatype="float" unit="Mpc"/>
<DATA>
<BINARY>
<STREAM href="file:///scratch/myhome/particles.bin"/>
</BINARY>
</DATA>
</TABLE>
<TABLE name="Mesh" ID="MeshTemp" order="sequential" arraysize="41x41x41">
<FIELD name="temperature" ID="temp" ucd="phys.temperature;pos.cartesian" datatype="float" unit="K"/>
<DATA>
<BINARY>
<STREAM href="file:///scratch/myhome/mesh.bin"/>
</BINARY>
</DATA>
</TABLE>
</RESOURCE>
</VOTABLE>
Raw data file formats
Data file formats can be different according to their usage
Archive side files should be
• High performance (fast access)
• Standard (portable and persistent)
Result files should be
• Simple (specific I/O libraries are not required to access them)
• Self descriptive (e.g. XML metadata headers)
• Compressed (to minimize transfer effort)
In any case, data size is crucial. ASCII files are “deprecated”. Base64 (or similar) encoding for HTTP transfers is to be avoided: it wastes time (for conversions) and space (increased size).
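The size penalty of Base64 is easy to quantify: every 3 bytes of binary data become 4 ASCII characters. A short check, using an arbitrary 3072-byte buffer as stand-in data:

```python
import base64

raw = bytes(range(256)) * 12          # 3072 bytes of stand-in binary data
encoded = base64.b64encode(raw)       # Base64 text representation

overhead = len(encoded) / len(raw)    # size ratio, encoded versus raw
print(len(raw), len(encoded), round(overhead, 3))
```

The encoded payload is one third larger than the raw one, before counting the time spent encoding and decoding.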
Result files
A simple solution is represented by raw binary files with the following characteristics.
• A file can store several variables
• Each variable represents a scalar quantity
• Components of multidimensional quantities are stored as separate variables
• Variables have the same number of elements but they can have different types
• Variables can be stored either as Tabular or as Sequential (see next slide)
• A descriptor file (XML) is associated to the binary to make it self-descriptive
Advantages: (little) standardization, simplicity, no I/O specific libraries required, fast access
Drawbacks: limited portability (endianness problem, data types), little standardization, no compression
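The endianness problem can be seen with a single value. The sketch below writes the same float in both byte orders and then reads the big-endian bytes with the wrong assumption. It illustrates why a descriptor file would need to record the byte order; it does not describe any agreed descriptor format.

```python
import struct

value = 1.0
big = struct.pack(">f", value)      # 4 bytes, most significant byte first
little = struct.pack("<f", value)   # same value, least significant byte first

# The two encodings hold the same bytes in opposite order.
same_bytes_reversed = big == little[::-1]

# Reading with the correct byte order recovers the value ...
right = struct.unpack(">f", big)[0]
# ... while reading with the wrong one silently yields a different number.
wrong = struct.unpack("<f", big)[0]

print(same_bytes_reversed, right, wrong)
```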
Result files: Tabular vs Sequential
Tabular files are closer to observational data, and therefore more compatible with the standard VOTable idea.
If the file contains the 3 variables vx, vy, vz, their Tabular storage is:
vx(1), vy(1), vz(1)
vx(2), vy(2), vz(2)
…
vx(N), vy(N), vz(N)
This is suitable for variables (like the components of a vector) which are always accessed as an N-tuple, or for data analysis tools which need (and load) all the stored variables.
However, it leads to poor performance if variables have to be loaded separately into memory. Loading one variable requires continuous jumps through the file.
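The Tabular layout and its access pattern can be sketched as follows. The 4-byte little-endian float encoding and the function names are assumptions chosen for the example.

```python
import struct

# One record holds vx, vy, vz as three little-endian 4-byte floats.
RECORD = struct.Struct("<3f")


def write_tabular(vx, vy, vz):
    """Pack the variables row by row: vx(1), vy(1), vz(1), vx(2), ..."""
    return b"".join(RECORD.pack(*row) for row in zip(vx, vy, vz))


def read_column(buffer, column):
    """Extract one variable; every element needs a jump to a new record."""
    count = len(buffer) // RECORD.size
    return [RECORD.unpack_from(buffer, i * RECORD.size)[column]
            for i in range(count)]


data = write_tabular([1.0, 2.0, 3.0], [0.5, 0.25, 0.125], [-1.0, -2.0, -3.0])
first_row = RECORD.unpack_from(data, 0)   # a whole N-tuple in one access
vy = read_column(data, 1)                 # one variable: strided access
print(first_row, vy)
```

Reading a full record is a single contiguous access, while read_column touches every record in the buffer to collect one variable.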
Result files: Tabular vs Sequential (continued)
Sequential files are a common choice for “simulators”
If the file contains the 2 variables rho and press, their Sequential storage is:
rho(1)
rho(2)
…
rho(N)
press(1)
press(2)
…
press(N)
Each variable can be read with a single I/O call. This leads to high-performance access to the file, which is typically required when dealing with large files.
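A minimal sketch of the Sequential layout follows, using an in-memory stream in place of a file. The 4-byte floats in native byte order and the function names are assumptions of the example.

```python
import io
from array import array


def write_sequential(stream, variables):
    """Write each variable as one contiguous block: rho(1..N), press(1..N)."""
    for values in variables:
        stream.write(array("f", values).tobytes())


def read_variable(stream, index, n):
    """Read the index-th variable (n elements) with one seek and one read."""
    item_size = array("f").itemsize
    stream.seek(index * n * item_size)
    block = stream.read(n * item_size)
    values = array("f")
    values.frombytes(block)
    return values.tolist()


rho = [1.0, 2.0, 4.0]
press = [0.5, 0.25, 0.125]

stream = io.BytesIO()
write_sequential(stream, [rho, press])
press_back = read_variable(stream, 1, len(press))
print(press_back)
```

The offset of a variable depends only on its position and on N, so no per-element jumps are needed.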
Archive files
Archive files are not “visible” to the end user. Therefore the data provider can choose any suitable format.
The choice should be in general driven by several properties:
• The format should be standard and well supported, in order to ensure the preservation of the data, their portability between different computing platforms, software, compilers... (if the technology changes we don’t want to change the data)
• The files should be fast and efficiently accessible, since data is large and complex operations could be necessary to handle it (e.g. extract the particles which fall in a certain region)
Various formats, with such features, are available.
File formats: HDF5
HDF5 (http://hdf.ncsa.uiuc.edu) represents a possible solution to deal with such data.
HDF5 is
• Portable across most modern platforms
• High performance
• Well supported
• Well documented
• Rich in tools
• Flexible and extensible
HDF5 data files are
• Platform independent (portable)
• Well organized
• Self defined
• Metadata enriched
• Efficiently accessible
HDF5 drawbacks
• Requires some expertise and skill to be used
• Information is difficult to access
• Can be subject to major library changes (see HDF4 to HDF5)
VisIVO Server Services for TVO
TVO archive
Visualization Web Services
Customizable data view
Visualization Web Service
VisIVO Web Service has been realized using the SOAP engine AXIS.
VisIVO Server
You can write a client application using JAVA or C++
The ITVO web portal is a client application
The service implements a data staging mechanism for the VisIVO Server outcomes (.png files).
Developer guidelines: web services
The ITVO web portal describes the web service classes using Class Diagrams and by publishing the Java code.
ITVO Web Services are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License, version 3.