
Garching, April 2007

The Simple Numerical Access Protocol (SNAP)

for theoretical data

Claudio Gheller, CINECA ([email protected])


1. OVERVIEW


Simple Numerical Access Protocol - SNAP

The Simple Numerical Access Protocol (hereafter SNAP) defines a standard for accessing numerical simulation outputs.

Data can be the outcome of different kinds of numerical applications.

However, SNAP is designed to address numerical simulation outputs organized as follows:

• For each timestep, the information must be sampled in a generic 3D space

• Positions in this volume are called x, y and z.

• The sampling can be regular (e.g. Cartesian mesh) or irregular (e.g. particle or adaptive mesh positions). Each mesh/particle position in the 3D space hosts the same physical quantities (e.g. mass, density, velocity) at each timestep.

Ultimately, the sampling volume does NOT need to be geometric or even 3D. It could be any N-dimensional set of variables on which a meaningful SNAP operation can be performed. Conditional queries could also be supported (e.g. extract data from a given region with temperature higher than a given value).

However, for simplicity, we start by dealing with geometric 3D operations.


SNAP main stages

1. Search for available simulations and data. The query is on metadata. The result is an XML document (maybe VOTable) with matching result metadata.

2. Identification of subset of interest. The user identifies and sets a subset of the full simulation data which is of interest. This subset is defined both in time and in space.

3. SNAP request. The selection parameters for the SNAP action are sent to the server.

4. Data staging and delivery. Metadata and data are delivered (possibly after some time, needed for extraction) via HTTP or FTP, with the data as binary files.

5. Service registration. SNAP services need to be published in an available registry. Registry inquiries must be performed according to the SNAP data model.

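[Figure: example SNAP session. The user searches for simulations with Lambda > 0.7 and picks one ("I like this one"); the dataset is too large to download in full ("It's too large!"), so a SNAP request is issued ("Let's SNAP it!"), which returns a metadata VOTable and a binary data file.]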


Data levels

Mimicking observational data, simulated data can be organized in 3 levels

Level 0: direct outcome of the simulation. Examples are the coordinates and velocities of particles in an N-Body simulation, the density field on the computational mesh of a jet simulation, etc.

Level 1: data extracted or derived from the simulation results, having the same characteristics as the simulation results themselves. For example, the coordinates of the points that make up a galaxy cluster extracted from a cosmological simulation using a friends-of-friends algorithm.

Level 2: results that have been obtained after an analysis process from Level 0 and Level 1 data. Examples are projected maps, statistical functions, Virtual Telescope applications.

SNAP deals with Level 0 and Level 1 data.


SNAP and Data levels

The SNAP protocol deals with Level 0 and Level 1 data. It specifies the following services:

• retrieval of the entire simulation outcome (the particle positions and velocities within the simulation box, or the physical quantities at each grid point), known as a snapshot, at one or more timesteps (this is not a simple download!)

• retrieval of a specific subset or subvolume of a simulation (e.g. all the particles/grid-points within a certain region)


SNAP in action: an example

http://www.astrocomp.it/itvo

Implemented by the Astrophysical Observatory of Catania (Becciani U., Costa A., Costa V.)

Developed according to first Theoretical Data Model Prototype together with a similar implementation in Trieste

Available:

• Simulation discovery

• VOTable download

• Thumbnails

• GetSnap

Ask Ugo for details…


Requirements for compliance

The SNAP service MUST be implemented according to a minimal set of characteristics. Each of the defined characteristics should be developed according to the specifications of the SNAP documents.

1. The service MUST support a Simulation Selection service. The SNAP service MUST provide tools to select the datasets and proceed with the following steps of the SNAP procedure.

2. The SNAP service MUST support a getUnits method (or getFields method… to be discussed). This method allows clients to get the list of units associated with the available fields.

3. The Sub-Volume Extraction method SHOULD be supported. If supported, a getThumb method MUST be available.

4. The setSnap method MUST be supported. This method allows clients to submit a SNAP operation.


Requirements for compliance (cont.ed)

6. The data retrieval (getSnap) method MUST be supported. This method allows clients to retrieve single simulation snapshots and cutouts.

7. The SNAP service MUST be registered by providing the information which describes the available functions. Registration allows clients to use a central registry service to locate compliant simulation access services and select an optimal subset of services to query, based on the characteristics of each service and the simulation data collections it serves.

8. Job management request methods, getSnapInfo, cancelSnap, MAY be supported. These methods allow users to inquire about the status of a submitted request and, possibly, to cancel it.


2. SIMULATION SELECTION AND UNITS


Simulation discovery

Available simulations are returned as the result of a query based on a set of physical and technical parameters which to some degree specify the type of simulation of interest to the user.

These parameters can be general or specific to the discipline or research field of interest. The details of the search criteria and execution are not part of the SNAP protocol implementation


SNAP Model

The SNAP service MUST provide tools to select the datasets of interest and proceed with the following steps of the SNAP procedure.

Selection tools must be developed according to the SNAP data model presented by GL. Results must be described according to the same model:

http://www.ivoa.net/twiki/bin/view/IVOA/IVOATheorySimulationDatamodel


Data Units

Data are stored in the archives with specific units, which can be retrieved by the client by a getUnits() method.

The client can present the data in any suitable unit.

However, the client MUST convert any quantity into server-side units before submitting any request. E.g., the center of a computational volume of an N-body cosmological simulation can be specified in Mpc by the client, since this is familiar to the user. However, particle coordinates could be represented by the simulation code in the [0, 1] range. Therefore, the center position must be converted by the client to this internal representation.
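The conversion described above can be sketched as follows. This is only an illustration: the function name and the 100 Mpc box size are assumptions, and a real client would take the scaling from the service metadata.

```python
def to_server_units(pos_mpc, box_size_mpc):
    """Convert a position in Mpc to normalized [0, 1] box coordinates."""
    return tuple(c / box_size_mpc for c in pos_mpc)

# Hypothetical case: a 100 Mpc box, region center chosen by the user in Mpc.
center_server = to_server_units((30.0, 25.0, 90.0), box_size_mpc=100.0)

# Format the converted center as a POS parameter value.
pos_param = "POS=" + ",".join(f"{c:g}" for c in center_server)
print(pos_param)
```

The user keeps working in Mpc; only the value placed in the request is expressed in the server's internal representation.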


getUnits method

In order to submit a getUnits request, the following parameters must be specified and passed to the server:

DATASERVICE and DATASOURCE

which specify the service and the data object (described later)

FIELDS

which specifies the quantities for which we need units. The parameter is described later.

The getUnits method returns a string in which units are listed in the same order as the quantities specified in the FIELDS parameter, e.g.:

FIELDS="xposition,yposition,zposition,velocity,temperature"

UNITS_RESULT="Mpc,Mpc,Mpc,km sec-1,K"
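On the client side, the positional correspondence between FIELDS and the returned units can be turned into a lookup table. The helper below is a hypothetical sketch and assumes that unit strings never contain commas.

```python
def pair_units(fields, units_result):
    """Map each requested field name to the unit returned at the same position."""
    names = fields.split(",")
    units = units_result.split(",")
    if len(names) != len(units):
        raise ValueError("FIELDS and UNITS_RESULT have different lengths")
    return dict(zip(names, units))

units = pair_units(
    "xposition,yposition,zposition,velocity,temperature",
    "Mpc,Mpc,Mpc,km sec-1,K",
)
print(units["velocity"])
```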


getFields method

Alternatively, a getFields() method could be implemented, which returns all the available fields together with the corresponding units (in this case some unnecessary information may be transmitted).

In order to submit a getFields request, the following parameters must be specified and passed to the server:

DATASERVICE and DATASOURCE

which specify the service and the data object (described later)

The getFields method returns a string in which fields and corresponding units are listed, e.g.:

FIELDS_RESULT="xposition,Mpc,yposition,Mpc,zposition,Mpc,velocity,km sec-1,temperature,K"
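A client could split such an interleaved string into (field, unit) pairs as sketched below. The function name is an assumption, and the sketch relies on neither field names nor units containing commas.

```python
def parse_fields_result(fields_result):
    """Split an interleaved 'name,unit,name,unit,...' string into pairs."""
    tokens = fields_result.split(",")
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of comma-separated tokens")
    # Even positions hold field names, odd positions the matching units.
    return list(zip(tokens[0::2], tokens[1::2]))

pairs = parse_fields_result(
    "xposition,Mpc,yposition,Mpc,zposition,Mpc,velocity,km sec-1,temperature,K"
)
print(pairs)
```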


3. SUBSET SELECTION


Subset selection

The Snap request must be submitted according to the prescription presented (and discussed) later. A data cutout service could be implemented. This allows the user to focus on interesting regions/features of the simulation and to select and download only the related data. If it is implemented, the service MUST provide tools to enable the client to specify the size and position of the subset.

In general, size and position can be any N-tuple of parameters in an N-D phase space.

For simplicity we will focus on geometrical 3D examples.


The “thumbnail”

“A miniature representation of a page or image that is used to identify a file by its contents”

The thumbnail is a representative, but much smaller (with respect to the data size), realization of the whole dataset. Since the whole dataset is not downloadable or directly “usable”, this is a way to have a light interaction with the data

The thumbnail data and the associated tools should allow the user to perform all the necessary operations to select the region for the cutout

We cannot identify a unique thumbnail tool. It depends on the scientific field, on the data, its geometry, its meaning… In the following slides some examples…


Geometrical sampling

1. The thumbnail is a decimated set of data. Decimation can be obtained by random selection (e.g. for N-Body particles) or by averaging neighbouring cells (mesh simulations).

2. The thumbnail preserves the dimensionality and geometry of the original dataset.

3. An appropriate application (web based, visualization tool…) is used to set the selected region.
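The two decimation strategies mentioned in this list could look like the sketch below. It uses plain Python lists to stay self-contained; the function names, the fixed random seed and the requirement that the mesh is cubic with a side divisible by the coarsening factor are all assumptions made for illustration.

```python
import random

def decimate_particles(particles, fraction, seed=0):
    """Keep a random subset of the particles (at least one)."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(particles) * fraction))
    return rng.sample(particles, n_keep)

def coarsen_mesh(mesh, factor):
    """Average factor**3 blocks of a cubic mesh stored as mesh[i][j][k]."""
    n = len(mesh) // factor
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                total = 0.0
                for di in range(factor):
                    for dj in range(factor):
                        for dk in range(factor):
                            total += mesh[i * factor + di][j * factor + dj][k * factor + dk]
                out[i][j][k] = total / factor ** 3
    return out

particles = [(i * 0.001, 0.5, 0.5) for i in range(1000)]
thumb_particles = decimate_particles(particles, fraction=0.25)

mesh = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
thumb_mesh = coarsen_mesh(mesh, factor=2)
```

Both results keep the dimensionality of the original data: the particle thumbnail is still a list of 3D positions, and the mesh thumbnail is still a 3D grid, only smaller.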


Projections and Cutting Planes

1. The thumbnail is an (N-1)-dimensional representation of the dataset. Examples are a projection along the line of sight (e.g. column density in cosmological simulations) or cutting planes through interesting regions of the computational box (e.g. along the main axis in a jet simulation).

2. The user application must provide tools to have multiple views of the data (e.g. orthogonal planes) and to select interesting regions.
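As an illustration of these two kinds of thumbnail, the sketch below reduces a 3D density grid to a 2D map, once by summing along z (a column-density-like projection) and once by cutting a plane at fixed z. The nested-list layout density[i][j][k] and the function names are assumptions.

```python
def project_along_z(density):
    """Sum the cells along the last axis, giving a 2D map."""
    return [[sum(column) for column in plane] for plane in density]

def cut_plane_z(density, k):
    """Extract the 2D plane at fixed index k along the last axis."""
    return [[column[k] for column in plane] for plane in density]

density = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
projection = project_along_z(density)
plane = cut_plane_z(density, 0)
```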


Selection algorithms

1. The thumbnail is a subset of the whole dataset, determined according to specific selection algorithms. E.g. the highest peaks of a mass distribution or the regions with temperature higher than a certain threshold…

2. The dimensionality of the data is the same (in general) as the original one.

3. The client tools must allow the user to choose some of the resulting objects and get the corresponding data.
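Two simple selection criteria of the kind described in this list are sketched below: the highest peaks of a mass distribution and the cells above a temperature threshold. The function names and the flat-list data layout are illustrative assumptions, not part of the protocol.

```python
import heapq

def highest_peaks(masses, n_peaks):
    """Return (index, mass) pairs for the n_peaks largest masses, largest first."""
    return heapq.nlargest(n_peaks, enumerate(masses), key=lambda item: item[1])

def above_threshold(temperatures, threshold):
    """Return the indices of the cells hotter than the threshold."""
    return [i for i, t in enumerate(temperatures) if t > threshold]

peaks = highest_peaks([1.0, 5.0, 3.0, 9.0], n_peaks=2)
hot_cells = above_threshold([10.0, 1.0e7, 5.0e6], threshold=1.0e6)
```

The returned indices are what a client tool would present to the user for choosing the objects whose data should be extracted.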


The getThumb method

The specific details of the thumbnail services depend on their implementation and they must be published to the registry. However, a minimal set of negotiation methods and interfaces can be defined.

A getThumb method MUST be implemented. The input of this method is a pair of DATASERVICE, DATASOURCE parameters which identifies the dataset of interest. The output is a SNAP VOTable (or XML descriptor file, see later…) describing the thumbnail features. As for the results of the Snap procedure, thumbnail data are stored in external binary files. Since they are small, these files are downloaded together with the VOTable immediately, as the response to the method.

No other details can be specified a priori, since they depend strongly on the service implementation.


4. THE SNAP SERVICE


The SNAP service

The main target of the SNAP service is the access to the raw data of a simulation, selected by a general Simulation Query

The SNAP service in general provides the following functionalities:

1. Extraction of a subset of data selected in a rectangular or spherical volume

2. Storage of the associated metadata in a VOTable (or XML descriptor file)

3. Storage of data in a binary file

4. Delivery of the result to the user via http, ftp etc.

The extraction step (item 1) allows the user to focus on regions of interest without having to download the whole dataset. Nevertheless, retrieving the complete dataset is still possible.


The setSnap method

In order to submit the Snap request, a setSnap() method MUST be implemented, with parameters defined in the following slides.

To select the region of interest, only geometric parameters are necessary. For a rectangular region, the user has to specify the center of the box and the length of each of its sides. For a spherical selection, center and radius of the sphere are required. One or more variables of a given snapshot can be selected in the same cutout operation. Only one timestep corresponds to a setSnap request.


setSnap input parameters

An input Sub-Volume query consists of an x,y,z position in the box, plus the side lengths (or radius) of the rectangular (spherical) region surrounding this point. The client may work in any units it chooses, but it must convert all values into server-compatible units before submitting the request.

The service MUST support the following parameters:

POS

The position of the center of the region of interest, expressed in proper units. Example: "POS=0.3,0.25,0.9". A NULL value represents the center of the whole box (e.g. 0.5,0.5,0.5).

SIZE

The size of the sides (or the radius) of the region in proper units. The region may be specified using either one or three values. If only one value is given it represents the radius of the sphere; three values represent the side lengths of the rectangular region. The format of the SIZE parameter is the same as that for POS. Example: "SIZE=0.2,0.5,0.3". A special case is SIZE=NULL, which represents the whole box.
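How a server might interpret POS and SIZE can be sketched as follows. This is a hypothetical helper, not part of the protocol: it assumes a box normalized to [0, 1], takes the SIZE values as full side lengths, and ignores boundary handling.

```python
def parse_region(pos, size, box_center=(0.5, 0.5, 0.5)):
    """Turn POS and SIZE strings into (center, kind, extent)."""
    center = box_center if pos == "NULL" else tuple(float(v) for v in pos.split(","))
    if size == "NULL":
        return center, "whole", None
    values = tuple(float(v) for v in size.split(","))
    if len(values) == 1:
        return center, "sphere", values[0]   # single value: radius
    if len(values) == 3:
        return center, "box", values         # three values: side lengths
    raise ValueError("SIZE must have one or three values")

def contains(region, point):
    """Check whether a point falls inside the parsed region."""
    center, kind, extent = region
    if kind == "whole":
        return True
    offsets = [p - c for p, c in zip(point, center)]
    if kind == "sphere":
        return sum(d * d for d in offsets) <= extent ** 2
    return all(abs(d) <= side / 2 for d, side in zip(offsets, extent))

box = parse_region("0.3,0.25,0.9", "0.2,0.5,0.3")
sphere = parse_region("NULL", "0.1")
```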


setSnap input parameters (cont.ed)

The following parameters SHOULD be supported:

BOUNDARY

This parameter can also have one or three values, one for each coordinate direction. If only one value is given it applies to all coordinate axes. Possible values are:

• TRUNC – if the region of interest exceeds the computational box, it is truncated at the box boundary

• PERIODIC – if the region of interest exceeds the computational box, data are selected from the opposite side of the box

The service metadata indicate whether PERIODIC is supported.

FIELDS

The service SHOULD support an optional parameter with the name FIELDS, the value of which is a comma separated list of field names corresponding to the data elements the simulation can return. If the parameter is not provided the default behavior is to return all fields. Example: "FIELDS=Density,Temperature,Velocity_x"
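The difference between the two BOUNDARY values can be illustrated along a single coordinate axis. The sketch below is an assumption-laden example: it works in a box of length 1, requires the region side to be no larger than the box, and returns the coordinate intervals from which data would be selected.

```python
def select_interval(center, side, boundary, box=1.0):
    """Return the list of [lo, hi] intervals selected along one axis."""
    lo, hi = center - side / 2, center + side / 2
    if boundary == "TRUNC":
        # Clip the region at the box boundary.
        return [(max(lo, 0.0), min(hi, box))]
    if boundary == "PERIODIC":
        # Wrap the part that sticks out to the opposite side of the box.
        if lo < 0.0:
            return [(0.0, hi), (lo + box, box)]
        if hi > box:
            return [(lo, box), (0.0, hi - box)]
        return [(lo, hi)]
    raise ValueError("BOUNDARY must be TRUNC or PERIODIC")

truncated = select_interval(0.875, 0.5, "TRUNC")
wrapped = select_interval(0.875, 0.5, "PERIODIC")
```

With three BOUNDARY values, the same function would simply be applied to each axis with its own setting.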


setSnap input parameter: data sources

Simulation outputs are stored in files. These files can be indicated by a reference name which unambiguously identifies the data source. The data source can also be a database. However, this implies nothing about the service interface: the complexity of the database access is hidden behind the setSnap operation, and its implementation is up to the service provider.

DATASOURCE

The service MUST support a parameter with the name DATASOURCE, the value of which is a single data source reference. Examples: "DATASOURCE=/scratch/my_directory/myfile1.bin", "DATASOURCE=myfile2.ref"

The service id MUST also be specified.

DATASERVICE

Identification of the data service (to be better specified)

A SNAP operation MUST refer to a single data source. Multiple-source cutouts, such as for various timesteps of the same simulation, are not supported by the protocol. Their implementation is up to the client, for example as a sequence of single-source requests with the same sub-box and fields. The client must verify that such an operation is possible and/or meaningful.
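A client-side loop over timesteps, issuing one single-source request per data source with the same sub-box and fields, might be prepared as below. The encoding of the parameters as an HTTP query string, the service id "my_service" and the data source names are assumptions made for this sketch; the protocol text does not fix them.

```python
from urllib.parse import urlencode

def set_snap_query(dataservice, datasource, pos, size, fields=None, boundary=None):
    """Build the parameter string for one single-source setSnap request."""
    params = {
        "DATASERVICE": dataservice,
        "DATASOURCE": datasource,
        "POS": ",".join(f"{v:g}" for v in pos),
        "SIZE": ",".join(f"{v:g}" for v in size),
    }
    if boundary is not None:
        params["BOUNDARY"] = boundary
    if fields:
        params["FIELDS"] = ",".join(fields)
    # Keep commas and slashes readable instead of percent-encoding them.
    return urlencode(params, safe=",/")

# Hypothetical data sources: one reference per timestep of the same simulation.
sources = ["snap_t001.ref", "snap_t002.ref", "snap_t003.ref"]
queries = [
    set_snap_query("my_service", src, (0.3, 0.25, 0.9), (0.2, 0.5, 0.3),
                   fields=["Density", "Temperature"])
    for src in sources
]
```

Each element of queries describes an independent SNAP operation, so each one yields its own request id and result.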


setSnap input parameters: File Formats etc.

The SNAP service delivers its results as VOTables (or XML descriptor files) with associated binary files.

The service MAY support a parameter with the name FORMAT to indicate the desired format or formats of the data referenced by the output table. Possible formats are:

• data/raw_tabular

• data/raw_sequential

• data/votable

• data/hdf5

• data/fits

Service-Defined Parameters. The service MAY support additional service-specific parameters. The names, meanings, and allowed values are defined by the service. The names need not be upper-case; however, they should not match any of the reserved parameter names defined above.


setSnap output

The setSnap output is an id to the setSnap Result

This id will be used in the next stages of data delivery


SNAP results

The result of the SNAP query consists of

• A VOTable (or XML descriptor file) with the description of the result and of the data

• A binary file with the extracted data

Both the VOTable and data could be delivered after a staging procedure (see later)

The description VOTable consists of the following elements:

• a RESOURCE element, identified with the attribute type="results", containing one or more TABLE elements with the metadata results of the setSnap operation

• The TABLE in the output VOTable MUST contain FIELDs that refer to the variables stored in the external binary file. FIELDs can be organized either as a table or as sequences

• Variables must be scalars, i.e. vectors (or more general multidimensional quantities) are not supported directly. Instead, separate FIELDs represent the different components of the vector


SNAP results (cont.ed)

• The VOTable MUST contain a DATASERVICE parameter which identifies the used service.

• The VOTable MUST contain a REQUEST_ID parameter which identifies uniquely the job request on the service.

• The VOTable MUST contain a REQUEST_STATUS parameter which can be Ok or Rejected. In the latter case all the other fields of the VOTable are not present.

• A single TABLE can contain different variables of the same species. Species can differ either by their geometrical representation (e.g. particles, regular meshes, AMR meshes…) or in their “physical meaning” (e.g. star particles vs. dark matter particles). All the FIELDs in a TABLE have the same number of elements, specified by the arraysize TABLE parameter. This parameter also sets the geometry of the quantity. E.g. arraysize="N" represents a point-like quantity; arraysize="NxMxS" represents a grid-based variable. Resulting data FIELDs are stored one after the other in a single binary file, in the same order they appear in the VOTable.


SNAP results (cont.ed)

• Each TABLE MUST contain FIELDs where the UCDs have been set. FIELDS refer to the variables stored in the external binary file.

• Each FIELD must specify the datatype and the unit of the variable. Furthermore name and ID have to be set.

• The binary data file reference, acref, is specified in a DATA section

• Other parameters may be supported according to the services offered by the data provider.


SNAP VOTables examples 1

VOTable for the velocity field of a fluid on a fixed 3D mesh

<VOTABLE>
  <RESOURCE name="myVectorField" type="results">
    <DESCRIPTION>Velocity Field from N-Body run</DESCRIPTION>
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="VelocityField" ID="Vel" order="sequential" arraysize="41x41x41">
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x" datatype="float" unit="km/s"/>
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y" datatype="float" unit="km/s"/>
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z" datatype="float" unit="km/s"/>
      <DATA>
        <BINARY>
          <STREAM acref="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
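A client could read such a descriptor with a standard XML parser to learn the layout of the binary file before opening it. The sketch below uses Python's xml.etree.ElementTree on a cleaned-up copy of the velocity-field example; the function name and the dictionary keys are assumptions, and a real descriptor may carry namespaces and attributes that this sketch ignores.

```python
import xml.etree.ElementTree as ET
from math import prod

DESCRIPTOR = """<VOTABLE>
  <RESOURCE name="myVectorField" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="VelocityField" ID="Vel" order="sequential" arraysize="41x41x41">
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x" datatype="float" unit="km/s"/>
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y" datatype="float" unit="km/s"/>
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z" datatype="float" unit="km/s"/>
      <DATA><BINARY><STREAM acref="file:///scratch/myhome/test.bin"/></BINARY></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

def describe_tables(xml_text):
    """Collect, for each TABLE, its shape, storage order, fields and data reference."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("TABLE"):
        shape = tuple(int(n) for n in table.get("arraysize").split("x"))
        stream = table.find("./DATA/BINARY/STREAM")
        tables.append({
            "name": table.get("name"),
            "order": table.get("order"),
            "shape": shape,
            "n_elements": prod(shape),  # elements per FIELD
            "fields": [(f.get("name"), f.get("datatype"), f.get("unit"))
                       for f in table.iter("FIELD")],
            "acref": stream.get("acref"),
        })
    return tables

tables = describe_tables(DESCRIPTOR)
```

With the number of elements per FIELD and the storage order known, the client can compute where each variable starts in the binary file.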


SNAP VOTables examples 3

VOTable for the temperature field of a mesh-based quantity and the positions of N-Body particles extracted from the same spatial region.

<VOTABLE>
  <RESOURCE name="myMixedData" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="Particles" ID="NBody" order="sequential" arraysize="100000">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x" datatype="float" unit="Mpc"/>
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y" datatype="float" unit="Mpc"/>
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z" datatype="float" unit="Mpc"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/particles.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
    <TABLE name="Mesh" ID="MeshTemp" order="sequential" arraysize="41x41x41">
      <FIELD name="temperature" ID="temp" ucd="phys.temperature;pos.cartesian" datatype="float" unit="K"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/mesh.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>


5. DATA STAGING AND RESULTS


Data Staging

By Data Staging we refer to the processing the server performs to retrieve or generate the requested simulation volume or subvolume and cache it in online storage for retrieval by a client.

Staging is necessary for large archives which must retrieve simulation data from hierarchical storage, or for services which can dynamically extract subvolumes, where it may take a substantial time (e.g. minutes or hours) to retrieve the data in the relevant region of the simulation box.

The snapshot staging service is optional for the simulation server. If staging is not implemented, data should be immediately available for retrieval. The availability of this function is communicated to the registry services.

The getSnap method is identical whether or not staging is used.


Data Delivery

As soon as staged data are available at the given URL, the user can start the download procedure.

The user can be informed of the availability of the data following two different approaches:

1. The client searches for the data on the service (e.g. by reloading a web/ftp page).

2. The service searches for the client and, if present, sends information to it.


Server messaging

Second approach:

The server must provide a messaging capability.

The client must have an identity to be recognized by the service.

The service broadcasts messages to identified clients whenever a staging (processing) event occurs (e.g. data are available).

Service-generated messages can also be used to pass informational or diagnostic messages to clients as processing proceeds.


Client identification

SNAP is not just a search-and-download service: it also requires running processes and, possibly, managing them.

Therefore authentication of the client should be required. This is strictly required for approach 2, in which the user must be detected and identified by the service.

However, authentication should always be required for security and privacy reasons: access to the services should be granted only to trusted users with proper privileges.

Authentication could be based on username and password or on more sophisticated methods, such as certificates. This choice is up to the service provider.

Authentication allows the user to use the scheduling/batch system implemented by the service provider.


Staging services

The provider should support at least two basic operations:

• Job monitoring

• Job cancellation

The specific implementation of the two operations depends on the adopted service technology.

Both operations use the SERVICE and REQUEST_ID parameters written in the VOTable. They are called using proper web methods:

• getSnapInfo(SERVICE, REQUEST_ID, SNAPINFO)

• cancelSnap(SERVICE, REQUEST_ID, SNAPINFO)

The getSnapInfo method returns a SNAPINFO string with the following information: STATUS (Idle, Hold, Cancelled, Running, reJected), SUBMISSION_DATE, other (up to the service provider, specified to the registry). The cancelSnap method returns a SNAPINFO string that can have the values “Ok” or “Rejected”.

Other services can be implemented and registered by the provider.


Data Delivery

The getSnap(acref, SERVICE, STATUS) web method allows a client to retrieve a single binary simulation file and the corresponding XML descriptor file, given the reference returned by the setSnap method.

The files can be downloaded using HTTP, FTP or GridFTP (or any other useful protocol). All the metadata about the content and the structure of the data file are stored in the associated VOTable.

XML header files (VOTables) are stored as well, and they are downloaded together with the binary file using the same getSnap method.

The getSnap method returns a STATUS string which can be Ok, Rejected or Deferred (if data are not yet available).
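When staging is in use, a client following the first notification approach (the client searches for the data) ends up polling getSnap until the request leaves the deferred state. The loop below is a sketch under stated assumptions: get_snap is a hypothetical callable wrapping the web method and returning only the STATUS string, and the polling interval and retry limit are arbitrary.

```python
import time

def wait_for_snap(get_snap, acref, service, poll_seconds=30.0, max_polls=100,
                  sleep=time.sleep):
    """Call get_snap until it reports Ok or Rejected, or give up after max_polls."""
    for _ in range(max_polls):
        status = get_snap(acref, service)
        if status in ("Ok", "Rejected"):
            return status
        sleep(poll_seconds)  # data not yet available: wait and retry
    return "Deferred"

# Stand-in for a real service: not ready twice, then ready.
_responses = iter(["Deferred", "Deferred", "Ok"])

def fake_get_snap(acref, service):
    return next(_responses)

final_status = wait_for_snap(fake_get_snap, "result-42", "my_service",
                             sleep=lambda seconds: None)
```

The sleep function is passed in as a parameter so that the loop can be exercised without actually waiting.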


6. FILE FORMATS AND STANDARDS


File formats

Data produced by simulation codes are stored in files with different and, usually, non-standard formats.

This makes it difficult to handle and exchange data.

E.g. Gadget has its own file format (although it also supports HDF5). This format has no access library support, it is not extensible, data access is not efficient, and it is strictly tied to the application.

File formats should be:

• Standard

• Flexible

• Extensible

• Portable

• Fast

• Easily usable by applications

• SELF DESCRIPTIVE

Possible solutions:

Raw Binaries

FITS

HDF5

VOTables

NetCDF

…


File formats: archive and results

Data file formats can be different according to their usage

Archive side files should be

• High performance (fast access)

• Standard (portable and persistent)

Result files should be

• Simple (specific I/O libraries are not required to access them)

• Self descriptive (e.g. XML metadata headers)

• Compressed (to minimize transfer effort)

In any case, data size is crucial. ASCII files are “deprecated”. Base64 (or similar) encoding for HTTP transfers is to be avoided: it wastes time (for conversions) and space (increased size).


Result files

A simple solution is represented by raw binary files with the following characteristics.

• A file can store several variables

• Each variable represents a scalar quantity

• Components of multidimensional quantities are stored as separate variables

• Variables have the same number of elements but they can have different types

• Variables can be stored either as Tabular or as Sequential (see next slide)

• A descriptor file (XML) is associated with the binary file to make it self-descriptive

Advantages: (little) standardization, simplicity, no I/O specific libraries required, fast access

Drawbacks: limited portability (endianness problem, data types), little standardization, no compression
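The endianness problem listed among the drawbacks can be shown with Python's struct module. The sketch assumes 32-bit floats; the point is only that a raw binary file is readable when the byte order is known, for instance from the descriptor file.

```python
import struct

values = (1.0, 2.5, -3.0)

# The same three 32-bit floats written with the two byte orders.
little = struct.pack("<3f", *values)
big = struct.pack(">3f", *values)

# Reading with the byte order used for writing recovers the data...
right = struct.unpack("<3f", little)
# ...while reading with the other byte order silently yields different numbers.
wrong = struct.unpack(">3f", little)
```

No error is raised in the second case, which is why the byte order has to be recorded somewhere outside the binary file itself.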


Result files: Tabular vs Sequential

Tabular files are closer to observational data, and so more compatible with the standard VOTable idea.

If the file contains the 3 variables vx, vy, vz, their Tabular storage is:

vx(1), vy(1), vz(1)

vx(2), vy(2), vz(2)

…

vx(N), vy(N), vz(N)

This is suitable for variables (like the components of a vector) which are always accessed as an N-tuple, or for data analysis tools which need (and load) all the stored variables.

However, it leads to poor performance if variables have to be loaded separately into memory. Loading one variable requires continuous jumps through the file.


Result files: Tabular vs Sequential (cont.ed)

Sequential files are a common choice for “simulators”

If the file contains the 2 variables rho and press, their Sequential storage is:

rho(1)

rho(2)

…

rho(N)

press(1)

press(2)

…

press(N)

Each variable can be read with a single I/O call. This leads to high-performance access to the file, which is typically required when dealing with large files.
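The access patterns of the Tabular and Sequential layouts can be compared on a small in-memory example. The sketch assumes little-endian 32-bit floats and three variables of equal length; the function names are illustrative.

```python
import struct

vx = [1.0, 2.0, 3.0]
vy = [10.0, 20.0, 30.0]
vz = [100.0, 200.0, 300.0]
n = len(vx)

# Tabular: records (vx(i), vy(i), vz(i)) stored one after another.
tabular = b"".join(struct.pack("<3f", *rec) for rec in zip(vx, vy, vz))

# Sequential: all of vx, then all of vy, then all of vz.
sequential = b"".join(struct.pack(f"<{n}f", *var) for var in (vx, vy, vz))

def read_sequential(buf, index, n):
    """Read variable number `index` with one contiguous read."""
    return list(struct.unpack_from(f"<{n}f", buf, index * n * 4))

def read_tabular(buf, index, n, n_vars):
    """Read variable number `index` with one strided read per element."""
    return [struct.unpack_from("<f", buf, (i * n_vars + index) * 4)[0]
            for i in range(n)]

vy_from_sequential = read_sequential(sequential, 1, n)
vy_from_tabular = read_tabular(tabular, 1, n, 3)
```

Both calls return the same values, but the tabular read touches N separate locations while the sequential read covers a single contiguous block.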


Archive files

Archive files are not “visible” to the end user. Therefore the data provider can choose any suitable format.

The choice should be in general driven by several properties:

• The format should be standard and well supported, in order to ensure the preservation of the data and their portability between different computing platforms, software, compilers... (if the technology changes we don’t want to change the data)

• The files should be fast and efficient to access, since the data are large and complex operations could be necessary to handle them (e.g. extracting the particles which fall in a certain region)

Various formats, with such features, are available.


File formats: HDF5

HDF5 (http://hdf.ncsa.uiuc.edu) represents a possible solution to deal with such data

HDF5 is

• Portable between most modern platforms

• High performance

• Well supported

• Well documented

• Rich in tools

• Flexible and extensible

HDF5 data files are

• Platform independent (portable)

• Well organized

• Self defined

• Metadata enriched

• Efficiently accessible

HDF5 drawbacks

• Requires some expertise and skill to be used

• Information is difficult to access

• Can be subject to major library changes (see HDF4 to HDF5)


File formats: HDF5 hierarchical structure and self-consistency

The data file can have a (complex) hierarchical, filesystem like, structure with groups (directories) and datasets (files)

The base group is “/” (root). Files can have only the root group

/BmMassDensity Dataset {512, 512, 512}

/BmTemperature Dataset {512, 512, 512}

/BmVelocity Dataset {512, 512, 512, 3}

/DmMassDensity Dataset {512, 512, 512}

/DmPosition Dataset {134217728, 3}

/DmVelocity Dataset {134217728, 3}

Or, they can store different simulation output times in different groups

/Time1/Mesh/BmMassDensity Dataset {512, 512, 512}

/Time1/Mesh/BmTemperature Dataset {512, 512, 512}

/Time1/Particles/DmPosition Dataset {134217728, 3}

/Time2/Mesh/BmMassDensity Dataset {512, 512, 512}

/Time2/Mesh/BmTemperature Dataset {512, 512, 512}

/Time2/Particles/DmPosition Dataset {134217728, 3}

HDF5 metadata make the file completely self-consistent

Structural metadata (strictly required by the library)

• Rank

• Dimensionality

Annotation metadata (required by our implementation)

• Data object name

• Data object description

• Unit

• Formula
