Common Agricultural Policy Regional Impact – The Rural Development Dimension

Collaborative project - Small to medium-scale focused research project under the Seventh Framework Programme

Project No.: 226195

WP2.3 Databases – CAPRI Database Extension and Quality Management

Deliverable: D2.3.6

A metadata catalogue for CAPRI-RD

Wolfgang Britz

Institute for Food and Resource Economics

University Bonn


Bonn, January, 2012

A metadata catalogue for CAPRI-RD

1. Motivation and background
   Data flow in the CAPRI production chain
2. Meta data standard implemented
3. Meta data for main data sets used in CAPRI
   COCO
   CAPREG
   CAPRI Farm Type Layer
   GLOBAL
   CAPTRD
   Baseline calibration, international part
4. Technical Meta-Data handling
5. Software solution
   Introduction of meta data into data files
   Introduction of meta data for generated data sets
   Handling of meta data in GAMS
   Check of meta data after a compile
   Exploitation of meta data in the GUI
6. Introduction of code fragments into GDX files
   Baseline calibration in CAPMOD
   Meta data of results generated by CAPRI in XML format
7. Summary and conclusions
Annex 1: Geographic Meta Data proposed for SENSOR and SEAMLESS
References

1. Motivation and background

CAPRI is a large-scale, complex modelling system which processes many data sets from different data providers. The work steps from raw data extraction to an ex-ante baseline are manifold, and the outcome of each step may affect how the model reacts in a policy experiment. Successful maintenance of the system asks, firstly, for clear documentation of which data are used where. Secondly, clear process definitions for how to update these data are needed. And thirdly, a technical infrastructure supporting these processes must be available.

This document is intended to serve mainly the first task, clear documentation, while at the same time also documenting the technical infrastructure developed to support the process of data updates. The document is organized as follows. The first part discusses the data flow in the CAPRI production chain, including a short characterization of the processes, followed by a proposal for a meta data schema. Then, for each of the processes, the different data sets included are discussed and meta data from recent updates are shown. The last part of the document discusses the technical solution chosen.

Data flow in the CAPRI production chain

The data flow between work steps in CAPRI is based on the I/O of so-called GDX files, an internal binary data format of GAMS. The different work steps output one or several GDX files which are loaded by subsequent steps. Each GDX file comprises one or several GAMS symbols, among them at least one parameter holding floating point numbers labelled with CAPRI mnemonics. Additionally, as indicated below, most work steps also add meta information stored as a SET next to the parameter. The exact structure of the parameters differs from case to case; the most common convention is to have the regions on the first dimension, products or activities on the second, items such as input/output coefficients on the third and years on the fourth.
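The typical parameter layout and the GDX exchange between work steps might be sketched in GAMS as follows (a minimal illustration; all set names, element codes and file names here are invented, not the actual CAPRI declarations):

```gams
* Hypothetical sketch of the typical CAPRI data cube: regions on the
* first dimension, activities/products on the second, items on the
* third, years on the fourth.
SET RALL  'regions'    / DE21 'Oberbayern', FR10 'Ile de France' /;
SET COLS  'activities' / SWHE 'soft wheat', DCOW 'dairy cows' /;
SET ROWS  'items'      / LEVL 'activity level', YILD 'yield' /;
SET YEARS 'years'      / 2000*2004 /;

PARAMETER DATA(RALL,COLS,ROWS,YEARS) 'example data cube';
DATA('DE21','SWHE','LEVL','2004') = 400;

* A work step writes such a cube to a GDX file ...
EXECUTE_UNLOAD 'coco_output.gdx', DATA;
* ... and a downstream step would load it again at compile time:
* $GDXIN coco_output.gdx
* $LOAD DATA
* $GDXIN
```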


Graphic: Principal data flow in the CAPRI production chain. (The figure shows raw statistical data from Eurostat, FAOSTAT and other sources entering CoCo; the CoCo GDX output together with further raw data entering CAPREG and CAPTRD; GLOBAL processing raw international data and projections from FAPRI, AgLink and others; the resulting GDX files entering CAPMOD; and CAPDIS drawing additionally on raw data such as CORINE and the soil map.)

CoCo (Completeness & Consistency) is a mostly statistically based approach where data from different statistical domains and data providers are integrated into a time series data base at national level which covers mainly market balances, unit value prices, the positions of the Economic Accounts for Agriculture, acreages and herd sizes as well as output coefficients. The CoCo data are passed both to CAPREG, to add the sub-national spatial dimensions, and to CAPTRD for the so-called "nowcasting" exercise.

Next in the production chain is CAPREG, responsible for adding the sub-national regional and farm type breakdown of herd sizes, acreages and crop yields to the time series provided by CoCo. It also adds input coefficients, crop nutrient requirements and related fertilizer application rates as well as animal requirement and feeding coefficients, and it derives manifold economic and environmental indicators from the results. CAPREG is logically broken down into three applications: the first one constructs the time series; the second one derives the three year average from the time series, adds the feed distribution and calibrates the supply models for the base year for the NUTS2 regions; the third one uses the NUTS2 base year data to perform the same tasks for the farm type layer. Besides its own results, CAPREG also outputs the results of CoCo (typically unchanged), both following the structure of an activity-based table of accounts. These data flow both into CAPTRD and the simulation model calibration.

Both CoCo and CAPREG may be started for single Member States or a different set of years, and CAPREG also for different base years. That flexibility has proven extremely useful to speed up processing and to avoid being forced to always update the information for 27 EU Member States, Norway, Turkey and the Western Balkan countries even if only some data for a single country need a revision. But the flexibility clearly carries the risk of forgetting to run the program for some of the countries after a data base update or a change in the code. As CoCo and CAPREG produce huge files comprising data for all countries together, the only information available so far was the file system time stamp showing when the file was last stored.

Most of the international data not covered by the CoCo-CAPREG processing chain are processed by GLOBAL and passed from there both to CAPTRD and to CAPMOD (in baseline mode, for the calibration of the international part). Similar to CoCo, GLOBAL collects information from many data providers on many topics related to agriculture (market balances, land use, yields, bi-lateral trade flows in quantities and values as well as derived unit values and transport cost margins, border protection information), both ex-post and ex-ante.

The regional and national time series from CAPREG, along with recent national updates from CoCo which are not yet complete and for which no regional data are available yet, enter the baseline engine CAPTRD, where they, together with outlooks from other institutions and experts, feed into a set of mutually consistent data on market balances, acreages and herd sizes, yields and other output coefficients as well as feed coefficients for one or several future simulation years. As with CoCo and CAPREG, CAPTRD may be started for single countries, and at the NUTS 0, NUTS II or farm type level.

The base year data from CoCo/CAPREG, along with the outlook results from CAPTRD and the base year and external outlook data processed by GLOBAL, enter the model calibration step, which then provides the basis for the actual counterfactual scenario runs.

2. Meta data standard implemented

The search for an appropriate meta data standard for data bases used in modelling agriculture is not new; the two large scale European projects SENSOR and SEAMLESS have developed a common standard (see Hazeu et al. 2006 and Annex 1), building on a more general international concept and an EU one. The basic one is ISO 19115 ("ISO 19115:2003 defines the schema required for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data ... Though ISO 19115:2003 is applicable to digital data, its principles can be extended to many other forms of geographic data such as maps, charts, and textual documents as well as non-geographic data.", ISO 2011).

For the EU, the general concept is the so-called INSPIRE directive (Directive 2007/2/EC of

the European Parliament and of the Council of 14 March 2007 establishing an Infrastructure

for Spatial Information in the European Community (INSPIRE)): “To ensure that the spatial

data infrastructures of the Member States are compatible and usable in a Community and

transboundary context, the Directive requires that common Implementing Rules (IR) are

adopted in a number of specific areas (Metadata, Data Specifications, Network Services, Data

and Service Sharing and Monitoring and Reporting). These IRs are adopted as Commission

Decisions or Regulations, and are binding in their entirety” (from the INSPIRE homepage).

Commission Regulation (EC) No 1205/2008 actually lays down rules concerning metadata

used to describe the spatial data sets and services, which are further detailed in technical

guidelines (EU Commission 2010).

The focus of these initiatives is clearly on geographic meta data (such as details on the geographic reference system), which are of often limited relevance for data sets building on administrative units as typically used in economic modelling; equally, the implementation of metadata on metadata may be less important at the current stage. However, in order to ensure inter-operability, especially with bio-physical modelling, using a common standard is certainly beneficial, and there seems no good reason to deviate from the one proposed by Hazeu et al. 2006. From that set, the following minimum attributes were selected which should be documented for any major data set used in CAPRI:

Title of the data set * 15.24.360
Abstract * 15.25
Keywords * 15.33.53
Topic category * 15.41
Temporal coverage *
Geographic coverage by name *
Date of version * 15.24.362.394
Name of owner organisation * 15.29.376
Name of originator organisation 15.29.376
Name of distributor organisation
Description of process steps 18.81.84.87
Language within the data set * 15.39
Name of exchange format * 15.32.285

Further elements may be added when thought necessary. That standard set is handled explicitly by the programs, in the sense that, e.g., meta data referring to data for countries not comprised in a scenario run are deleted. The following specific notes apply to these fields:

- The temporal coverage should document the years for which the data are available. A sequence of years can be indicated by a "-", such as in 1990-2000. It is assumed that any data set used in CAPRI has a periodicity of one year (which can be an integral over time, e.g. production quantities, or a (set of) representative observation(s), such as in the case of price notations).

- The keywords should, where applicable, be taken from http://www.eionet.europa.eu/gemet/inspire_themes (cp. EU0201, page 23), but those are rather general and require additional detail to be useful for searching the meta data catalogue of CAPRI. Another classification, focusing on agriculture, is proposed by the FAO under the AGRIS initiative (International Information System for the Agricultural Sciences and Technology) and can be found at http://agris.fao.org/.

- The name of the exchange format should refer to the format by which the data are integrated in the actual CAPRI work flow (typically GAMS tables, GAMS GDX, GAMS csv).

- The description of process steps should refer to the work step in the CAPRI work flow.

- The date of version should refer to the date when the data were collected from the provider, or, in the case of a further processing step (such as removal of outliers), to the date of that cleaning.

The meta data set handled in CAPRI is not intended as sufficient information for a successful consecutive update, but as the minimum information required to report data sources and to check internal consistency in data use.


Any additional technical information necessary to update the data should be documented whenever possible in the GAMS source where the meta data are integrated. It should be possible, with a search on disk, to locate the META set implemented in the GAMS file and to find there the relevant information to repeat the update. That additional information should typically cover:

- The reference/URL used the last time the data were collected

- The original format in which the data were collected (e.g. XLS)

- Information on where the original can be found (e.g. where in the SVN repository or on the disk of which collaborator)

- A description of how the raw data were transformed into a GAMS readable format, in sufficient detail to repeat the process

- When necessary and available, references to a technical document where the data transformation processes are described and/or to the methodological documentation
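Such a META set, together with the update hints kept as comments, might look as follows in a GAMS data file (a purely hypothetical sketch; all element names, texts, dates and URLs are invented for illustration):

```gams
* Hypothetical META set embedded in a GAMS data file; the comments carry
* the technical update information described above (details invented):
* - URL last used:   http://epp.eurostat.ec.europa.eu  (example only)
* - Original format: XLS, stored in the SVN repository
* - Transformation:  exported to CSV, converted to a GAMS table by hand
SET META 'meta data attached to this data set' /
   Title       'Example land use statistics'
   Abstract    'Crop acreages at NUTS 2 level'
   Keywords    'land use, agriculture'
   TempCover   '1990-2000'
   GeoCover    'EU27'
   DateVersion '2011-06-30'
   Owner       'Eurostat'
   ExchFormat  'GAMS table'
/;
```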

3. Meta data for main data sets used in CAPRI

COCO

In CoCo, the main data provider is Eurostat. Additionally, especially for EU (potential) candidate countries, data from national providers are integrated and complemented by FAO data.

CAPREG

In CAPREG, the main data provider is Eurostat, with the REGIO domain and the Farm Structure Survey. Further data sets loaded into CAPREG are shown in the following. Additionally, there are various small static data sets which relate e.g. to bio-physical attributes such as the nutrient content of manure and are typically not explicitly mentioned. These factors are in most cases documented in the methodological documentation of CAPRI.

The REGIO data (crop acreages and yields, herd sizes) are incorporated in the file "capreg\build_regio_gdx.gms", which reads GAMS text files from "dat\capreg". The text files are generated by ... (contact Andrea!). The main data sets used are:

...


The related meta-data are shown below.

The Farm Structure Survey data provide more disaggregated data on acreages and herd sizes in relation to the activity definitions. They are not available as long term time series and thus only complement the REGIO data in CAPREG. They can be downloaded from the EUROSTAT website (contact Andrea for details). The related meta data are shown below:

Furthermore, CAPREG inputs data from the European Fertilizer Manufacturers Association (EFMA) regarding total national fertilizer use and expert based fertilizer application rates for different crops. They are processed via "dat\fert\ifa_dat.gms"; the actual data are stored in a GDX file, the related meta data are shown below, and they enter the fertilizer distribution estimation.

In order to estimate the input allocation, CAPREG uses the so-called "Gross Margins" which are also used in FADN to determine the Economic Size Unit. These data are stored in "dat\inputs\data_sgm.gms" as an ASCII table. The related meta information is shown below.

CAPRI Farm Type Layer


GLOBAL

Most of the international data not covered by the CoCo-CAPREG processing chain are processed by GLOBAL and passed from there both to CAPTRD and to CAPMOD (in baseline mode). Whereas the historical data (with the exception of biofuels) are to a large extent taken from FAOSTAT, ensuring harmonized definitions and global coverage, there is no single outlook data set which combines global coverage with the necessary detail in commodity and regional breakdown. As a consequence, the integration of the different outlooks (see below) is more demanding.

Historical data

The production, market balance, land use and trade flow data stem mostly from FAOSTAT. The mapping between the CAPRI regions and FAOSTAT is defined in "global\fao_codes.gms". Most other data sources (such as tariff data or long term forecasts from IMPACT) comprise matching sets between the FAOSTAT regional codes and the mnemonics of the other sources. By doing so, "global\fao_codes.gms" should be the only place where the matching between individual countries and the regional breakdown of CAPRI's global market model is defined. Albeit the data are generally available as time series, so far the time series dimension is, in contrast to CAPTRD, not systematically used in the preparation of the international outlook.
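Such a matching between FAOSTAT codes and CAPRI regions via cross-sets might be sketched as follows (a hypothetical illustration in the spirit of global\fao_codes.gms; the codes, names and the aggregation step are invented):

```gams
* Hypothetical matching set between FAOSTAT country codes and CAPRI
* market model regions, plus an aggregation over that mapping:
SET FAO_CODES 'FAOSTAT country codes' / 79 'Germany', 68 'France' /;
SET RMS       'CAPRI market regions'  / DE000 'Germany', FR000 'France' /;
SET FAO_TO_RMS(FAO_CODES,RMS) 'matching' / 79.DE000, 68.FR000 /;

PARAMETER FAODATA(FAO_CODES) 'raw FAOSTAT figures'
          / 79 100, 68 80 /;
PARAMETER CAPDATA(RMS)       'figures mapped to CAPRI regions';
* sum the raw figures over the matching set:
CAPDATA(RMS) = SUM(FAO_TO_RMS(FAO_CODES,RMS), FAODATA(FAO_CODES));
```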

The FAO data on land use are processed in "global\load_fao_data.gms". The raw land use data were downloaded from the FAO web sites in ASCII format and stored in the file "data\global\fao_landstat.gms"; the meta data are shown below.

The same program "global\load_fao_data.gms" also loads the FAO supply utilization accounts, market balances defined in primary product equivalents ...

Population data, including long term projections, stem from the UN as seen below and are loaded via "dat\arm\allpop.gms".


Macro-economic data (GDP, household expenditure, exchange rates against the US $) are taken from the UN, as seen below. They are complemented by Euro conversion rates from "dat\coco\exint.csv" for the US and loaded via the program "global\load_gdpunstats.gms".

Bilateral trade flow data are processed in "global\tradeFlow.gms". The raw data themselves are stored in a GDX which is generated ...

Tariff data still stem from AMAD (see global\aggreg_tar.gms). The file defines matching sets between CAPRI products and HS6 codes. The actual data in HS6 resolution for (selected) WTO members are split into several text files: '..\dat\global\importsall3.gms' comprises import quantities and values, whereas '..\dat\global\tar_specific.gms', '..\dat\global\tar_specific_applied.gms', '..\dat\global\tar_adval.gms' and '..\dat\global\tar_adval_applied.gms' comprise WTO bound and applied rates at HS6 level. To our knowledge, AMAD (http://www.amad.org/pages/0,2987,en_35049325_35049378_1_1_1_1_1,00.html) is no longer maintained, with the last data released in 2007 but time series ending in 2003; it might be important to find a new source for these data.


Transport cost margins are estimated along with cif/fob prices from the FAO trade matrices. For the transport cost estimation, the CAPRI products are grouped. An extremely complex bit of information is the international data on biofuels ...

Outlook data

Outlook data prepared by "global.gms" are used in the ex-ante baseline of CAPRI, both in CAPTRD (for countries, regions and farm types represented by supply models) and in the baseline calibration of CAPMOD for the global market model. The outlook results are collected from AGLINK-COSIMO and FAPRI for the medium term, currently until 2020, as well as from IMPACT and FAO for the long term. As mentioned above, for regions not covered by supply models, the international time series are, in contrast to CAPTRD, not used in CAPRI to complement the international outlooks.

The actual "merging" of the different outlooks as well as the interpolation between medium and long term outlooks is performed during the baseline calibration in "arm\data_prep.gms". Generally, only relative changes from the base year to the simulation year are used for the international part, to avoid differences evolving from the fact that the outlooks use different historical data sets. For each outlook result set, a mapping to CAPRI regarding regions, market balance elements/acreages/prices and commodities is necessary. That is generally achieved by the definition of cross-sets in GAMS.
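The relative-change logic described above might be sketched as follows (a hypothetical illustration only; all identifiers, years and values are invented, and the actual implementation lives in arm\data_prep.gms):

```gams
* Hypothetical sketch: only the growth factor between the outlook's own
* base and target year is applied to the CAPRI base year value, so that
* differing historical data in the outlooks introduce no level shifts.
SET RMS 'regions' / DE000 /;
SET XX  'items'   / SWHE 'soft wheat production' /;

PARAMETER OUTLOOK(RMS,XX,*) 'outlook data on its own historical basis';
PARAMETER BASEYEAR(RMS,XX)  'CAPRI base year data';
PARAMETER TARGET(RMS,XX)    'support for the simulation year';

OUTLOOK('DE000','SWHE','2004') = 20;
OUTLOOK('DE000','SWHE','2020') = 24;
BASEYEAR('DE000','SWHE')       = 22;

* apply the relative change 2004 -> 2020 to the CAPRI base year:
TARGET(RMS,XX) $ OUTLOOK(RMS,XX,'2004')
   = BASEYEAR(RMS,XX) * OUTLOOK(RMS,XX,'2020') / OUTLOOK(RMS,XX,'2004');
```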

The interface to AGLINK-COSIMO went through a series of larger updates over the last years as the OECD re-factored AGLINK, i.e. changed mnemonics and variable lists. The different versions are called "convert_AGLINKxxx", where xxx stands for different years/versions (empty for the pre-2009 version, then 2009, 2010, 2011dgAgri, 2011oecd). They load specific GDX containers of matching historical/outlook data sets stored in "dat\baseline" and require different maps between CAPRI mnemonics and those used in AGLINK (aglink_map*.gms). Maintaining backward compatibility to older outlook releases is mainly an issue during the test phase when a new baseline is developed.

The meta data for the 2010 release (there is also a 2011 preliminary version available) are shown below.


The use of the FAPRI outlook has lost importance after AGLINK was integrated with COSIMO, a model also capturing non-OECD countries. As in the case of AGLINK-COSIMO, there is a conversion program ("global\convert_fapri.gms") which maps selected elements of the FAPRI outlook to CAPRI mnemonics. Due to the far less templated structure of FAPRI, that task is rather tricky, especially as FAPRI often uses domestic prices in connection with exchange rates. The actual data are integrated as a comma delimited table in the file "dat\global\fapri_dat_2011.gms" along with matching set definitions; the meta data are seen below.

The long term projections are handled in "global\f2050_impact.gms". There are two different sets. Result sets from the FAO at2050 exercise are stored in GAMS tables covering historical data from 1984-2003 and outlooks for 2015, 2030 and 2050. The GAMS tables were originally generated from XLS files provided by the FAO global perspectives unit. Baseline results from IFPRI's IMPACT model are downloaded as EXCEL files and stored as GDX containers via a GAMS utility. They comprise market balances, yields, prices and per capita GDP. A specific problem of the IMPACT projections seems to be a constant on the demand side in any year (which might be stock changes or statistical balances); as a remedy, a small model closes the global balances for the relevant projection years and corrects the demand positions such that the constant vanishes.


CAPTRD

The main inputs into CAPTRD are the time series from CAPREG and the AGLINK-COSIMO baseline (see above for GLOBAL). However, there are some other so-called "expert data" which define a priori expectations besides those derived from trend estimates and outlook results. All these additional "supports" enter the program "captrd\expert_support.gms" and can be grouped in three categories:

1. Expert supports for individual countries. The sources differ; for Finland, to give an example, they stem from the DREMFIA model, in other cases they are the outcome of consultations with national experts. The files seem to report the original data provider in sufficient detail; whether an update is feasible or important is not clear.

2. Supports for the sugar market ('captrd\kws_support.gms'). These supports on sugar beet areas in different countries had been developed in close co-operation with the German seed producer KWS; the actual data are stored in "dat\captrd\kws_input.gms".

3. Biofuel supports. The program ('biofuel\bio_trends.gms') reads a baseline from the PRIMES model comprised in a GDX container to derive supports for the use of agricultural and non-agricultural feedstock in biofuel production as well as for fossil fuel use. The data can be scaled to the AGLINK projections by an additional program (captrd\scale_biofuel_to_dgAgrixx.gms, different versions available). That program also reads a file with many different data on biofuels (dat\biofuel\biofuel_data.gms), but seems to extract only data on consumer taxes.

It seems that meta data, if at all, should be introduced for the PRIMES data set. However, the fate of PRIMES is anyhow unclear at the current stage.


Baseline calibration, international part

During the baseline calibration, the international part reads data generated by GLOBAL plus some further data.

A weak spot when it comes to documentation are the border protection data not taken from AMAD (see above), which are mainly TRQs. There are several files (dat\arm\trq_ucl.gms, dat\arm\trq_iap.gms, dat\arm\trq_norway.gms, dat\arm\trq_aglink.gms) where the history of the data is not fully clear. Currently, no meta data are attached; how to update the data is not clear either.

EU notifications regarding subsidized exports are stored in 'dat\arm\EU_subs_export.gms' and serve to parameterize the related behavioural functions in the global market model. Equally, FEOGA budget data are integrated via the file "data\arm\feoga_new_2004.gms". These data are currently used together with the subsidized exports to derive the costs and maximal amounts of market interventions / subsidized exports; the meta data are found below.

4. Technical Meta-Data handling

All the processes discussed above (CoCo, CAPREG etc.) store their numerical results in GDX files, and consecutive steps read them from the GDX files as input. It therefore seemed most promising to introduce the meta information in the GDX files themselves. The big advantage is that shipping the data from one workstation to another, or to and from the SVN server, automatically moves the meta information along. On top, the meta information can also be processed and viewed with regular GAMS tools, as seen below.


As many of the programs responsible for specific work steps may also be started only for a subset of the data (e.g. only for one or several Member States, only for NUTS 0), it can easily happen that data sets used in the different work steps reflect different versions of the data, either regarding updates of the underlying statistical raw data, the version of the algorithms employed or the steering options of the programs. Controlling when, by whom, how and based on what data a certain work step in the production chain was performed increases transparency and avoids pitfalls such as forgetting to run one of the steps, and thus using outdated or even erroneous data.

Maintaining clear meta data documentation also underlines which data need to be updated at regular intervals and, if properly done, also delivers information on where the data can be found. Part of the document is therefore also a technical documentation of where which data are currently stored.

When the Java based GUI was developed, a software solution was integrated which stored meta data about starts of the various GAMS programs in an XML file and presented them to the user. The idea was to keep track of the actions of the user and to allow comparing the order of the work steps and tasks in the production chain with the dates of execution. Besides struggling with the fact that sometimes the information was updated despite the GAMS process not finishing successfully, results for certain work steps are now often downloaded from the software versioning system (SVN), and the GUI will in that case not notice that the meta information would need to be updated.

Additionally, once a work step is executed, the information about the previous status is lost, albeit existing result sets may still be based on it. That has to change. On top, it seems important to ensure that the meta data information is comprised in the same files as the data themselves. The paper therefore discusses, as a successfully implemented solution, handling meta data by storing the information in the GDX files along with the numerical values.


Table: Old meta data information in the GUI



A calibration of the simulation engine CAPMOD requires additional data for other parts of

the world and bi-lateral trade flows in quantities and values, data which are processed by

GLOBAL.


The main challenge consists in the fact that meta information in some cases requires strings to pass information, e.g. about the user. Misusing the double precision numerical data stored in GAMS parameters for meta data is therefore cumbersome. The GDX format allows, however, to store and load GAMS SETs, which are collections of strings: a unique string to identify the set element and an attached long text. SETs in GAMS can be multi-dimensional and are therefore flexible enough to host the necessary meta information.

Meta data as a cross set in a GDX file, viewed with the GAMS IDE

In order to generate the meta information in a uniform standard and to ensure that it is passed to the programs, it is generated automatically by the CAPRI GUI before a GAMS process is called; in cases where data sets had so far been generated by other means, the meta information has been added manually. It is therefore even more important than in the past that users use the GUI to run work steps in the production chain, as otherwise the meta information will most probably not be updated correctly. The example below shows the information passed in the current prototype solution.

Example of meta information stored in a GAMS set

The information comprised in such a set can be stored in a GDX file, preserving the long texts shown on the right hand side, and loaded into a GAMS program or, via a library, accessed by higher programming languages.

In the production chain, each program will load, along with the numerical data provided by upstream work steps, the meta information linked to it, and add its own. In a simulation result set, it is therefore possible to check when and by whom the national time series from CoCo had been generated, etc. When information from the SVN server about the version of the underlying code is integrated as well, such an approach will greatly improve the transparency of the CAPRI production chain and increase confidence when comparing policy experiments.

5. Software solution

The technical implementation relates to three questions:

1. Introduction of meta data in the different data files to document incoming data of various origins (raw statistical data, estimations etc.)

2. Introduction of meta data for data sets generated by GAMS programs started via the

GUI.

3. Handling of meta data inside the GAMS code, and later during exploitation.


Introduction of meta data into data files

In that case, the information must be edited by hand. The following code fragment shows an

example for the REGIO data base from EUROSTAT:
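The fragment referenced here is a screenshot in the original document; a minimal sketch of the pattern it describes could look as follows. The set and item names and the long texts are illustrative, not the actual CAPRI entries:

```gams
* Illustrative sketch: META entries for the REGIO data, generated
* at once for all EU15 Member States via the SET.MS_EU15 shorthand
SET META(*,*,*) "meta information" /
   SET.MS_EU15 . REGIO . source "EUROSTAT REGIO domain"
   SET.MS_EU15 . REGIO . date   "Download of 2011"
/;
```

The `SET.MS_EU15` notation expands in a data statement to all elements of the set MS_EU15, so one line generates an entry per Member State.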

An important feature in that respect is the possibility to generate long texts simultaneously for vectors of codes (in the example, the set MS_EU15), which renders the statement rather compact.

A specific problem with that solution is the fact that GAMS will expand the long texts of the individual set elements: if the elements of MS_EU15 have long texts, the first entry might then e.g. read “Belgium GDX” instead of “GDX”. That can be avoided by a slight refactoring based on an idea by Michael Bussieck from GAMS.COM:

(1) First define a cross set between the regions and the current work step:

(2) Use a set table in the META set definition:
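The two code fragments are screenshots in the original; under the assumption that the refactoring works as described in the text, a sketch with illustrative identifiers could read:

```gams
* (1) Cross set between Member States and the current work step;
*     no long texts are attached here, so nothing gets expanded
SET MS_WORKSTEP(*,*) / SET.MS_EU15 . CAPREG /;

* (2) Use the cross set when filling the META set, so that only the
*     intended long text ("GDX") is attached to each generated tuple
SET META(*,*,*) "meta information" /
   SET.MS_WORKSTEP . format "GDX"
/;
```

Whether the expansion of a two-dimensional set via `SET.MS_WORKSTEP` is used exactly this way in CAPRI cannot be verified from the text; the sketch only illustrates the two-step idea.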

However, manual editing is always both a cumbersome and an error prone process. It is therefore advisable to rapidly implement conversion routines from raw statistical data into GAMS code in Java, to integrate them in the GUI, and to introduce the meta data automatically into the data files. Alternatively, when GAMS processes are used to generate GDX files, these should introduce the metadata fields in the GDX. However, there will always be a number of files (e.g. data from AGLINK-COSIMO, from the international fertilizer association etc.) where meta data need to be added manually. But once all files have been documented once, updating the meta data is relatively straightforward.


Introduction of meta data for generated data sets

The technical implementation is based on the general approach in CAPRI. The Java based GUI passes, along with other steering information, the meta information about the run in GAMS format to a work step, as shown above. The necessary information is collected from user input in the GUI, much of which is identical to the settings for a specific model run. The different work steps in CAPRI are defined as objects with attributes which already match part of the meta data definition.

The following screenshot shows a block of meta information generated automatically by the

GUI.

Handling of meta data in GAMS

The work steps will first load the existing meta information from the previous work steps at compile time, as shown below. That is necessary to allow redefinition based on the “$ONMULTI” setting, which stepwise appends further information to the META data set. In order to allow new meta data to overwrite existing entries from previous runs, the meta data from older runs need to be loaded before the new meta data are introduced.

Code fragment showing how to load meta data from a GDX file at

compile time
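The fragment itself is a screenshot in the original; a sketch of such a compile time load, with an assumed file name and path, might look like:

```gams
SET META(*,*,*,*) "meta information";

* allow the META set to be extended stepwise in several statements
$onmulti

* load the meta data written by the upstream work step
* (file name and path are assumptions for illustration)
$gdxin ..\results\capreg\res_bas.gdx
$load  META
$gdxin
```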


Further statements in the code then introduce meta data about the current run (loaded via the generated GAMS code in files such as forreg.gms or fortran.gms) and from the different data files. At the end of the programs, typically three statements will refer to meta data:

1. Storage of the meta data in the GDX data file together with the numerical values.

2. Temporary output of the meta data at compile time to allow the user to view the meta data of the different data sets processed by the program before actual program execution. The temporary file is deleted before program execution; after the compile step, a dialog allows the user to load the meta data into the tabular viewer.

3. Output of the meta data to the temporary file to check the results directly after program execution, i.e. at run time, as above.
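The three statements could be sketched as follows; file and symbol names (res_capreg.gdx, DATA) are assumptions for illustration:

```gams
* 1. store the meta data in the result GDX together with the
*    numerical values (DATA stands for the result parameters)
execute_unload 'res_capreg.gdx', DATA, META;

* 2. at compile time, dump the meta data collected so far to a
*    temporary file which the GUI offers to load into the viewer
$gdxout meta.gdx
$unload META
$gdxout

* 3. repeat the dump at run time to check the meta data directly
*    after program execution
execute_unload 'meta.gdx', META;
```

Note that the dollar control statements in step 2 are processed at compile time regardless of where they appear in the file, while the `execute_unload` statements run at execution time.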

Check of meta data after a compile

At the end of the run, the Java GUI checks if the file “meta.gdx” has been generated. If that is the case, it asks the user whether to load the meta information; if confirmed, the meta data are loaded into the tabular viewer.


The user may now control in detail what data had been processed by the program. Currently, in CAPREG, solely the REGIO data are documented by meta data; further data sets such as the Standard Gross Margins, the fertilizer data etc. should follow soon.

Exploitation of meta data in the GUI

The screenshot below shows the expanded GUI, where meta data can now be queried as well. Once the user presses “Exploit meta information”, the long texts from the cross set “META” are loaded from the GDX files into a table and shown to the user, as seen below. The underlying Java classes are a slightly expanded version of the earlier ones. As a by-product, CAPRI’s GDX viewer now also loads SETs into a tabular view.


The new GUI with exploitation possibilities for meta data

It may be astonishing that the meta information is four dimensional (member state, task, task, item), but that structure turned out to be useful. A three dimensional structure (member state, task, item) would, in the table above, report e.g. the generation date of the base year data from CAPREG used by the simulation run, and the same information for the trends. It would however not reveal, without loading the meta information for the trends in parallel, whether the trends and the simulation run were based on the very same version of the CAPREG data. It was therefore judged necessary to use four dimensions.

Underlying is the ability of GAMS to also copy the long texts attached to a cross set when copying it:
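The CAPTRD fragment is a screenshot in the original; according to the text, GAMS carries the long texts along when a cross set is copied. A sketch of the operation, with all set names and file names as illustrative assumptions, could read:

```gams
SETS MS "member states", TASK "work steps", ITEM "meta items";
SET  META(*,*,*,*)        "meta information";
SET  META_CAPREG(*,*,*,*) "meta data loaded with the base year data";

* load the meta data stored with the CAPREG base year data
$gdxin res_capreg.gdx
$load  META_CAPREG=META
$gdxin

* copy the entries over to the CAPTRD column; per the text, the
* attached long texts travel along with the tuples
META(MS,"CAPTRD",TASK,ITEM) = META_CAPREG(MS,"CAPREG",TASK,ITEM);
```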


The above code fragment from CAPTRD shows how the meta data are loaded from the GDX file with the CAPREG base year data, copied over to the CAPTRD column, and then discarded. That allows consecutive steps using the trend projection to see the meta data attached to the base year data used by CAPTRD. Currently, the meta data are integrated in the “tabular views” to allow the user to rapidly perform a few checks, e.g. to compare the generation date and time of the different data sets used.

6. Introduction of code fragments into GDX files

Alexander Gocht has programmed a small Java program which stores the lines of a text file into a GDX file. That program can be executed by GAMS at compile time to document important parts of the code directly in the result set generated by the program. It uses the very same basic mechanism as above, storing strings as long text descriptions of GAMS sets. The facility is currently integrated in CAPMOD to report the content of “fortran.gms”, i.e. all the information passed from the GUI to the model, and the content of the included policy file.

The following two statements show how (1) a possibly existing file named “%CURDIR%\temp.gdx” is deleted, and (2) the content of fortran.gms is stored in the set “SET_POL” in the file “%CUR_DIR%\temp.gdx”.
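The statements themselves are screenshots in the original; a sketch of what they plausibly look like, where the name of Alexander Gocht's Java tool is an assumption, is:

```gams
* (1) delete a possibly existing temp.gdx before regenerating it
$if exist "%CURDIR%\temp.gdx" $call del "%CURDIR%\temp.gdx"

* (2) store the content of fortran.gms as set SET_POL in temp.gdx,
*     using the small Java tool mentioned in the text (jar name assumed)
$call =java -jar txt2gdx.jar fortran.gms "%CURDIR%\temp.gdx" SET_POL
```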

A similar statement then adds the content of the policy file:

At the end of the program, the content of the set is loaded from the GDX file and then written to the result GDX file along with the numerical values and meta data:

The content of the file can be opened either with the GAMS IDE or with the “exploit GDX” facility of the GUI:


That possibility further increases the transparency of the data, as e.g. the details of the policy parameters passed in can be checked.

Baseline calibration in CAPMOD

The main additional data entering during baseline calibration relate to policy instruments. On the supply side, various information from legal texts is integrated in the current baseline policy definition file “mtr_hc.gms” and the files included by it. In most cases, such information is not available in a ready-to-use technical format, but needs to be edited manually (or via OCR). The meta data here consist in most cases of references to the Official Journal of the EU and are integrated via comments in the GAMS code, and not in the META data format given above, which would be too heavy (refer to the premium documentation). EU budget data are sampled in spreadsheets and converted from there into GAMS ASCII files ...


WTO notification data by the EU on subsidized exports are taken from WTO websites ...

Data on Tariff Rate Quotas ...

Meta data of results generated by CAPRI in XML format

Besides a systematic documentation of the data entering the CAPRI production chain (the main aim of this document), the interaction with other modelling groups also requires a standardized protocol to describe the output. Such interactions are increasingly common; in the past, CAPRI results had been used e.g. by the MITERRA model or by FSSIM. It often turned out quite tedious to interact with these groups, as a clear single-entry description of the generated result set in terms of coverage and resolution, mnemonics and units used was missing.

In order to provide such a description, the following DTD scheme is proposed:

The idea is to document the coverage/resolution in time (which years, comparative static results) as well as in space (which regions), and the resolution, especially which type of information is available at the regional level, the national level, and for trade blocks.

The data available from a CAPRI simulation run are grouped in four blocks: (1) those available at regional/farm type level, (2) those additionally available at country level for the countries covered by supply models, (3) data available for countries or blocks of countries with behavioural equations in the market model, and (4) those available for countries or blocks of countries with bi-lateral trade representation in the market model.

For each block of data, first the list of regions is given, with the code used in CAPRI and the long text.


After that list, the individual data cells are documented. The example below shows the data items documented for the soft wheat production activity. The catalogue lists for each available element (see also the DTD scheme above):

- The type: input and output coefficients, farm management data, economic and environmental indicators (type)
- The key for that element in CAPRI mnemonics (key)
- The related long text (longText)
- The physical unit (unit)
- The concatenated code in the data base (code)
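With the attribute names listed above, a single documented element might look like this in the XML catalogue; all values are illustrative, not taken from the actual file:

```xml
<!-- illustrative sketch of one catalogue entry; values assumed -->
<item type="input and output coefficients"
      key="YILD"
      longText="Yield of soft wheat"
      unit="kg/ha"
      code="SWHE.YILD"/>
```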


Similar lists are available for the three other blocks. For the international part, the XML file also documents the relation to the FAO data, both for regions:

and for products:

The format is extendable, so that e.g. mappings to EUROSTAT codes could also be integrated if available. The XML file is generated by the file “capriUnits.gms” and uses the results from the current ex-ante baseline to filter out empty table cells.

7. Summary and conclusions

The paper (1) provides an overview of the different data sets used in CAPRI, (2) shows how meta data are systematically integrated in the data flow, (3) discusses how these meta data are integrated in the GUI, and finally (4) discusses an XML schema and related data to document CAPRI output.

To (1): It is clear that it does not make sense to introduce meta data for each and every data set used in the same systematic way. It is however important that meta data are available for all data sets, although it is sometimes more useful to introduce them as program comments directly in the GAMS code. For larger and important data sets which are subject to regular updates, the proposed standardized format in GAMS should be used, so that the meta data can be inspected by the GUI. The document also underlines that the updates of the data, necessary checks and corrections notwithstanding, are by themselves a time demanding exercise.


To (2) and (3): The GDX based passing of meta information is a step forward to a more transparent handling of meta data in CAPRI. Its integration in the data files ensures that the information is not lost when the files are copied to the SVN server, downloaded from the SVN server, or in another way shifted between directories or workstations. The integration in the usual exploitation tools makes the meta data easy to handle for users familiar with the GUI. However, understanding the information provided is another task which may require some training. As often, the system will receive its final layout once used intensively in the real production chain.

The main advantage of the proposed solution is that the user can check, for any result set and Member State, the meta data on the underlying data sets, and at least verify whether the very same input data had been used for the different steps. The meta data will also only be available if the related GAMS program has successfully stored the numerical values as well. As such, the implementation is less error prone and more transparent than the existing one based on a separate XML file, which was never really put to work.

A further possibility, based on Java code by Alexander Gocht, is to dump the content of a text file, e.g. selected GAMS code passages, as sets into a GDX file, which allows storing critical code fragments together with the numerical data.

The expansion of the data viewer to handle strings instead of floats may lead to some surprises, and some further Java coding may be necessary to avoid run time errors. It must be decided to what extent e.g. the export of the meta information to the clipboard or to files is necessary.

To (4): That additional functionality is mainly of importance for model linkage.

Annex 1: Geographic Meta Data proposed for SENSOR and

SEAMLESS

The basic set proposed comprises the following attributes:

Issue | Required | ISO code
Title | * | 15.24.360

Metadata on metadata
Point of contact:
  Name of contact organisation | * | 8.376
  Name of contact person | * | 8.375
  Position of contact person | | 8.377
  Role of organisation | | 8.379
  Address: Delivery point | * | 8.378.389.381
  Address: City | * | 8.378.389.382
  Address: Province, state | * | 8.378.389.383
  Address: Postal code | * | 8.378.389.384
  Address: Country | * | 8.378.389.385
  Address: E-mail | * | 8.378.389.386
  Weblink | * |
Last modified | * | 9
Name of standard | | 10
Version of standard | | 11

Data set identification:
  Title of the data set | * | 15.24.360
  Alternative title | * | 15.24.361
  Abstract | * | 15.25
  Keywords | * | 15.33.53
  Topic category | * | 15.41
  Temporal coverage | * |
  Version of data set | * | 15.24.363
  Date of version | * | 15.24.362.394

Reference system:
  Name of reference system | (*) | 13.196.207
  Datum name | (*) | 13.192.207
  Ellipsoid:
    Name of ellipsoid | (*) | 13.191.207
    Semi-major axis | (*) | 13.193.202
    Axis units | (*) | 13.193.203
    Flattening ratio | (*) | 13.193.204
  Projection:
    Name of projection | (*) | 13.190.207
    Standard parallel | (*) | 13.194.217
    Longitude of central meridian | (*) | 13.194.218
    Latitude of projection origin | (*) | 13.194.219
    False easting | (*) | 13.194.220
    False northing | (*) | 13.194.221
    False easting northing units | (*) | 13.194.222
    Scale factor at equator | (*) | 13.194.223
    Longitude of projection centre | (*) | 13.194.224
    Latitude of projection centre | (*) | 13.194.225

Distribution information:
  Owner:
    Name of owner organisation | * | 15.29.376
    Name of contact person | | 15.29.375
    Position of contact person | | 15.29.377
    Role of owner organisation | | 15.29.379
    Address: Delivery point | | 15.29.378.389.381
    Address: City | | 15.29.378.389.382
    Address: Province, state | | 15.29.378.389.383
    Address: Postal code | | 15.29.378.389.384
    Address: Country | | 15.29.378.389.385
    Address: E-mail | | 15.29.378.389.386
  Originator:
    Name of originator organisation | | 15.29.376
    Name of contact person
    Position of contact person
    Role of originator organisation
    Address: Delivery point
    Address: City
    Address: Province, state
    Address: Postal code
    Address: Country
    Address: E-mail
  Processor:
    Name of processor organisation
    Name of contact person
    Position of contact person
    Role of processor organisation
    Address: Delivery point
    Address: City
    Address: Province, state
    Address: Postal code
    Address: Country
    Address: E-mail
  Distributor:
    Name of distributor organisation
    Name of contact person
    Position of contact person
    Role in distributor organisation
    Address: Delivery point
    Address: City
    Address: Province, state
    Address: Postal code
    Address: Country
    Address: E-mail
  On-line delivery

Access rights:
  Type of constraint | | 20.70
  Description of restriction | | 20.72

Other information:
  Language within the data set | * | 15.39
  Exchange format:
    Name of exchange format | * | 15.32.285
    Version of exchange format | * | 15.32.286
  Methodology description: | | 18.81.83
    Link to methodological report
    Changes since last version
  Process steps:
    Description of process steps | | 18.81.84.87
    Resource name | | 18.81.84.91.360
    Resource date | | 18.81.84.91.362
  Scale | * | 15.38.60.57
  Geographic accuracy | | 15.38.60.57
  Geographic box: | | 15.38.61
    West bound longitude | (*) | 15.45.336.344
    East bound longitude | (*) | 15.45.336.345
    South bound latitude | (*) | 15.45.336.346
    North bound latitude | (*) | 15.45.336.347
  Geographic coverage by name | * |
  List of attributes
  Data type (vector / raster)

References

FAO (GILW, Library and Documentation Systems Division) 2005. The AGRIS Application

Profile for the International Information System on Agricultural Sciences and Technology,

Guidelines on Best Practices for Information Object Description

(http://www.fao.org/docrep/008/ae909e/ae909e00.htm)

Hazeu, G., Verhoog, D., and Andersen, E. Metadata of environmental, farming system, socio-

economic and global data selected to be implemented in the knowledge base. SEAMLESS

deliverables PD 4.3.1., PD 4.4.1, PD 4.5.1., PD 4.6.1, February 2006

ISO 2011. ISO 19115:2003. General information from

http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=26020

EU-Commission 2008. Commission Regulation (EC) No 1205/2008 of 3 December 2008

implementing Directive 2007/2/EC of the European Parliament and of the Council as regards

metadata, OJ L 326, 4.12.2008, p. 12–30

EU-Commission 2010. INSPIRE Metadata Implementing Rules: Technical Guidelines based

on EN ISO 19115 and EN ISO 19119. Document identifier

MD_IR_and_ISO_v1_2_20100616, see

http://inspire.jrc.ec.europa.eu/documents/Metadata/INSPIRE_MD_IR_and_ISO_v1_2_20100

616.pdf