

Title: Guidelines (including options) on how the BR interacts with the SDWH

WP: 2 Deliverable: 2.2.2

Version: 4.0 - Final Date: 2-10-2013

Author: Pieter Vlag NSI: Netherlands

ESS - NET

ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

Contents

1. Introduction

1.1 Definition of a statistical DataWareHouse

1.2 The position of the Business Register in a statistical-DWH

1.3 The relationship between this document and architectural and technical elements of a statistical-DWH

2.3 Linking different data sources: units and populations

2.4 The Business Register and the population frame

2.5 The Business Register and the statistical DWH

3. Statistical units and population

3.1 Statistical units and population

3.2 A statistical-DWH: the population frame

3.3 Backbone of the statistical-DWH: integrated population frame, turnover and employment

3.4 Target populations of active enterprises

4. Linking datasources to the statistical unit

4.1 Linking other datasources to the backbone of the statistical-DWH

4.2 Variation in input units

4.3 Variation in output units

4.4 The statistical unit and the process of a statistical-DWH

4.5 The concept of a statistical unit base

5. Correcting information in the population frame and feedback to SBR

5.1 The position of the Business Register in a statistical Data Warehouse

5.2 Dealing with conflicting information

5.3 Panel surveys and correcting population characteristics

5.4 Timing of correcting data in the backbone and the SBR

5.5 Timing of feedback to the SBR

6. Conclusions

1. Introduction

1.1 Definition of a statistical Datawarehouse

The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare recommendations about better use of data that already exist in the statistical system and to create fully integrated data sets for enterprise and trade statistics at micro level: a 'data warehouse' approach to statistics.

The broad definition of a data warehouse to be used in this ESSnet is:

‘A common conceptual model for managing all available data of interest, enabling the NSI to (re)use this data to create new data/new outputs, to produce the necessary information and perform reporting and analysis, regardless of the data’s source.’ (FPA, 2010)

The project describes a generic datawarehouse (DWH) for statistics, or statistical datawarehouse (statistical-DWH), as a central statistical data store, regardless of the data’s source, for managing all available data of interest, enabling the NSI to:

- (re)use data to create new data/new outputs,
- perform reporting,
- execute analysis,
- produce the necessary statistical output.

This corresponds to a central repository able to hold several kinds of data (micro, macro and meta) entering the S-DWH, in order to support cross-domain production processes and statistical design, fully integrated in terms of data, metadata, process and instruments.

In practice, the statistical-DWH is subdivided into two separate environments:

The first is where all available information is collected and built-up, usually defined as Extraction, Transformation and Loading (ETL) functions. The aim of this environment is to create a set of fully integrated data.

The second is the actual data warehouse, i.e. where data analysis or mining is carried out and reports for executives are produced. The aim of this environment is to disseminate the fully integrated data as consistent outputs.

Figure 1 shows a schematic view of a commercial data warehouse with staging area. Data staging is represented as a small part of the datawarehouse which is focussed on data analysis, data mining and reporting tools. This illustrates the main difference between a commercial and a statistical-DWH. The statistical-DWH is also focussed on the staging area, i.e. the process of creating a set of fully integrated data (from different data sources), while the commercial datawarehouse is – in most cases - focussed on the actual datawarehouse to produce flexible outputs etc.


Figure 1 Architecture of a Data Warehouse with staging area. Illustration taken from the Oracle9i Data Warehousing Guide (2002).

Workpackage 2 (WP 2) of this ESSnet covers all essential methodological elements for designing, building and implementing the statistical-DWH. It concentrates on the methodological aspects of creating a set of integrated data. This document describes an essential part of the creation of an integrated dataset: the role of the statistical business register as a frame to integrate data from different sources.

1.2 The position of the Business Register in a statistical-DWH

The purpose of this document is to describe the central role in a statistical-DWH of the statistical units and the population frame, which includes the number of enterprises, total turnover derived from Value Added Tax (VAT) data and total employment derived from social security data. The population frame is the reference to which all flexible input data are linked to obtain an integrated set of data.

The position of the Business Register in a statistical-DWH is relatively simple in general terms. The Business Register provides information about statistical units, the population, turnover derived from VAT and wages plus employment derived from tax and/or social security data. As this information is available for almost all units, the Business Register allows us to produce flexible output for turnover, employment and number of enterprises.


The aim of the statistical-DWH is to link all other information to the Business Register in order to produce consistent and flexible output for other variables. In order to achieve this, an architectural and technical structure of a statistical-DWH has been developed. This architecture is described in paragraph 1.3.

We realise that some National Statistical Institutes (NSI) have separate production systems to calculate totals for turnover and employment outside the Statistical Business Register (SBR). These systems are linked to the population frame of the SBR. The advantage of such a separate process is that it acknowledges that producing admin data based turnover and employment estimates requires specific knowledge about tax rules and definition issues. Nevertheless, the final result of calculating admin data based totals for turnover and employment within or outside the SBR is the same. As this tax information is available for almost all units and linked with the SBR, it is possible to produce flexible output for turnover, employment and number of enterprises regardless of whether totals are calculated within or outside the Business Register.

Therefore, we discuss the role of (flexible) population totals like number of enterprises, turnover and employment in a statistical-DWH, but we do not discuss whether totals for turnover and employment should be calculated within or outside the SBR. This decision is left to the individual NSI.

The same is true for whether the SBR is part of the statistical-DWH or not. It is up to an individual NSI whether the statistical-DWH uses extracts from the SBR for period t (statistical units, population frame, total turnover derived from Value Added Tax (VAT) data, total employment derived from social security data) or includes the entire SBR-system with all this information. In chapter 2.5, however, we will discuss some pros and cons of including the SBR in a statistical-DWH or not.

1.3 The relationship between this document and architectural and technical elements of a statistical-DWH

Another workpackage (WP 3) covers all essential architectural and technical elements for designing, building and implementing the statistical-DWH. Basically, workpackage 3 has linked the GSBPM (Generic Statistical Business Process Model) sub-processes to the statistical-DWH concept. As a result, it has provided a Business Architecture for the statistical-DWH. Moreover, it has proposed a modular workflow for the statistical-DWH in order to manage the information flow between data sources and the central administration of a statistical-DWH. To do this, it uses four functional layers: data source layer, integration layer, interpretation and data analysis layer, data presentation layer.

Figure 2 shows the GSBPM model. Figure 3 shows the relationship between the phases of the statistical process as defined by the GSBPM and the functional layers as proposed by the workpackage 3 team.

Note that statistical (enterprise) units, which are needed to link independent input data sets with the population frame and, in turn, to relate the input data to statistical estimates, play an important role in the processing phase of the GSBPM. This processing phase corresponds with the integration layer of the statistical-DWH.

In the next chapters of this document, we will discuss in more detail where populations and statistical units play a crucial role and how this interfaces with the business registers.

Figure 2 A schematic sketch of the GSBPM (Generic Statistical Business Process Model). Note that the GSBPM divides the statistical process into 9 phases. These phases are divided into subprocesses.


Figure 3 Relationships between the layers of a statistical-DWH and the statistical processes according to the GSBPM (Generic Statistical Business Process Model).

2.3 Linking different data sources: units and populations

The aim of a statistical-DWH is to create a set of fully integrated data pertaining to enterprises, which enables a statistical institute to produce flexible and consistent output. The original data come from different data sources. Collection of these data takes place in the collect phase of the Business Architecture (fig. 2 – sub-process 4).

In practice, different data sources may cover different populations. The coverage differences may have different reasons:

a. the definition of an enterprise differs between the sources, i.e. sources have different units;
b. sources may include (or exclude) groups of enterprises which are excluded (or included) in other sources.

An example of the latter is the VAT-registration versus survey data. VAT-data (and some other tax data like corporate tax data) do not include the smallest enterprises, but include all other commercial enterprises. Survey samples contain information about a small selected group of enterprises, including the smallest enterprises. Hence, linking data of several sources is not only a matter of linking enterprises between the different input data but also a matter of relating all input data to a reference, the so-called population frame.

Different sources may have different units. For example, surveys are based on statistical units (which generally correspond with legal units), while VAT-units may be based on enterprise groups (as in the Netherlands). Hence, when linking VAT-data and survey-data to the target population, it is important to agree on the units to which data are linked.


Summarising, when linking several input data in a statistical-DWH, one has to agree about

- the population frame, i.e. the reference to which all data sources are linked,
- the enterprise unit to which all input data are matched.

Both challenges will be addressed in this deliverable. The technical aspects about linking of several data-sources are described in deliverable 2.4 of the ESSnet on Datawarehousing (DWH).

2.4 The Business Register and the population frame

Member States of the European Union maintain business registers for statistical purposes as a conduit for the preparation and coordination of surveys, as a source of information for the statistical analysis of the business population and its demography, for the use of administrative data, and for the identification and construction of statistical units. Regulation (EC) No 177/2008 of the European Parliament and of the Council sets out a common framework for the harmonisation of the national business registers for statistical purposes, and Article 7 of the Regulation asks for the publication of a business register recommendation manual. The manual aims to explain the reasoning behind the provisions of the Regulation and to provide the extra information required for the correct and consistent interpretation of the Regulation in all countries. The latest edition of the manual was published in 2010. The manual has been updated in close cooperation with the Member States. The regulation and manual imply that the business register contains at least a statistical unit, the name and address of the statistical unit, an activity-code (NACE), and the starting and stopping dates of enterprises.

The implication for the statistical-DWH is that the required information about the reference or population frame, e.g. units and populations (see paragraph 2.3), can be obtained from the SBR. Hence, the SBR is a crucial input for the statistical-DWH.

2.5 The Business Register and the statistical DWH

The implication of the previous paragraphs is that the population frame derived from the SBR is a crucial part of the statistical-DWH. It is the reference to which all data sources are linked. However, this does not mean that the SBR itself is part of the statistical-DWH. A very good practical solution is that

- the population frame is derived from the SBR for every period t,
- these snapshots of population characteristics for the periods t are used in the statistical-DWH.


By choosing this option the maintenance of the SBR is separated from maintenance of the statistical-DWH. Both systems are however linked by the same population characteristics for period t. This option is called SBR outside the statistical DWH.

Another option is that the entire SBR-system is included in the statistical-DWH. The advantage of this approach is that corrected information about populations in the statistical-DWH is immediately implemented in the SBR. However, this may lead to consistency problems if outputs are produced outside the statistical-DWH (as the ‘corrected’ information is not automatically incorporated in these parts of the SBR). Maintenance problems may arise as a system including both the production of a SBR as well as flexible statistical outputs may be large and quite complex. This option is called SBR inside the statistical DWH.

In our opinion, it is up to the individual NSIs whether the SBR should be inside or outside the statistical-DWH, because the coverage of the statistical-DWH (it may include all statistical inputs and outputs or only parts of them) may differ between countries. Furthermore, we did not investigate the crucial maintenance factor.

In the remaining part of this report, we consider the option “SBR outside the statistical DWH” only. This choice has been made for the sake of clarity. Apart from the last paragraph of chapter 6, which is not relevant in the case of “SBR inside the statistical DWH”, this choice does not affect the other conclusions of this report.

3. Statistical units and population

3.1 Statistical units and population

Taking into account the expected recommendations of the ESSnet on Consistency, it is proposed that the statistical enterprise unit is the standard unit in business statistics. Ideally, the statistical community should have the common goal that all Member States use a unique identifier for enterprises based on the statistical unit. Therefore, the statistical-DWH uses only statistical enterprise units as standard units. As long as a unique identifier for enterprises has not been realised, data from sources not using the statistical unit are linked to the statistical unit in a statistical-DWH. For further analyses, it is recommended that the statistical-DWH uses only the statistical unit as a standard, because it is quite complicated to use several units in the treatment of data. As a consequence, (standard) enterprise populations are based on statistical units.

In line with the SBS-regulation the following definition for an enterprise population is used in this paper: all enterprises with a certain kind of activity being economically active during the reference period. For annual statistics this means that the target population consists of all enterprises active during the year, including the starters and stoppers (and the new/stopping units due to merging and splitting companies). Such a population is called the target population in methodological terms, i.e. the population to which the estimates refer. The NACE-code is used to classify the kind of activity.


Note that target populations can be flexible in a statistical-DWH, because a statistical-DWH is meant to produce flexible outputs. When processing and analysing data, it is recommended to consider the target populations of the annual SBS and monthly or quarterly STS. These are important obligatory statistics. More importantly, these statistics define the enterprise population to its widest extent. According to regulations, they include all enterprises with some economic activity during (part of) the period. Hence, by using these populations as standard:

- all other data sources could be linked to this standard, because they cannot cover a wider population in the SBS/STS domain from a theoretical point of view,
- all other publications derived from the statistical-DWH are basically subgroups from the SBS/STS-estimates.

Furthermore, the output obligations of the annual SBS and monthly or quarterly STS are quite detailed in terms of different kinds of activity (NACE-codes). We propose that the SBS and STS-output obligations are also used as the standard to check, link, clean and weight the input data in the processing phase of the statistical-DWH (see figs. 2/3).

A statistical-DWH is designed to produce flexible output. However, as the standard SBS- and STS-populations are the widest in terms of economic activity during the period and quite detailed in terms of kind of activity, most other populations can be considered as subpopulations of these standards. Examples of subpopulations are:

- large or small enterprises only,
- all enterprises active at a certain date,
- even more detailed kind of activity populations (i.e. estimates at NACE 3/4-digit level).

Domain estimators or other estimation techniques can be used to determine these subtotals, if the amount of available data is sufficient and there are no problems with statistical disclosure. Estimation techniques and outlier detection in flexible outputs are more extensively discussed in other deliverables of the ESSnet on Datawarehousing. Checking, cleaning, integrating and weighting the input data in a statistical-DWH are further discussed in chapter 5 of this deliverable, but we also refer to other deliverables of the ESSnet on Datawarehousing for further information.
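As an illustration of what a domain estimate for a subpopulation can look like, the sketch below computes a simple expansion-type domain total: the weighted sum of a variable over the sampled units that fall in the domain. It is a minimal sketch only. The function name, the record layout and all values are assumptions made for this example, and it does not address disclosure control or the more refined estimators discussed in the other deliverables.

```python
from typing import Callable, Dict, List


def domain_total(sample: List[Dict], in_domain: Callable[[Dict], bool]) -> float:
    """Weighted sum of `value` over the sampled units belonging to the domain."""
    return sum(u["weight"] * u["value"] for u in sample if in_domain(u))


# Invented sample records: survey weight and observed value per enterprise
sample = [
    {"nace": "47.11", "size": "small", "weight": 10.0, "value": 5.0},
    {"nace": "47.11", "size": "large", "weight": 1.0, "value": 200.0},
    {"nace": "47.19", "size": "small", "weight": 8.0, "value": 3.0},
]

# Subpopulation "small enterprises only"
small_total = domain_total(sample, lambda u: u["size"] == "small")

# Subpopulation defined by a detailed kind of activity
nace_4711_total = domain_total(sample, lambda u: u["nace"] == "47.11")
```

The same function serves any subpopulation that can be expressed as a condition on frame characteristics, which is the sense in which the outputs are flexible.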

3.2 A statistical-DWH: the population frame

To determine the population in the statistical-DWH, two types of information are needed:

- the population frame, i.e. a list of enterprises with a certain kind of activity during a period,
- information to determine which enterprises of the list really performed economic activities during a period.

As previously mentioned, the population frame is derived from the SBR. This population frame consists of all enterprises within the SBR during the year, regardless of whether they are active or not. To derive activity status and subpopulations, it is recommended that the population frame includes the following information:

1) the frame reference year
2) the statistical unit enterprise, including its national ID and its EGR ID (footnote 1)
3) the name and address of the enterprise
4) the date in population (mm/yr)
5) the date out of population (mm/yr)
6) the NACE-code
7) the institutional sector code
8) a size class (footnote 2)

Note that a population frame is crucial for a statistical-DWH. Target populations, i.e. populations belonging to estimates, for the flexible outputs are derived from it!
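The recommended frame contents can be pictured as one record per statistical unit. The following is a minimal sketch only: the class name, field names, types and example values are assumptions made for illustration and are not prescribed by this deliverable.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (year, month): an assumed representation of the mm/yr dates in the frame
YearMonth = Tuple[int, int]


@dataclass(frozen=True)
class FrameRecord:
    """One enterprise in the population frame (hypothetical field names)."""
    reference_year: int            # 1) frame reference year
    national_id: str               # 2) statistical unit: national ID
    egr_id: Optional[str]          # 2) statistical unit: EGR ID, if available
    name: str                      # 3) name of the enterprise
    address: str                   # 3) address of the enterprise
    date_in: YearMonth             # 4) date in population
    date_out: Optional[YearMonth]  # 5) date out of population (None: still in)
    nace_code: str                 # 6) NACE-code
    sector_code: str               # 7) institutional sector code
    size_class: str                # 8) size class


# Invented example values, for illustration only
example = FrameRecord(
    reference_year=2012,
    national_id="NL-000001",
    egr_id=None,
    name="Example Enterprise",
    address="Example Street 1",
    date_in=(2009, 4),
    date_out=None,
    nace_code="47.11",
    sector_code="S11",
    size_class="small",
)
```

The record is declared immutable here because a frame for a given period is used as a fixed snapshot; corrections would then produce a new record rather than change an existing one. That is a design choice of this sketch, not a requirement stated in the text.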

To determine the activity status of an enterprise, i.e. to estimate whether the enterprise really carried out economic activities, a comparison with VAT and/or employment data is needed. If VAT data reveal turnover above a certain threshold and/or employment data reveal paid wages above a certain threshold, the enterprise is considered active. This is discussed in the next section (3.3) and is one of the reasons that VAT and/or employment data are crucial elements of the statistical-DWH. Section 3.4 discusses how target populations can be determined in two specific cases:

- the statistical-DWH is limited to annual statistics (3.4.1),
- the statistical-DWH includes short-term statistics, too (3.4.2).
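The threshold rule for activity status can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the function name, the parameter names and the default thresholds of zero are hypothetical, since the text only states that thresholds exist.

```python
from typing import Optional

# Hypothetical default thresholds; the text does not specify any values
TURNOVER_THRESHOLD = 0.0
WAGES_THRESHOLD = 0.0


def is_active(vat_turnover: Optional[float],
              paid_wages: Optional[float],
              turnover_threshold: float = TURNOVER_THRESHOLD,
              wages_threshold: float = WAGES_THRESHOLD) -> bool:
    """Return True if VAT turnover and/or paid wages exceed their threshold.

    None means that no record was found in the administrative source.
    """
    has_turnover = vat_turnover is not None and vat_turnover > turnover_threshold
    has_wages = paid_wages is not None and paid_wages > wages_threshold
    return has_turnover or has_wages


status_vat_only = is_active(1000.0, None)    # turnover reported, no wage record
status_wages_only = is_active(None, 500.0)   # wages paid, no VAT record
status_no_data = is_active(None, None)       # unit absent from both sources
```

Treating a missing record as "not above the threshold" is one possible convention; an NSI may prefer to flag such units for review instead.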

3.3 Backbone of the statistical-DWH: integrated population frame, turnover and employment

The results of the ESSnet on Admin Data showed that VAT and social security data can be used for turnover and employment estimates when they are quasi-complete. This is the case for annual statistics and for quarterly statistics in most European countries on the continent. Note however that VAT and social security data can only be used for statistical purposes if a) the data transfer from the tax office to the statistical institute is guaranteed and b) the link with the statistical unit is established. As already mentioned in chapter 1.2, there are two ways to obtain totals for turnover and employment:

- the VAT and employment data are processed within the SBR,
- separate systems, linked to the SBR, process the VAT and social security data.

In this paper we do not discuss the pros and cons of each approach, as it is partly an organisational decision for the NSIs. For this paper, we assume that totals are produced for

1. number of enterprises,
2. turnover,
3. employment

with administrative data covering quasi-all enterprises in the SBS/STS domain. These totals are integrated because they are all based on the statistical unit and all classified by activity by using the NACE-code from the population frame. Hence, these three integrated totals together represent the basic characteristics of the enterprise population. Therefore, these three totals can be considered as the backbone of the statistical-DWH. All other data sources are linked to these three totals in the statistical-DWH and made consistent with them. This chapter mentions some aspects for VAT and social security data.

Footnote 1: arbitrary ID assigned by the EGR system to enterprises; it is advised to include this ID in the datawarehouse to enable comparability between the country-specific estimates.

Footnote 2: could be based on employment data.

VAT and social security cover almost all enterprises in the domain covered by the SBS and STS-regulations and are available in a timely manner (i.e. earlier than most annual statistics). They are crucial

1. to determine the activity status of the enterprises and implicitly to determine the target populations of active enterprises,

2. to create a fully integrated dataset suitable for flexible outputs, because these administrative data sources contain information about almost all enterprises (unlike surveys, which contain information about only a small sample of enterprises).

The latter reason is explained further in the remainder of this section. When (quasi) complete, VAT and social security data can be used to produce good-quality estimates of turnover and employment. Therefore, these estimates can, together with the population frame (i.e. number of enterprises, NACE-code etc.), be used as benchmarks when incorporating results of sample surveys in a statistical-DWH. In this case, the totals of turnover and employment define, together with the number of enterprises, the basic population characteristics. These three characteristics are assumed to be correct unless otherwise proven. Other datasets or surveys covering more specific parts of the population should be made consistent with these three main characteristics of the entire population. In the case of inconsistencies, the population characteristics are considered correct, and survey data or other datasets are modified by adapting weights or by data editing. As these three main characteristics (population frame, turnover, employment) are integrated, available at micro-level (statistical unit) and considered correct, and as all other sources are linked to them and made consistent with them, these characteristics are the backbone of the statistical-DWH. This backbone is considered the authoritative source of the statistical-DWH because its information is assumed to be correct unless otherwise proven.

The concept of the backbone improves the quality of integrated datasets and flexible outputs of a statistical-DWH. This is because more auxiliary information, in addition to the number of enterprises, is used when weighting survey results (or other datasets) or when imputing missing values. For example, VAT and social security data can be used as auxiliary information when weighting variables derived from surveys. Many studies have shown that estimates based on weighting techniques using auxiliary information (e.g. ratio or GREG-type estimators) have lower sampling errors than estimates that do not use auxiliary information, provided that the survey variables are well correlated with the auxiliary variables. Using VAT and social security data as auxiliary information when weighting also corrects for unrepresentativeness in the data sources. Hence, it improves the accuracy of estimates (and reduces their biases) for variables which are derived from data sources representing a specific part of the population. We refer to other deliverables of the ESSnet on Data Warehousing for further details about this subject.
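To make the use of auxiliary information concrete, the sketch below shows a plain ratio estimator: the survey-weighted total of a survey variable is rescaled by the ratio between the known backbone total of VAT turnover and the survey-weighted total of VAT turnover in the sample. It is an illustration only. The function name, the tuple layout and all numbers are assumptions, and GREG-type estimators are not covered.

```python
from typing import List, Tuple

# Each sampled unit: (survey weight, VAT turnover x, survey variable y)
SampleUnit = Tuple[float, float, float]


def ratio_estimate(sample: List[SampleUnit], aux_total: float) -> float:
    """Ratio estimate of the total of y, using the known total of x."""
    weighted_y = sum(w * y for w, x, y in sample)
    weighted_x = sum(w * x for w, x, y in sample)
    if weighted_x == 0:
        raise ValueError("weighted auxiliary total is zero; ratio undefined")
    return aux_total * weighted_y / weighted_x


# Invented sample and invented backbone total of VAT turnover
sample = [
    (10.0, 100.0, 20.0),
    (10.0, 300.0, 60.0),
    (5.0, 1000.0, 210.0),
]
vat_turnover_total = 9900.0

estimate = ratio_estimate(sample, vat_turnover_total)
expansion_only = sum(w * y for w, x, y in sample)  # estimate without auxiliary data
```

In this invented example the sample under-represents turnover (a weighted total of 9000 against a backbone total of 9900), so the ratio estimate is scaled up relative to the expansion-only estimate. This is the correction for unrepresentativeness described in the text.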

Summarising, using a backbone with integrated population, turnover and employment data

- improves the quality of a fully integrated dataset using several input data sets, as two key variables for statistical outputs (turnover and employment) can be estimated precisely,
- reduces the impact of sampling errors or biases in estimates for variables derived from other data sources, because turnover and/or employment can be used as auxiliary information when weighting.

As the first point is the aim of a statistical-DWH and the second is required to produce flexible output (especially about subgroups of the standard SBS and STS-population), this is the main argument for considering a backbone of integrated totals of number of enterprises (=population), employment and turnover as the heart of a statistical-DWH.

The second reason to consider a backbone with integrated data about number of enterprises (=population), employment and turnover as the heart of the statistical-DWH is the determination of the activity status of an enterprise. Several NSIs use VAT and social security data for this purpose. More precisely, enterprises are considered active if VAT and/or social security data are available for the reference period or the previous period (in case of late VAT or late social security data). This method is preferred over a survey to determine the activity status, because a survey might be biased due to high non-response rates among enterprises that have ceased trading. Summarising, VAT and social security data are crucial to determine whether an enterprise has been active or not. Hence, quasi-complete turnover and employment data are crucial to determine target populations consisting of active enterprises. This is the second reason to use a backbone of integrated totals of number of enterprises (=population), employment and turnover in a statistical-DWH.
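The availability rule described here (data present for the reference period or, to allow for late deliveries, the previous period) can be sketched as below. This is a minimal illustration. Representing periods as consecutive integers, and all names and data, are assumptions of this sketch.

```python
from typing import Dict, Set


def active_by_availability(unit_id: str,
                           period: int,
                           vat_periods: Dict[str, Set[int]],
                           ss_periods: Dict[str, Set[int]]) -> bool:
    """Active if VAT and/or social security data exist for `period` or `period - 1`.

    Periods are consecutive integers (e.g. a running quarter counter),
    which is an assumption of this sketch.
    """
    reported = vat_periods.get(unit_id, set()) | ss_periods.get(unit_id, set())
    return period in reported or (period - 1) in reported


# Invented data: periods for which each unit has records in each source
vat = {"A": {1, 2, 3}, "B": {1}}
social_security = {"C": {2}}

status = {u: active_by_availability(u, 3, vat, social_security)
          for u in ["A", "B", "C", "D"]}
```

Unit "C" is counted as active in period 3 on the basis of its period 2 record only, which mirrors the allowance for late data. Unit "B", whose last record is two periods old, is not.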

A schematic sketch of the position of the backbone with integrated population, turnover and employment data is provided in figure 4.


Figure 4 Illustration of the position of the SBR and the backbone with integrated data about number of enterprises (=population), VAT-turnover and employment derived from social security data in a statistical-DWH. This backbone is represented by a line within the GSBPM phase 5.1. All other data sources are integrated to this backbone at GSBPM phase 5.1, which is at the beginning of the processing phase. The same backbone is also used for weighting when producing outputs at the end of the processing phase (see line in GSBPM steps 5.7 and 5.8). In this figure VAT, social security data and population are represented as different datasources with separate processes to integrate them. Note that this integration can also be done within the SBR (dotted lines via SBR) or outside the SBR (dotted lines directly to turnover, employment etc.).

3.4 Target populations of active enterprises

3.4.1 Case 1: Statistical DataWareHouse is limited to annual statistics

The determination of a target population with active enterprises only is relatively easy, if the scope of the statistical-DWH is limited to annual statistics. This case is relatively easy because the required information about population totals, turnover and employment can be selected afterwards, i.e. when the year has finished. This is because annual surveys are designed after the year has ended and results of surveys and other datasources with annual data (like accountancy data + totals of four quarters) become available after the year has ended, too. Hence, no provisional populations are needed to link provisional data during the calendar year. Therefore, the population frame can be determined by

- selecting all enterprises which are recorded in the SBR during the reference year,
- using the complete annual VAT and social security dataset to determine the activity status and totals for turnover and employment.
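The two steps above can be sketched as follows for the annual case. This is an illustration under stated assumptions: the record layout, the function names, the convention that a unit leaving in the reference year still belongs to the frame, and the zero threshold for activity are all choices made for this example.

```python
from typing import Dict, List, Optional, Tuple

YearMonth = Tuple[int, int]  # (year, month), assumed representation


def recorded_during_year(date_in: YearMonth,
                         date_out: Optional[YearMonth],
                         year: int) -> bool:
    """True if the unit was recorded in the SBR during (part of) `year`."""
    entered_by_year_end = date_in[0] <= year
    left_before_year = date_out is not None and date_out[0] < year
    return entered_by_year_end and not left_before_year


def annual_target_population(units: List[dict],
                             annual_turnover: Dict[str, float],
                             annual_wages: Dict[str, float],
                             year: int) -> List[str]:
    """Step 1: select the frame for `year`. Step 2: keep units with admin data.

    A unit counts as active here if its annual VAT turnover or paid wages
    are positive; the zero threshold is an assumption of this sketch.
    """
    frame = [u for u in units
             if recorded_during_year(u["date_in"], u["date_out"], year)]
    return [u["id"] for u in frame
            if annual_turnover.get(u["id"], 0.0) > 0.0
            or annual_wages.get(u["id"], 0.0) > 0.0]


# Invented units and admin data
units = [
    {"id": "A", "date_in": (2010, 3), "date_out": None},
    {"id": "B", "date_in": (2012, 7), "date_out": (2012, 11)},  # starter and stopper
    {"id": "C", "date_in": (2009, 1), "date_out": (2011, 12)},  # left before 2012
    {"id": "D", "date_in": (2012, 2), "date_out": None},        # in frame, no admin data
]
turnover = {"A": 120.0, "B": 15.0, "C": 40.0}
wages = {"A": 30.0}

active_2012 = annual_target_population(units, turnover, wages, 2012)
```

Unit "B" illustrates why starters and stoppers within the year remain in the annual target population, while unit "D" shows a unit that is in the frame but is not counted as active.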

3.4.2 Case 2: the Statistical Data Warehouse includes short-term statistics

The determination of a target population with only active enterprises becomes more complicated when the production of short-term statistics is incorporated in the statistical-DWH. In this case a provisional population frame for reference year t should be constructed at the end of year t-1, i.e. in November or December. This population frame is used to design short-term surveys. It is also the starting point for the statistical-DWH. This provisional frame is called release 1; formally, it does not cover the entire population of year t, as it does not yet contain the starting enterprises.

During the year the backbone of the statistical-DWH is regularly updated with new information about population (new, stopped, merged and split enterprises), activity, turnover and employment. The frequency of these updates depends on the updates of the SBR and, related to this, on the updating information provided by the admin data holders (VAT and social security data). At the end of year t (or at the beginning of year t+1), a regular population frame for year t can be constructed. This regular population frame consists of all enterprises in the year and is called release 2.
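The relation between release 1, the periodic updates and release 2 can be illustrated with a small sketch. The function names and the dictionary layout are hypothetical, and mergers and splits are left out for brevity; only births and deaths are modelled.

```python
def release_1(sbr_snapshot_end_prev_year):
    """Provisional frame: units known at the end of year t-1 (no starters of year t)."""
    return set(sbr_snapshot_end_prev_year)

def update_backbone(frame, births=(), deaths=()):
    """Apply one periodic update. Stopped units are flagged, not dropped,
    because they still belong to the population of year t."""
    units = set(frame["units"]) | set(births)
    stopped = (set(frame["stopped"]) | set(deaths)) & units
    return {"units": units, "stopped": stopped}

frame = {"units": release_1({"A", "B", "C"}), "stopped": set()}
frame = update_backbone(frame, births={"D"}, deaths={"B"})   # e.g. first update
frame = update_backbone(frame, births={"E"})                 # e.g. second update

release_2 = frame["units"]                    # all enterprises that existed in year t
still_active = release_2 - frame["stopped"]   # units not stopped during year t
```

Keeping stopped units in the frame reflects the definition of release 2 as all enterprises in the year, not only those alive at year end.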

The ESSnet on Administrative Data has observed that time-lags exist between the registration of starting/stopping enterprises in the SBR (if based on Chamber of Commerce data) and other admin data sources like tax information or social security data. The impact of these time-lags differs for each country, because it depends on

the updates of both 1) the population frame in the SBR and 2) VAT and social security data from the admin data holders (in the SBR),

the quality of the underlying data sources.

Although their impact differs, the ESSnet on Administrative Data has shown that these time-lags exist in every country and lead to revisions in monthly and quarterly estimates about active enterprises. This effect is enhanced because the admin data are not entirely complete on a quarterly basis. These time-lag and incompleteness issues might be a reason for choosing a low frequency for updating the backbone in a statistical-DWH. For example, quarterly and/or bi-annual updates could be considered.

4. Linking datasources to the statistical unit

4.1 Linking other datasources to the backbone of the statistical-DWH

As previously mentioned, the backbone – or heart - of a statistical-DWH consists of an integrated set of

population characteristics (the so-called population frame)

turnover and employment data.

The main characteristic of this backbone is that this integrated information is available at micro level. Hence, we have information about activity, size, turnover and employment of (almost) every enterprise. Population totals can be obtained by adding the information of the individual enterprises. This information can be derived from the SBR or – depending on the choice of a NSI – from systems linked to the SBR.

As previously mentioned the backbone is based on statistical units only. Hence, when other datasources are integrated with the backbone, they should be linked to the statistical units as a first step. Technical aspects of data linking are described in deliverable 2.4 of the ESSnet on Datawarehousing. The next chapter of this document addresses the question of what information is required to link the several input sources to the statistical unit.

4.2 Variation in input units

Ideally a unique identifier for enterprises based on the statistical unit exists already. Data linkage is simple in this case. In practice, accountancy data, tax data (including VAT and social security data) and other data may be reported for different parts within an enterprise group. These data might be reported for the enterprise group as a whole, the underlying legal units or tax units consisting of other parts of the enterprise group. The variation in units and the challenge of linking them depend on the national legislation. Therefore the impact of this issue differs for each country. The size of the enterprise also determines the variation in units and the complexity of linking them. For small enterprises one-to-one relationships between the different units can be assumed, but this assumption cannot be made for medium-sized and large enterprises. Nevertheless, whatever the extent of these issues in individual countries and whatever the determination of the statistical unit, it cannot be assumed that all input data link automatically to the statistical unit. Hence, the relationship between these ‘input’ units and the statistical units should be known before the data can be linked.

Data linking is of less importance when using surveys only, because surveys are generally based on statistical units (as they are designed from SBR information). Data linking is more important for the statistical-DWH, because it also uses other data sources.

4.3 Variation in output units

Most statistical estimates in enterprise statistics are produced for the statistical unit enterprise. Examples are SBS, STS and most institutional statistics. However, some output is produced for other units like local units, LKAUs, KAUs or enterprise groups. Again the complexity of linking these units depends on the country and the size of the enterprises. Nevertheless, one-to-one relationships between these output units and the statistical enterprise unit cannot be taken for granted. Hence, relationships between the ‘output’ units and the statistical units should be known before flexible outputs can be generated. As producing flexible output is a main characteristic of a statistical-DWH, the existence of several output units is an issue for a statistical-DWH.

4.4 The statistical unit and the process of a statistical-DWH

The simplest and most transparent statistical process can be generated by

linking all input sources to the statistical enterprise unit at the beginning of the processing phase (GSBPM-step 5.1 – see figures 2,3).

performing data cleaning, plausibility checks and data integration on statistical units only (GSBPM steps 5.2-5.6).

producing statistical output (GSBPM-steps 5.7-5.8) by default on the statistical unit and the target populations according to the SBS and STS regulations. Flexible outputs on other target populations and other units are also produced in these steps by using repeated weighting techniques and/or domain estimates. Technical aspects of these estimation methods are described in deliverable 2.8 of the ESSnet on Datawarehousing.

Note that it is theoretically possible to perform data analysis and data cleaning on several units simultaneously. However, the experience of Statistics Netherlands with cleaning VAT-data on statistical units and ‘implementing’ these changes on the original VAT-units too reveals that the statistical process becomes quite complex. Therefore, it is proposed that

linking to the statistical units is carried out at the beginning of the processing phase only,

the creation of a fully integrated dataset is done for statistical units only,

statistical estimates for other units are produced at the end of the processing phase only,

relationships between the different in- and output units on the one hand and the statistical enterprise units on the other hand should be known (or estimated) beforehand.

4.5 The concept of a statistical unit base

As mentioned in paragraph 3.1, the statistical-DWH uses only one unit when processing the data: the statistical enterprise unit. The statistical community should aim for all Member States to use a unique identifier for enterprises based on the statistical unit, which has the advantage that all datasources can be easily linked to the statistical-DWH. In practice, dataholders may use several definitions of enterprises in some countries. As a result, several enterprise units may exist. Related to this, different definitions of units may also exist when producing output (LKAU, KAU, etc.).

The relationship between the different in- and output units on the one hand and the statistical enterprise units on the other hand should be known (or estimated) before the processing phase, because it is a crucial step for datalinking and producing output. Maintaining this relationship in a database is recommended when outputs are produced by releases; e.g. newer more precise estimates when more data (sources) become available. This prevents redoing a time-consuming linking process at every flexible estimate.

It is proposed that the information about the different enterprise units and their relationships at microlevel is kept by using the concept of a so-called unit base. This base should at least contain

the statistical enterprise, which is the only unit used in the processing phase of the statistical-DWH.

the enterprise group, which is the unit for some output obligations. Moreover the enterprise group may be the base for tax and legal units, because in some countries, like the Netherlands, the enterprise unit is allowed to choose its own tax and legal units of the underlying enterprises.

The unit base contains the link between the statistical enterprise, the enterprise group and all other units. Of course, it should also include the relationship between the enterprise group and the statistical enterprise. In case of x-to-y relationships between the units, i.e. one statistical unit corresponds with several units in another data source or vice versa, the estimated share in terms of turnover (or employment) of the ‘data source’ units in the corresponding statistical enterprise(s) and enterprise group needs to be recorded. This share can be used to relate levels of variables from other datasources based on enterprise unit x1 to levels of turnover and employment in the backbone based on the (slightly different) statistical enterprise unit x2. We refer to deliverable 2.4 of the ESSnet on Datawarehousing for further information about data linking and estimating shares.
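To make the role of the shares concrete, the following minimal sketch converts values reported on source units to statistical enterprises. The row layout, the unit identifiers and the function name are hypothetical, and the shares are assumed to be given; estimating them is outside the scope of this sketch.

```python
from collections import defaultdict

# Hypothetical unit base rows: (source_unit, statistical_enterprise, share).
# The shares of one source unit are assumed to sum to 1.
unit_base = [
    ("VAT-1", "ENT-1", 1.0),    # one-to-one relationship
    ("VAT-2", "ENT-2", 0.75),   # one tax unit covering two enterprises
    ("VAT-2", "ENT-3", 0.25),
    ("VAT-3", "ENT-3", 1.0),    # a second tax unit inside ENT-3
]

def to_statistical_units(source_values, unit_base):
    """Distribute values reported on source units over statistical enterprises."""
    totals = defaultdict(float)
    for source_unit, enterprise, share in unit_base:
        if source_unit in source_values:
            totals[enterprise] += share * source_values[source_unit]
    return dict(totals)

vat_turnover = {"VAT-1": 200.0, "VAT-2": 1000.0, "VAT-3": 50.0}
by_enterprise = to_statistical_units(vat_turnover, unit_base)
```

Since the shares of each source unit sum to 1, the total turnover is the same before and after the conversion; only its distribution over units changes.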

The unit base can be subdivided into ‘input’ units, used to link the different datasets to the statistical enterprise unit at the beginning of the processing phase (GSBPM-step 5.1: “integrate data”), and ‘output’ units, used to produce output on units other than the statistical enterprise at the end of the processing phase (GSBPM-step 5.7/5.8: “calculate aggregates”).

Figure 5 illustrates the concept of a unit base. It shows that the unit base can be subdivided into

input units, used to link the datasources to the statistical enterprise unit at the beginning of the processing phase (GSBPM-step 5.1: “integrate data”)

output units, which are used to produce output about units other than the statistical enterprise at the end of the processing phase (GSBPM-step 5.7/5.8: “calculate aggregates”). Examples are output about enterprise groups, LKAUs etc.

The exact contents of the unit base (and related to this its complexity) depends on

legislation for a particular country,

output requirements and desired output of a statistical-DWH,

available input data.

It is a matter of debate

whether the concept of a unit base should be included in the SBR or whether the concept of a unit base should result in a physically independent database.

In the latter case it is closely related to the SBR, because both contain the statistical enterprise. Basically, the choice depends on the complexity of the unit base. If the unit base is complex, the maintenance becomes more challenging and a separate unit base might be considered. The complexity depends on

the number of enterprise units in a country,

the number of (flexible) datasources an NSI uses to produce statistics.

As these factors differ by country and NSI, the decision to include or exclude the concept of a unit base in the SBR depends on the individual NSI and won’t be discussed further in this paper.

Figure 5 Example of the concept of a unit base.

5 Correcting information in the population frame and feedback to SBR

5.1 The position of the Business Register in a statistical Data Warehouse

The position of the SBR in a statistical DWH is three-fold. More precisely

the SBR is the input source for the backbone of the statistical-DWH: integrated data about enterprise populations, turnover and employment,

the SBR is closely related to the unit base,

the SBR is the sampling frame for the surveys, which are another important data source of the statistical-DWH (for variables which cannot be derived from admin data).

The last point implies that errors in the backbone, which might be detected during the statistical process, should be fed back to the SBR. Hence, a process to incorporate revised information from the backbone of the statistical-DWH into the SBR should be established. Otherwise, the same errors will return in survey results in subsequent periods.

The key questions are:

At which step of the process of the statistical-DWH is the backbone corrected when errors are detected?

How is revised information from the backbone of integrated sources in the statistical-DWH incorporated in the SBR?

The position of the SBR and its relationships with the backbone, unit base and surveys are illustrated in Figure 6.

Figure 6 This figure also shows the position of a) data integration, b) ‘weighting/calculation of aggregates’ in the statistical process and c) the step within the statistical process at which the backbone of the statistical-DWH is corrected in case of influential errors: GSBPM-step 5.7. At this step feedback to the SBR is also provided. Note that data sources for the backbone are denoted by brown cylinders and other input data by light blue cylinders.

5.2 Dealing with conflicting information

As mentioned previously, the backbone of the statistical-DWH consists of an integrated set of

population characteristics (statistical enterprises units, size and activity, the so-called population frame),

turnover data derived from the Value Added Tax (VAT) data,

employment data derived from social security data

at micro level. All other data sources (with information about other variables) are linked to the backbone, which again represents the main characteristics of the enterprise population in a statistical-DWH. The backbone is also used to check, clean and integrate all other data sources at micro level. During these steps, conflicting information between the data sources themselves and between the data sources and the backbone might be detected (in practice: will be detected). Conflicting information may in extremis lead to the conclusion that the backbone contains errors. Deliverable 2.8 of the ESSnet on Datawarehousing addresses the question of how this conclusion might be drawn, because this deliverable deals with the hierarchy between the different data sources.

Whatever the exact methodology for detection, errors in the backbone might have several origins. More specifically, they may be related to

errors in the data linking,

errors in the population characteristics (units, NACE-codes, size classes of enterprises),

errors in VAT and/or employment data.

Some errors may result in an erroneous estimation of the activity status and therefore of the number of active enterprises and possibly the level of the estimates. Other errors may reveal erroneous values, which may also lead to inconsistencies in level estimates. An example of an erroneous value is a (VAT) turnover in the backbone of the integrated sources that differs considerably from the observed turnover and other variables in a survey. It is expected that most errors in the population frame are detected because other data sources like surveys and administrative data indicate that the enterprise has either a different activity or a different size than recorded in the SBR.

If the backbone is of good quality, which is essential, its number of errors should be limited. Moreover, data cleaning and data integration at micro level are basically independent of the number of active enterprises, NACE-code, size class, etc. Therefore, it is proposed to use the ‘original’ population frame, which is part of the backbone, for these steps, even after errors in it have been detected. Another reason for this proposal is that errors might be detected at several stages of the data cleaning and integration process. Errors in the backbone might become influential when survey data are weighted with the integrated (micro) data of the backbone (number of enterprises, possibly supplemented with auxiliary information like turnover and employment). This becomes visible when calculating aggregates at the end of the processing phase. Therefore, it is recommended that all errors in the backbone be corrected before weighting and calculating aggregates. This corresponds with (the beginning of) GSBPM-step 5.7 (“weighting”).
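Why backbone errors become influential at the weighting step can be illustrated with a ratio estimator that uses backbone turnover as auxiliary variable. This is a generic textbook technique chosen for illustration, not necessarily the method described in deliverable 2.8; the function name and the data are hypothetical.

```python
def ratio_estimate(sample, backbone_turnover):
    """Estimate a population total of y using backbone turnover as auxiliary.

    sample: list of (unit_id, y) for the surveyed enterprises
    backbone_turnover: dict unit_id -> turnover for all active enterprises
    """
    sample_y = sum(y for _, y in sample)
    sample_x = sum(backbone_turnover[u] for u, _ in sample)
    population_x = sum(backbone_turnover.values())
    return population_x * sample_y / sample_x

backbone = {"A": 100.0, "B": 300.0, "C": 600.0}
survey = [("A", 10.0), ("B", 30.0)]   # e.g. investments reported in a survey

estimate = ratio_estimate(survey, backbone)

# An erroneous turnover for C in the backbone distorts the estimate,
# even though C itself was never surveyed.
backbone_with_error = dict(backbone, C=6000.0)
distorted = ratio_estimate(survey, backbone_with_error)
```

In this toy example the same survey data give 100.0 with the correct backbone and 640.0 with the erroneous one, which is why the correction should precede weighting.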

Note that in the case of errors due to data-linking the information used in the unit base should be corrected rather than the backbone in the statistical-DWH.

5.3 Panel surveys and correcting population characteristics

When integrating survey data with the backbone, errors in the backbone and implicitly in the SBR may be detected. This is especially true of surveys about produced goods, performed services and investments, where the information can be very useful in detecting errors in the NACE-code. However, carte blanche correction of NACE-codes etc. should be avoided since this could bias the backbone and the SBR. Bias arises because the surveyed parts of the SBR become of better quality than the parts that are not surveyed. To prevent this drawback, one should be very careful as to how panel surveys are used to correct information in the backbone and SBR. Influential errors in panel surveys, i.e. errors which significantly affect the estimates, should preferably be treated as outliers.

5.4 Timing of correcting data in the backbone and the SBR

The unit base, the SBR, VAT-data and employment data derived from registers have a crucial role in the linking and estimation process of the statistical-DWH. These data are also important for estimates of statistics possibly falling outside the scope of the statistical-DWH. Therefore, it is advisable that if the backbone is updated due to errors found after confrontation with other data like surveys, the SBR and unit base are updated too. Updating the backbone, the SBR and the related unit base together is desirable to ensure that

late information or later available input data are processed with the correct information about the enterprise population,

new surveys are designed with the correct enterprise population.

The disadvantage of correcting the backbone (and SBR) is that previously published estimates are revised when re-running the process with improved population information. More precisely, the previously published estimates were produced with an uncorrected population frame and the new estimates with a corrected population frame. This difference in population frame leads to revisions. If the influential error which led to the correction of the backbone is found when producing a specific estimate x, the revision of x is desired as it is an improvement. However, as the statistical-DWH is used for several outputs, previously published statistics which were apparently not affected by this error are also revised when re-running the process. To limit the disadvantage of unexpected revisions when revising the backbone, the following recommendations are made:

developing a good metadata system, i.e. recording which data belong to which estimate,

using the paradigm that the information in the backbone is correct unless proven otherwise. In other words, consider the backbone and SBR as authoritative sources which are corrected only if the detected errors are certain and influential,

relating the timing of incorporating changes in backbone to the revision policy of the most important statistical outputs.
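The first recommendation, a metadata system recording which data belong to which estimate, could look like the sketch below in its simplest form. The estimate names, the mapping and the function name are hypothetical; a real metadata system would hold far more detail.

```python
# Hypothetical metadata: which backbone units entered which published estimate.
estimate_inputs = {
    "STS-turnover-2012Q4": {"A", "B", "C"},
    "SBS-investments-2012": {"B", "D"},
    "regional-employment-2012": {"E"},
}

def affected_estimates(corrected_units, estimate_inputs):
    """Return the estimates that would be revised if these units are corrected."""
    corrected = set(corrected_units)
    return sorted(name for name, units in estimate_inputs.items()
                  if units & corrected)

to_review = affected_estimates({"B"}, estimate_inputs)
```

With such a lookup, the consequences of a proposed backbone correction for already published outputs can be listed before the correction is applied, which makes unexpected revisions less likely.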

5.5 Timing of feedback to the SBR

It has been argued in the previous chapter that proven and influential errors in the

population characteristics, turnover, employment of the backbone, statistical unit and (concept of) unit base

should be accompanied by corrections in the SBR, too. This is because the backbone is strongly related to the SBR and unit base. In paragraph 5.4 it was argued that the timing of these updates in the backbone of the statistical-DWH should correspond with the timing of the revisions in the most important estimates. However, the timing of these corrections in the SBR is even more complex. This is because the SBR primarily acts as a frame for survey sampling, including for surveys falling outside the scope of the statistical DWH.

The importance of the timing can be best illustrated with an example. If the SBR is used as sampling frame for an STS-survey of current year t and the SBR is ‘suddenly’ updated with information from the statistical-DWH from last year t-1, a sudden discontinuity in the STS-survey series occurs, which is misleading as far as timing is concerned. The question is whether this discontinuity is desirable. The same applies for surveys falling outside the scope of the statistical-DWH. Therefore, it is advisable to develop a strategy for correcting information in the SBR. A possible strategy is:

For errors with such an impact that they cannot be neglected: correcting the backbone and SBR at the same time (and as soon as possible). However, consultation with the stakeholders of the most important statistics outside the scope of the statistical-DWH is strongly recommended as these corrections may have impact on other statistics.

For less influential errors: corrections in the SBR are carried out at the end of the calendar year when all surveys are renewed or refreshed. In this case, preliminary estimates outside the statistical-DWH published within 12 months after the statistical year t are still based on an SBR including known errors. This is the price for continuity of STS-surveys and consistency with statistics falling outside the scope of the statistical-DWH. However, the final estimates published more than 12 months after statistical year t are based on an improved SBR, i.e. an SBR corrected for known errors.
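The possible strategy above amounts to a simple scheduling rule, sketched here. The function name is hypothetical, the end of the calendar year is modelled as 31 December as a simplifying assumption, and the recommended consultation with stakeholders is not modelled.

```python
from datetime import date

def sbr_correction_date(detected_on, influential):
    """Decide when a proven error is corrected in the SBR.

    influential=True  -> together with the backbone, as soon as possible
    influential=False -> at the end of the calendar year, when surveys are renewed
    """
    if influential:
        return detected_on
    return date(detected_on.year, 12, 31)

now = sbr_correction_date(date(2013, 5, 14), influential=True)
later = sbr_correction_date(date(2013, 5, 14), influential=False)
```

Deferring the less influential corrections keeps the sampling frame stable within the year, at the cost of preliminary estimates that still rest on known errors.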

6 Conclusions

Two conditions are required for a successful statistical DWH. Firstly, the population should be well defined. Secondly, one unit should be used in the statistical DWH, because it is, in practice, impossible to create integrated datasets for several (types of) enterprise units. The unit that should be used is the statistical enterprise. For the sake of efficiency the links between the statistical units and the units of other data sources need to be stored. For this storage the concept of a unit base was presented. Whether a unit base should be incorporated in the SBR or not depends on its complexity, i.e. how many different enterprise units exist in a country and the number of data sources used.

An integrated set of

population characteristics (activity, size etc.), the so-called population frame,

turnover derived from VAT,

employment derived from social security data

is desired for the statistical-DWH as these administrative data sources are available for almost all enterprises in the SBS/STS domain and therefore provide good information about the basic characteristics of the enterprises. All other data sources can be linked to this integrated set of population, turnover and employment data, which is therefore considered as the backbone of the statistical-DWH. This backbone can also be considered as the authoritative source of the statistical-DWH, as its information is considered correct unless proven otherwise in case of conflicting information from other data sources like surveys.

The SBR is an indirect source for the backbone of the statistical-DWH because

1) the population frame is derived from it (and depending on the scope the VAT and employment-data),

2) the unit base is strongly related to it

3) the surveys – another important data source for the statistical-DWH – are based on it.

Hence, when errors in the population are revealed after integrating different data sources, it is desired that these errors are corrected in the SBR, too. However, the timing of incorporating these corrections in the SBR (and the VAT and social security data) is extremely important due to multiple use of SBR-information in data sources within or beyond the scope of the statistical-DWH.

Finally there is an alternative approach. If the maintenance challenges are considered acceptable and there are no consistency problems then it is feasible that the entire SBR be incorporated within the statistical-DWH. If this alternative approach were to be followed it is still imperative that the principles outlined within this report be adhered to.

The choice of which path to take is entirely at the discretion of each individual country/NSI. It is more likely that this choice would be driven by cultural or legacy issues than by simple efficiencies.
