

PRACTICAL GUIDE TO DATA VALIDATION IN EUROSTAT

EUROSTAT
Statistical Office of the European Communities

2007 Edition

TABLE OF CONTENTS

1. Introduction
2. Data editing
   2.1 Literature review
   2.2 Main general procedures adopted in Member States
       2.2.1 Foreign Trade
       2.2.2 Industrial Output
       2.2.3 Commercial Companies Survey
       2.2.4 Employment Survey
       2.2.5 Private Sector Statistics on Earnings
       2.2.6 Survey of Business Owners
       2.2.7 Building Permits Survey
   2.3 Main general procedures adopted in Eurostat
       2.3.1 Harmonization of national data
       2.3.2 Corrections using data from the same Member State
       2.3.3 Corrections using data from other Member States
       2.3.4 Foreign Trade
       2.3.5 Transport Statistics
       2.3.6 Labour Force Survey
       2.3.7 Eurofarm
   2.4 Guidelines for data editing
       2.4.1 Stages of data editing
       2.4.2 Micro data
           2.4.2.1 Error detection
           2.4.2.2 Error correction
       2.4.3 Country data
           2.4.3.1 Error detection
           2.4.3.2 Error correction
       2.4.4 Aggregate (Eurostat) data
       2.4.5 Concluding remarks
3. Missing data and imputation
   3.1 Literature review
       3.1.1 Single imputation methods
           3.1.1.1 Explicit modelling
               3.1.1.1.1 Mean imputation
               3.1.1.1.2 Regression imputation
           3.1.1.2 Implicit modelling
               3.1.1.2.1 Hot deck imputation
               3.1.1.2.2 Substitution
               3.1.1.2.3 Cold deck imputation
               3.1.1.2.4 Composite methods
       3.1.2 Multiple imputation methods
   3.2 Main general procedures adopted in Member States
       3.2.1 Foreign Trade
       3.2.2 Industrial Output
       3.2.3 Commercial Companies Survey
       3.2.4 Employment Survey
       3.2.5 Annual Survey of Hours and Earnings
       3.2.6 Survey of Business Owners
       3.2.7 Building Permits Survey
       3.2.8 Housing Rents Survey
       3.2.9 Basic Monthly Survey
   3.3 Main general procedures adopted in Eurostat
       3.3.1 Community Innovation Survey
       3.3.2 Continuing Vocational Training Survey
       3.3.3 European Community Household Panel
   3.4 Guidelines for data imputation
       3.4.1 Stages of imputation of missing data
       3.4.2 Micro data
       3.4.3 Country data
       3.4.4 Aggregate (Eurostat) data
       3.4.5 Concluding remarks
4. Advanced validation
   4.1 Literature review
       4.1.1 Strategies for handling outliers
       4.1.2 Testing for discordancy
           4.1.2.1 Exploratory data analysis
           4.1.2.2 Statistical testing for outliers
               4.1.2.2.1 Single outlier tests
               4.1.2.2.2 Multiple outlier tests
           4.1.2.3 Multivariate data
       4.1.3 Methods of accommodation
           4.1.3.1 Estimation of location
           4.1.3.2 Estimation of dispersion
       4.1.4 Time series analysis
   4.2 Main general procedures adopted in Member States
       4.2.1 Foreign Trade
       4.2.2 Consumer Price Index
   4.3 Main general procedures adopted in Eurostat
       4.3.1 Community Innovation Survey
   4.4 Guidelines for advanced validation
       4.4.1 Stages of advanced validation
       4.4.2 Micro data
           4.4.2.1 Advanced detection of problems
           4.4.2.2 Error correction
       4.4.3 Country data
       4.4.4 Aggregate (Eurostat) data
       4.4.5 Concluding remarks
References


1. INTRODUCTION A main goal of any statistical organization is the dissemination of high-quality information and this is particularly true in Eurostat. Quality implies that the data available to users have the ability to satisfy their needs and requirements concerning statistical information and is defined in a multidimensional way involving six criteria: Relevance, Accuracy, Timeliness and punctuality, Accessibility and clarity, Comparability and Coherence. Broadly speaking, data validation may be defined as supporting all the other steps of the data production process in order to improve the quality of statistical information. In the “Handbook on improving quality by analysis of process variables” (LEG on Quality project by ONS – UK, Statistics Sweden, National Statistical Service of Greece, and INE – PT) it is described as “the method of detecting errors resulting from data collection”. In short, it is designed to check plausibility of the data and to correct possible errors and is one of the most complex operations in the life cycle of statistical data, including steps and procedures of two main categories: checks (or edits) and transformations (or imputations). Its three main components are the following: • Data editing – The application of checks that identify missing, invalid or inconsistent

entries or that point to data records that are potentially in error. • Missing data and imputation – Analysis of imputation and reweighting methods used to

correct for missing data caused by non-response. Non-response can be total, when there is no information on a given respondent (unit non-response), or partial, when only part of the information on the respondent is missing (item non-response). Imputation is a procedure used to estimate and replace missing or inconsistent (unusable) data items in order to provide a complete data set.

• Advanced validation – Advanced statistical methods can be used to improve data quality. Many of them are related to outlier detection since the conclusions and inferences obtained from a contaminated (by outliers) data set may be seriously biased.

Before Eurostat dissemination, data validation has to be performed at different stages depending on who is processing the data: • The first stage is at the end of the collection phase and concerns micro data. Member States

are responsible for it, since they conduct the surveys. • The second stage concerns country data, i.e., the micro-data country aggregates sent by

Member States to Eurostat. Validation has to be performed by the latter at this stage. • The third and last stage concerns aggregate (Eurostat) data before their dissemination and

it is also performed by Eurostat.

Validation should be performed according to a set of common and of specific rules, depending on the stage and on the data aggregation level. In this document, some general and common guidance is provided for each stage. More detailed rules and procedures can only be provided when looking at a specific survey, since each one has its own particular characteristics and problems. A thorough set of validation guidelines can therefore only be defined for a specific statistical project. Nevertheless, this document intends to discuss the most important issues that arise concerning validation of any statistical data set, describing its main problems and how to handle them. It lists as thoroughly as possible the different aspects that need to be analysed for error diagnosis and checking, the most adequate methods and procedures for that purpose and


finally possible ways to correct the errors found. It should be seen as an introduction to data validation and provides references to further reading for any statistician or staff of a statistical organization working on this matter. That is, as a general starting point for data validation, this document may be applied and adapted to any particular statistical project or data set and may also be used as a building block for specific handbooks defining a set of rules and procedures common to Member States and Eurostat. In short, the text of this document should be regarded as general guidelines for the approach to data validation and should be followed by subsequent rules and procedures specifically designed for each statistical project and shared by Member States and Eurostat, whose responsibilities also have to be clearly defined. In fact, the ultimate purpose should be the set-up of Current Best Methods (the description of the best methods available for a specific process) in validation for Member States and Eurostat, leading to efficiency gains and to an improvement in data quality as mentioned above. To this end, the introduction of new processes or of process changes, the adoption of new solutions and methods and the promotion of know-how and information exchange are sought. Therefore, the rules, procedures and methods should be discussed and recommendations provided that are not only based on strong statistical methodology but are also commonly used and widely tested in practice. The structure of this document is the following: the next sections discuss the three validation components mentioned above in that order, listing the main problems that may arise, providing some guidance for their detection and correction and indicating who should run validation at each stage. Some examples of validation procedures in surveys conducted in Member States, the USA and Canada are also provided. They are only a few illustrative examples of the main rules and procedures used.


2. DATA EDITING

Data validation checks – the data have to be checked for their correctness, consistency and completeness in terms of number and content, because several errors can arise in the collection process, such as:
• The failure to identify some population units, or the inclusion of units outside the scope (under- and over-coverage).
• Difficulties in defining and classifying statistical units.
• Differences in the interpretation of questions.
• Errors in recording or coding the data obtained.
• Other errors of collection, response, coverage, processing, and estimation for missing or

misreported data. The purpose of these checks is to ensure a higher level of data quality. It is also important to reduce the time required for the data editing process, and the following procedures can help: • Electronic data processing – Data should be checked and corrected as early as possible, ideally already when provided

by the respondents. Therefore, supplying data by electronic means should be encouraged (electronic questionnaires and electronic data interchange).

• Application of statistical methods – Faulty, incomplete, and missing data can be corrected by queries with the respondents but errors can also be corrected through the application of statistical models, largely keeping the data structure and still meeting the requirements in terms of accuracy and timeliness of the output.

• Continuous improvement of data editing procedures – For repeated statistics, data editing settings should be adjusted to meet changing requirements and knowledge from previous editing of statistical data should be taken into account to improve questionnaires and make data editing more efficient.

• Omitting the editing and/or correction of data when the change would have only a negligible impact on the estimates or aggregates.

2.1 Literature review

Although there is a large number of papers on data editing in the literature, the seminal paper by Fellegi and Holt (1976) is still the main reference; these authors introduced the normal form of edits as a systematic approach to automatic editing (and imputation) based on set theory. Following these authors, the logical edits for qualitative variables are based on combinations of code values in different fields that are not acceptable. Therefore, any edit can be broken down into a series of statements of the form "a specified combination of code values is not permissible". The subset of the code space such that any record in it fails an edit is called the normal form of edits. Any complex edit statement can be broken down into a series of edits, each having the normal form. Edit specifications contain essentially two types of statements:
• Simple validation edits, specifying the set of permissible code values for a given field in a

record, any other value being an error. This can be converted into the normal form very easily and automatically.

• More complex consistency edits, involving a finite set of codes. These are typically of the form that whenever a record has certain combinations of code values in some fields, it should have some other combinations of code values in some other fields. Then, the edit statement is that if a record does not respect this condition on the intersection of


combinations of code values, the record fails the edit. This statement can also be converted into the normal form.
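To make the normal form more concrete, the following minimal sketch (with hypothetical fields and code values, not taken from any actual survey) shows how edits expressed as non-permissible combinations of codes can be checked against records; a record that fails no edit is a "clean" record.

```python
# Minimal sketch of normal-form edits (hypothetical fields and codes).
# An edit in normal form lists, for each field it involves, the subset of
# code values that is not permissible; a record fails the edit when its value
# for every field named in the edit falls inside the corresponding subset.

EDITS = [
    # "Age group 0-14 combined with marital status married/widowed is not permissible."
    {"age_group": {"0-14"}, "marital_status": {"married", "widowed"}},
    # A simplified validation edit written in the same form: two codes that are
    # not in the (hypothetical) nomenclature for the means of transport.
    {"transport_mode": {"unknown", ""}},
]

def fails_edit(record: dict, edit: dict) -> bool:
    """True when the record's values fall inside the edit's failing subsets."""
    return all(record.get(field) in codes for field, codes in edit.items())

def failed_edits(record: dict) -> list:
    """Indices of the edits failed by the record (empty list = clean record)."""
    return [i for i, edit in enumerate(EDITS) if fails_edit(record, edit)]

clean = {"age_group": "30-44", "marital_status": "married", "transport_mode": "road"}
dirty = {"age_group": "0-14", "marital_status": "widowed", "transport_mode": "road"}
print(failed_edits(clean))  # [] -> clean record
print(failed_edits(dirty))  # [0] -> record needs correction
```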

Hence, whether the edits are given in a form defining edit failures explicitly or in a form describing conditions that must be satisfied, the edit can be converted into a series of edits in the normal form, each specifying conditions of edit failure. The normal form of edits is originally designed for qualitative variables, but it can be extended to quantitative variables even though, for the latter, this is not its natural form. The edits are expressed as equalities or inequalities and a record that does not respect them for all the quantitative variables fails the edit. A record which passes all the stated edits is said to be a "clean" record, not in need of any correction. Conversely, a record which fails any of the edits is in need of some corrections. The advantage of this methodology is that it eliminates the necessity for a separate set of specifications for data corrections. The need for corrections is automatically deduced from the edits themselves, which ensures that the corrections are always consistent with the edits. Another important aspect is that the corrections required for a record to satisfy all edits change the fewest possible data items (fields), so that the maximum amount of original data is kept unchanged, subject to the edit constraints. The methods and procedures described and discussed next, as well as the proposed guidelines on data editing and correction, fit into this model of normal form of edits, as will become clear.

2.2 Main general procedures adopted in Member States

Data editing procedures depend on the specific data they concern. Therefore, as illustrative examples, we describe some of the main procedures applied by national statistical institutes. Error detection usually implies contact with the respondents, leading to the correction of those errors.

2.2.1 Foreign Trade

• Some responses can only be accepted if they belong to a given list of categories

(nomenclatures). Therefore, the admissibility of the response is checked according to that list (for example, delivery conditions, transaction nature or means of transport can only be accepted if they assume a category of the corresponding list).

• The combination of the values of some variables has to respect a set of rules. Otherwise, the value of one or several of those variables is incorrect.

• Detection of large differences between the invoice and the statistical values for those respondents who have to provide both values.

• Detection of large differences between the values for the current period and historical data. • Detection of non-admissible invoice or statistical values, net weights, supplementary units or prices. The detection is based on the computation from historical data of "admissibility intervals" for these variables at a highly disaggregated level (a sketch of such intervals follows this list).

• Detection of large differences between the response and the values provided by other sources, e.g., VAT data.
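As an illustration of the admissibility intervals mentioned above, the sketch below derives an interval from historical data at a disaggregated level and flags a new unit value that falls outside it. The product codes, historical unit values and interquartile-range multiplier are all hypothetical choices made for illustration.

```python
# Sketch (hypothetical thresholds): derive "admissibility intervals" for unit
# values from historical data at a disaggregated level and flag new responses
# that fall outside them.
from statistics import quantiles

# Historical unit values (e.g. value / net weight) per product code, from past periods.
history = {
    "8703.21": [11.2, 12.0, 12.5, 11.8, 12.9, 13.1, 12.2, 11.9],
    "1001.99": [0.21, 0.19, 0.22, 0.20, 0.23, 0.18, 0.21, 0.22],
}

def admissibility_interval(values, spread=3.0):
    """Interval based on the quartiles of the historical values; the
    multiplier `spread` is an arbitrary choice for illustration."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - spread * iqr, q3 + spread * iqr

def flag_response(product, unit_value):
    low, high = admissibility_interval(history[product])
    return not (low <= unit_value <= high)

print(flag_response("8703.21", 12.4))   # False: plausible value
print(flag_response("8703.21", 120.0))  # True: query the respondent
```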

2.2.2 Industrial Output • Detection of large differences (quantities, values, prices, etc) between the response of the

current period t and the values in past periods (t-1) and (t-2). For infra-annual data, the differences between the response of the current period and the response of the same period


in the previous year are also checked. For example, for monthly data, the differences between the values in time t and (t-12) are checked, for quarterly data, the differences are between the values in time t and (t-4), etc.

• Detection of large differences between the response and those provided by similar respondents, namely those companies of the same industrial branch and/or in the same region and/or variables (quantities, values, prices, etc).

2.2.3 Commercial Companies Survey • Automatic checking of the main activity code. • Coherence of the companies' responses, mainly their balance sheets. Correction of small

errors is automatically carried out. • Coherence with the previous period is also checked. 2.2.4 Employment Survey • Error detection – The respondents are surveyed twice in the same period and the detection

of large differences between the two responses leads to the deletion of the first one, i.e., the second response is considered correct and the first is considered wrong.

• Error assessment – a global error measure may be computed from the comparison between the first and the second responses for every respondent. Therefore, for any given characteristic with k categories C1, C2,…,Ck, responses can be classified in the following table:

                            1st response
  2nd response    C1    C2    ...   Cj    ...   Ck
  C1              n11   n12   ...   n1j   ...   n1k
  C2              n21   n22   ...   n2j   ...   n2k
  ...             ...   ...         ...         ...
  Ci              ni1   ni2   ...   nij   ...   nik
  ...             ...   ...         ...         ...
  Ck              nk1   nk2   ...   nkj   ...   nkk

where nij represents the number of respondents classified in category Ci in the second response and in category Cj in the first response. If there are no errors in the n respondents correctly surveyed, only the elements in the main diagonal will not be zero. The global quality index is computed as

$$QI = \frac{\sum_{i} n_{ii}}{n} \times 100\%$$

If both responses agree for every respondent, we have QI = 100, and QI = 0 if they disagree for every respondent. This indicator is a global measure of the quality of the data in the entire survey, i.e., for every characteristic in the survey. It is also computed for every variable in the survey.
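The computation of QI is straightforward; the sketch below uses a small, hypothetical 3-category cross-tabulation of first and second responses.

```python
# Sketch: the global quality index QI from the response/re-response
# cross-tabulation, where n[i][j] counts units classified in category Ci at
# the second response and Cj at the first response.

def quality_index(n):
    total = sum(sum(row) for row in n)
    diagonal = sum(n[i][i] for i in range(len(n)))
    return 100.0 * diagonal / total

# Hypothetical 3-category example: most units are classified identically twice.
n = [
    [50,  2,  1],
    [ 3, 40,  2],
    [ 0,  1, 30],
]
print(round(quality_index(n), 1))  # 93.0 -> QI close to 100 means few incoherencies
```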

2.2.5 Private Sector Statistics on Earnings • Automated checking of different items concerning salaries and occupation, namely number

of employees, salary item averages and salary item average changes relative to the previous year.


• Every item (such as the basic monthly salary) is subject to specific checking routines in order to detect errors such as negative salaries, values under the minimum salary, low or high salaries or other benefits and low or high growth rates.

• Data are also examined at different levels of aggregation: total level, industry level and company level.

• If errors are found, data are analysed and corrected at the micro level. • Minimum and maximum values for each salary item are checked (and corrected if wrong).

2.2.6 Survey of Business Owners • Data errors are detected and corrected through an automated data edit designed to review

the data for reasonableness and consistency. • Quality control techniques are used to verify that operating procedures were carried out as specified. 2.2.7 Building Permits Survey • Most reporting and data entry errors are corrected through computerized input and

complex data review procedures. • Strict quality control procedures are applied to ensure that collection, coding and data

processing are as accurate as possible. • Checks are also performed on totals and on the magnitude of the data. • Comparisons to assess the quality and consistency of the data series – The data and trends

from the survey are periodically compared with data on housing starts from other sources, with other public and private surveys data for the non-residential sector and with data published by some municipalities on the number of building permits issued.

2.3 Main general procedures adopted in Eurostat

Eurostat checks the internal and external consistency of each data set received from Member States (country data). The main checks and corrections concerning several statistical projects, made by Eurostat after discussion with the Member State involved, are as follows:
• Ex post harmonization of national data to EU norms.
• Data format checking.
• Classification of data according to the appropriate nomenclature.
• Rules on relationships between variables (consistency).
• Non-negativity constraints for mirror flow statistics.
• Plausibility checks of the data.
• Balance checks, such as the difference between credits and debits.
• Aggregation of items and general consistency when breaking down information (e.g.

geographical, activity breakdowns). • Time evolution checking.


More precisely, different kinds of corrections can be envisaged.

2.3.1 Harmonization of national data

It is necessary to ensure the comparability and consistency of national data. Statistical tables for each Member State can then be compiled and published based on the common Eurostat classification. To this end, Eurostat checks that the instructions to fill in the questionnaire have been followed by the reporting countries. When relevant differences relative to the definitions are detected, Eurostat reallocates national statistics according to the common classification. This involves the following verifications:
• On the country and economic zone, to ensure that the contents of each country and

economic zone have been filled in the same way. • On the economic activity, to check if all the items (sub-items) have been aggregated in the

same way by Member States.

2.3.2 Corrections (deterministic imputation) using data from the same Member State • Corrections with direct data

– Correction of a variable using the difference between two others such as the net flows with credit and debit flows or flows for an individual item with flows of two other aggregated items.

– Correction of a variable using the sum of other variables such as flows for an aggregated item with individual given items.

– Correction of a variable using others such as flows for an aggregated partner zone with flows of other(s) partner zone(s).

– Correction of a variable by computing net amounts such as the flows of Insurance services with the available gross flows, i.e. by deducting from Gross flows, Gross claims received and Gross claims paid.

• Corrections with weighted structure (imputation based on average proportions)

– Correction of flows for a given partner zone and a given year using an average proportion involving another partner zone and other years.

– Correction of flows for a given item and a given year using an average proportion involving another item and other years.

– Correction of flows for a given item and a given partner zone using an average proportion involving another item and another partner zone.

– Correction of flows for a given item using a proportion involving two other items.

2.3.3 Corrections (deterministic imputation) using data from other Member States • Corrections with direct data

– Correction of flows for partner zone intra-EU using available bilateral flows of main EU partners.

• Corrections with weighted structure (imputation based on average proportions)

– Correction of flows for a given item and a given year using an average proportion involving a mixed item, other EU Member States and several years.

– Correction of flows for partner zone extra-EU using an average proportion involving partner(s) “intra-EU”, partner(s) “(intra-EU + extra-EU)” and other EU Member States.

– Correction of flows for a given partner zone and a given year using an average proportion involving another partner zone, other EU Member States and another year.
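The correction rules in sections 2.3.2 and 2.3.3 are essentially accounting identities and ratio-type ("average proportion") estimates. The following sketch, with hypothetical figures, illustrates one rule of each kind; it shows the general idea only, not any specific Eurostat production routine.

```python
# Sketch (hypothetical figures): two of the deterministic correction rules.

# 1) Correction with direct data: a missing or implausible net flow is
#    recomputed as the difference between the reported credit and debit flows.
def net_from_gross(credit, debit):
    return credit - debit

# 2) Correction with weighted structure: a missing flow for item A in year t is
#    estimated by applying to item B the average A/B proportion observed in
#    other years (a ratio-type estimate).
def impute_from_proportion(item_a_history, item_b_history, item_b_current):
    ratios = [a / b for a, b in zip(item_a_history, item_b_history) if b]
    avg_ratio = sum(ratios) / len(ratios)
    return avg_ratio * item_b_current

print(net_from_gross(credit=120.0, debit=95.0))                    # 25.0
print(impute_from_proportion([10, 12, 11], [100, 115, 108], 120))  # about 12.2
```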


We next present examples of surveys where validation is performed by Eurostat. 2.3.4 Foreign Trade The data sets received by Eurostat are checked according to a set of rules, largely the same as those applied by Member States, such as in the following examples. • Checking for invalid nomenclature codes, i.e., some variables have to assume values from a

given list (nomenclature). • Checking for invalid combinations of values in different variables. • Detection of non-admissible values, i.e., checking if a variable is within a certain interval

range. 2.3.5 Transport Statistics Transport Statistics are available for Maritime, Air, Road and Rail transport modes. Some of the main checks are the following. • Checking the variables’ attributes such as data format, length, and type or nomenclature

codes. • Detection of non-admissible values. • Checking for invalid combinations and relationships of values in different variables. 2.3.6 Labour Force Survey The Labour Force Unit collects data for employment in the Member States. The main checks are as follows. • Checking the variables’ attributes such as data format, length, type or nomenclature codes. • Comparison of variables to detect eventual inconsistencies. 2.3.7 Eurofarm Eurofarm is a system aiming at processing and storing statistics on the structure of agricultural holdings that are derived from surveys carried out by Member States. Its main checks are the following. • Checking the variables’ attributes such as data format, length, and type or nomenclature

codes. • Checking for non-response. • Detection of non-admissible values. • Comparison of variables to detect eventual inconsistencies. 2.4 Guidance on data editing 2.4.1 Stages of data editing Before dissemination, data checking and editing may have to be performed at the three different validation stages mentioned in the introduction, depending on who is processing the data and the phase of the production process. • The first stage for error checking and correcting is the collection stage and concerns micro

data. In general, Member States (MS) are responsible for it, since they conduct the surveys, even when Eurostat receives this type of data.

• The second stage concerns country data, i.e., the micro-data country aggregates sent by Member States to Eurostat. Data checking at this stage has to be made by


Eurostat (assuming the data source has already performed thorough verification) and, if errors are detected, the data set could be sent back to the country involved for correction. If sending it back is not possible, Eurostat has to make the necessary adjustments and estimations.

• The third stage concerns aggregated (Eurostat) data before their dissemination; a last check has to be run by Eurostat, since some inconsistencies or errors in the data may only be found at this stage. This requires further corrections by Eurostat.

Since data editing and correction depend on the specific data, we propose several procedures that can be generally applied at each stage. The actual application should choose the appropriate procedures.

2.4.2 Micro data

Validation checks on micro data should be run by Member States, i.e., when they send their data sets to Eurostat, these sets should already have been scrutinized and be error-free. This also applies to those situations where Eurostat receives the micro data, because Member States conduct the surveys and are therefore closer to the respondents and can detect and correct errors more efficiently. In fact, as will be discussed later, error correction very often requires new contacts with the respondents, which can be handled much more quickly and better by national statistical agencies. As mentioned above, it is important that the time required for the data checking and editing process is reduced; to this end, automated data processing, application of statistical methods and continuous improvement of data editing procedures should be pursued.

2.4.2.1 Error detection

Since checking and editing depend on the specific data concerned, we next propose some procedures that can be generally applied and adapted to any particular survey:

1. Checking of the data format – the data must have a predefined format (data attributes, number of fields or records, etc.). Examples: foreign trade, industrial statistics or the employment survey.

2. Checking of the data sender, particularly for electronic submission. Example: foreign trade statistics (Intrastat).

3. Checking for non-responses – in many surveys, several respondents are known, especially the largest or most important ones. If their responses are not received, it usually means that they failed to respond and this may have a significant impact on the final data. Thus, checking for missing responses is very important. Examples: in foreign trade, industrial statistics or the building permits survey, the most important respondents (companies in the former two cases and municipalities in the latter) are perfectly known by the national statistical organizations and if they fail to send their information, the impact on the final data may be very strong.

4. Detection of non-admissible responses
   – Checking of the response category of qualitative variables, since responses on this type of variables have to assume a category of a given list (nomenclatures). Therefore, only responses belonging to that list can be accepted. Examples: delivery conditions, transaction nature, means of transport, gender, occupation, main activity sector such as industrial branch.
   – Quantitative variables whose values cannot be outside a given range. Examples: salaries, income, sales, output, exports, imports, prices, weights, numbers, age, etc., have to be positive.
   – Quantitative variables whose values have to be within a given interval. These "admissibility intervals" have to be computed from historical data at a highly disaggregated level. Examples: unit values or prices, unit weights, height of a building, age of a person, number of hours worked, income, etc., have to be inside a given interval of admissible values; salaries cannot be lower than the minimum salary, etc.

5. Detection of large differences between the invoice and the statistical values for those respondents who have to provide both values, or between the response and VAT data. Examples: foreign trade, industrial statistics.

6. Detection of large differences between current and past values (growth rates). In particular, the value at time t (current value) should be compared with the values at times (t-1) and (t-2), for example, and, for infra-annual data, with the corresponding period of the previous year, i.e., time (t-12) for monthly data, (t-4) for quarterly data and so forth. A sketch of such a check is given after this list.

7. Detection of large differences between the response and those provided by similar respondents. Examples: companies of the same industrial branch and/or in the same region and/or variables (quantities, values, prices, etc).

8. Detection of incoherencies in the responses from the same respondent and error assessment, since there are usually relationships and restrictions among several variables. Examples: exports or imports and total output of the same company (these variables have to be coherent); coherence in a company's balance sheets; age and marital status (for instance, a two-year-old person who is a widow). When the respondents are surveyed more than once, coherence between the responses has to be checked (usually, this is the purpose of surveying the same respondent more than once). Large differences between the two responses require corrections and a global error measure such as the QI statistic in the Employment Survey mentioned above can be computed (error assessment). Low values of this indicator mean significant incoherencies requiring error correction.

9. Outlier detection – items 4 to 7 above are related to outlier detection, which will be discussed in section 4.
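The sketch announced in item 6 follows. The 50% tolerance and the monthly figures are hypothetical; in practice the thresholds would be set per variable and per aggregation level.

```python
# Sketch (hypothetical threshold): flag a monthly response whose growth rate
# against the previous period t-1 or against the same month of the previous
# year t-12 exceeds a tolerance.

def growth(current, previous):
    return (current - previous) / previous if previous else float("inf")

def flag_large_difference(series, tolerance=0.5):
    """`series` maps a period index t to the reported value; returns the
    checks that failed for the latest period."""
    t = max(series)
    problems = []
    for lag, label in ((1, "t-1"), (12, "t-12")):
        if t - lag in series and abs(growth(series[t], series[t - lag])) > tolerance:
            problems.append(label)
    return problems

reported = {1: 100, 2: 104, 12: 110, 13: 320}   # month 13 jumps sharply
print(flag_large_difference(reported))           # ['t-1', 't-12']
```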

The number and variety of data editing and checking procedures are very large, since they depend on the specific data and country, thus requiring that the general procedures described above be adapted. Some categories, reference (admissible) values or intervals, however, are common to the different countries.

2.4.2.2 Error correction

When errors are detected in the micro data, they have to be corrected, which should be done by Member States, even in those cases where Eurostat receives these disaggregated data. Like error detection, correction procedures depend on the particular data and disaggregation level. Therefore, we discuss the main procedures that can be generally adopted:
• Generation of the list of errors as a starting point for the correction process. The errors may have attributes such as severity and size of impact. A score function (see Latouche) can be used to assign their importance.

• Correction of the coding, classification or typing errors and other data attributes such as the format.

• Correction of those variables whose values can be obtained from other variables of the same respondent. Example: unit prices can be computed from the total value and the corresponding quantity.


• Contact with the respondents – most of the errors have to be solved through contact with the respondents. Moreover, the values questioned are often correct and end up being confirmed, which can only be done by the respondents themselves, requiring such contact.

• Imputation of missing or erroneous data – in case the contact is not possible, too expensive or its outcome is not received on time, the values requiring correction have to be discarded from the data base, thus originating non-responses. These values in question will have to be imputed with methods as discussed in section 3.

These last two procedures are the main reasons why Member States should be in charge of validation of micro data, i.e., they should run validation at this stage even when Eurostat receives these data. In fact, if validation were performed by Eurostat, it would have to return the error list to the country involved for correction, which is an important loss of efficiency and may jeopardize the deadlines for dissemination. Therefore, it is very important that validation is run by Member States at this stage. Note that the editing and imputation procedures should be as uniform (identical) as possible among all data sources.

2.4.3 Country data

The country data received by Eurostat should already be validated at the micro level by the national statistical organizations. Nevertheless, some errors or problems can only be detected when data from the different countries are combined, compared or analysed, such as bilateral flows in foreign trade. When these errors are detected and the problem is significant, the correction should be made by Eurostat, consulting the country involved whenever possible.

2.4.3.1 Error detection

As for micro data, checking and editing depend on the specific data concerned and therefore we propose some general procedures for error detection by Eurostat that can be applied and adapted to any particular survey:

1. Checking the data format – the data must have a predefined format.

2. Checking for incomplete data – checking whether the data are complete or there are missing data. The more extreme situation is when a Member State does not send its data set at all. Other examples of partially missing data are when the country total is received, but not the regional breakdown or, in foreign trade, when the country total is received but not some or all of the bilateral flows.

3. Checking the classification of variables – this classification has to follow the appropriate nomenclatures.

4. Detection of different definitions in national statistics – common definitions and classifications in national statistics. The data sets supplied by Member States can only be compiled and published by Eurostat if they are based on the same classification (or on classifications that map 1:1 or n:1 to it), in order to ensure the comparability and consistency of national data. If divergences are found, they have to be corrected.

5. Changes in the definitions and classifications used – when the definitions and classifications adopted are changed (such as concepts, methodologies, surveyed population, data processing), the data will show the differences.

6. Detection of non-admissible values – the value of some variables has to be within a given range. For example, age, salaries, foreign trade flows, output or price indices cannot be negative; indices may show values that make no sense, such as decimal values or values in the order of tens of thousands (with base 100).

7. Detection of incoherencies among variables – there are often relationships and restrictions among variables that have to be satisfied. When they are not, the incoherence found has to be corrected. A very simple example, among many others, is that the balance has to equal the difference between credits and debits.

8. Detection of large differences between the country's current and past values (growth rates) – the current value should be compared with the previous values and, for infra-annual data, with the corresponding period of the previous year. The occurrence of such differences is usually caused by errors.

9. Search for breaks in the series, i.e., large jumps or differences in the data from one period to the next – these differences are probably caused by an error or by a change in the definitions and classifications adopted.

10. Large changes in the series length – if the number of observations in a data series supplied by a Member State suffers an important change, the reason for this difference has to be checked, because it may be caused by an error, by changes in data processing, by retropolation of the series, etc.

11. Aggregated items correspond to the sum of sub-items – when the country provides the breakdown of given data, the total has to equal the sum of the parts. Similarly, when a country provides different breakdowns of the same data, such as companies' turnover by region and by activity, the total of the two breakdowns has to be the same. A sketch of this check is given after this list.

12. Cross checking with other sources – the data from a given country should be checked for coherence with other data from the same country or with data from another country. If differences are found, they have to be investigated and corrected. For example, industrial and foreign trade statistics from the same country; in foreign trade statistics, the bilateral flows reported by a country should be checked against the corresponding bilateral flows reported by its partners (mirror statistics).
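The sketch announced in item 11 follows, with hypothetical turnover figures; it checks that a reported total matches the sum of a breakdown and that two breakdowns of the same total agree.

```python
# Sketch: check that an aggregate equals the sum of its breakdown and that two
# different breakdowns of the same total agree (hypothetical figures, small
# tolerance to absorb rounding).

def check_breakdown(total, parts, tol=1e-6):
    return abs(total - sum(parts.values())) <= tol

turnover_total = 1000.0
by_region   = {"North": 400.0, "South": 350.0, "Islands": 250.0}
by_activity = {"Manufacturing": 600.0, "Services": 390.0}   # inconsistent breakdown

print(check_breakdown(turnover_total, by_region))    # True
print(check_breakdown(turnover_total, by_activity))  # False -> investigate and correct
```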

2.4.3.2 Error correction

When errors are detected in country data, they have to be corrected by Eurostat, possibly after discussion with the national statistical organization involved. Like error detection, correction procedures depend on the particular data; consequently, we discuss the main procedures that can be generally adopted, taking into account the corrections performed at Eurostat described above (section 2.3).

• Harmonization of national data – if significant discrepancies arise in the statistics of a given country because of relevant differences relative to the definitions (concepts and classifications), Eurostat has to check whether the instructions to fill in the questionnaire have been followed by the reporting countries and ask the country to recompute the national statistics according to the common definition or classification. This involves the following steps: – On the country and economic zone, to ensure that the contents of each country and

economic zone have been filled in the same way. – On the economic activity, to check if all the items (sub-items) have been aggregated in

the same way by Member States. Moreover, the statistical agency of the country involved has to correct the problem in the future, i.e., it has to stop using its own definitions and classifications and start using those set up by Eurostat.

• Correction of the data format and variable classification – this may require a considerable programming and computational effort for large data sets, thus being time consuming. If


classification of the data is wrong, they have to be regrouped based on the correspondence between the two classifications (nomenclatures) used.

• Changes in the definitions and classifications used – a warning has to be issued about those changes and when they occurred. Retropolation of the series should be computed based on the new definitions, if possible.

• Imputation of incomplete data – when part or the whole data set is missing, Eurostat should first try to make the Member State send the missing data. If this is not possible in a timely manner, it is equivalent to a non-response (total or partial) and Eurostat has to impute it (with the methods discussed in section 3). The solution of flagging it as “Non-available” is inadequate and should be avoided.

• Correction of non-admissible values – when this type of errors occurs, it may be possible to determine the correct values by using other variables in the same or in other data sets. If this is not possible, the non-admissible values have to be imputed with the methods discussed in section 3.

• Correction of large differences relative to the country's past values – Eurostat should first ask the Member State involved to correct or confirm the values leading to such differences in a timely manner. However, if this is not possible, the correction has to be made by Eurostat. This issue is related to outlier detection and correction discussed in section 4. Nevertheless, some errors can be corrected (by Eurostat) with methods like the following that are very straightforward and easy to apply. – Corrections using data from the same Member State. Examples: correction of a variable

using the difference of two others (such as net flows with the positive and negative flows), or the sum of others (such as flows of an aggregated item with the individual items); correction of a variable using the net amounts, i.e., by comparing the available net amounts with the result of computing those amounts from the difference of the variables involved; likewise for sums; correction of a variable using others (such as flows for an aggregated partner zone with flows of other partner zones).

– Corrections using data from other Member States. Examples: correction of intra-EU flows of a Member State using the available bilateral flows of its partners; correction of extra-EU flows of a Member State using published data from other sources (such as OECD, IMF or UN) with extra-EU bilateral flows to or from that Member State.

Note also that these simple procedures can also be applied to the correction of the previous two items, namely the imputation of missing data and the correction of non-admissible values, which is very straightforward.

• Correction of incoherencies among variables – when incoherencies are found, they have to be corrected. It is sometimes possible to correct them by using other variables from the same country, such as computing the balance from the difference between credits and debits, or the first set of examples in the previous item. In other situations, data from another country has to be used, such as the second set of examples in the previous item. When such corrections are not possible, Eurostat has to impute the values of the incoherent variable(s) by using the methods of section 3.

• Series breaks – if they are caused by error, it has to be corrected. If they are caused by other factors, such as changes in the definitions or classifications used, these changes have to be flagged or the data have to be recomputed with the previous parameters. If this is not possible on time, it is preferable to impute the values after the break(s) with the methods of section 3 and correct them later.

• Series length – if it changes because of error, Eurostat should return to the old series. Otherwise, the change should be flagged.


• Correction of incoherencies in the aggregation of data – if aggregated variables do not correspond to the sum of their parts in the breakdown, the former, i.e., the aggregate has to be corrected. If two different breakdowns of the same data do not have the same total, the parts of each breakdown have to be checked and corrected and it is possible that some of them have to be imputed (section 3).

• Correction of incoherencies with other sources – if differences between alternative sources are found for the data of a given country, they have to be corrected by using the most reliable source. Sometimes, the highest value is chosen from the alternatives. For example, in foreign trade statistics, when the bilateral flows between two Member States do not agree, the highest value should be used and the appropriate corrections made to the total flows of the partner that had the smallest value (mirror statistics).
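As an illustration of the mirror-statistics rule just described, the sketch below (hypothetical bilateral flows) keeps the highest of the two declared values and records which partner's totals need to be adjusted.

```python
# Sketch (hypothetical flows): reconcile mirror statistics by keeping the
# highest of the two declared values for each bilateral flow, as suggested
# above, and recording which partner's total flows need to be revised.

def reconcile_mirror(declared_by_a, declared_by_b):
    """declared_by_*: dicts mapping a partner pair to the flow declared by each side."""
    reconciled, adjustments = {}, []
    for pair in declared_by_a.keys() & declared_by_b.keys():
        a, b = declared_by_a[pair], declared_by_b[pair]
        reconciled[pair] = max(a, b)
        if a != b:
            adjustments.append((pair, "A" if a < b else "B"))
    return reconciled, adjustments

exports_a = {("A", "B"): 150.0}      # country A's declared exports to B
imports_b = {("A", "B"): 162.0}      # country B's declared imports from A
print(reconcile_mirror(exports_a, imports_b))
# ({('A', 'B'): 162.0}, [(('A', 'B'), 'A')]) -> A's totals must be revised upwards
```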

2.4.4 Aggregate (Eurostat) data

The data sets received from the Member States have to be scrutinized and error-free before their dissemination, and the two previous editing and correction stages should be sufficient for this purpose. However, some problems or inconsistencies may become apparent only when aggregate (Eurostat) data are computed, such as growth rates, European aggregates, or bilateral flows with other geographical zones or economic entities. Moreover, the aggregate values computed for different geographical zones have to correspond to the aggregation (sum) of the countries involved. Another issue, particularly important for dissemination purposes, is that the figures published by Eurostat have to be compatible with national statistics. When such problems are detected, their cause has to be identified and corrected at the country level, since simply discarding the data received from a country (or several countries) is not an adequate solution because it provides no information on that (those) country(ies) and prevents the computation of Eurostat aggregates. Consequently, that solution should not be considered as an option and we are back to the previous stage of editing and correction, which means that the same methods described above for country data apply here. This is the final stage where these methods can be applied and, if no correction is possible in time for dissemination, imputation should be performed (section 3). It is preferable to use imputed data (assuming that the imputation method used is appropriate) rather than a wrong value or no value at all. After this final stage of corrections is complete, the data are ready for dissemination.

2.4.5 Concluding remarks

Error detection and correction in Eurostat statistical data may be performed at each of the three stages of the production process: at the micro (collection) level, at the country level and at the aggregate (Eurostat) level. Moreover, it should be performed at the earliest stage possible. In the ideal situation, each stage should seek the complete detection and correction of any errors, leaving as few problems as possible to be solved later, because the earlier the detection, the more accurate the correction can be. This will simplify the task of the following stages, achieving a higher quality and speeding up the process of data production and dissemination. Member States are responsible for the first stage and Eurostat for the other two. Nevertheless, the latter should play an important role in the coordination and harmonization of editing procedures by the former. The checks and corrections applied depend on the stage and on the data set under scrutiny. The more efficient the detection and correction procedures are, the higher the quality of the data and the better the inferences drawn from them will be. Quality assessment can be made by comparing the corrected values with the corresponding revised data that will be obtained later. To this end, accuracy measures such as the mean squared error or the QI statistic in the employment


survey may be calculated. It is also important to keep a record of the errors detected, their sources and the corrections required in order to avoid the former and to improve and speed up the latter in the future.


3. MISSING DATA AND IMPUTATION

Missing data caused by non-response are a source of error in any data set and require correction. To this end, imputation methods can be used to fill those gaps and provide a complete data set. Non-response errors are often the major sources of error in surveys and they can lead to serious problems in statistical analysis. It is usual to distinguish missing data caused by unit non-response (total non-response) from missing data caused by item non-response (partial non-response). The former is usually dealt with by reweighting, whereas the latter is usually corrected by imputation.

3.1 Literature review

The literature on imputation of missing data is vast and covering it thoroughly is far beyond the scope of this document. Nevertheless, we briefly discuss the main and most commonly used methods, including those used in Eurostat and in Member States. The main references in this field are Lehtonen and Pahkinen (2004), Little and Rubin (2002), which we will follow closely, and Rubin (2004). Moreover, time series models can also be used; they are in fact a valid, useful and easy-to-implement approach to this problem. However, since they form another class of methods and a different perspective, totally based on historical data, we will not consider them here. There are two main classes of methods: single imputation methods, where one value is imputed for each missing item, and multiple imputation methods, where more than one value is imputed to allow the assessment of imputation uncertainty. Each method has advantages and disadvantages, but discussing them is beyond the scope of this document. Such a discussion can be found in the references mentioned above or in Eurostat Internal Document (2000). We start by describing the former methods.

3.1.1 Single imputation methods

There are two generic approaches to single imputation of missing data based on the observed values, explicit and implicit modelling; they are briefly described next.

3.1.1.1 Explicit modelling

Imputation is based on a formal statistical model and hence the assumptions are explicit. The methods included here are discussed next.

3.1.1.1.1 Mean imputation

• Unconditional mean imputation – The missing values are replaced (estimated) by the mean

• Conditional mean imputation – Respondents and non-respondents are first classified into classes (strata) based on the observed variables, and the missing values are replaced by the mean of the respondents in the same class.

In order to avoid the effect of outliers, the median may be used instead of the mean. For categorical data, the mode is used for the imputation. A minimal sketch of these mean-type imputations follows.
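The following Python sketch illustrates conditional mean imputation as described above; the records, stratum labels and values are hypothetical, and the same function accepts the median (for outlier-prone data) or the mode (for categorical data) as the centring statistic.

```python
from statistics import mean, median, mode

# Hypothetical records: (stratum, value); None marks an item non-response.
records = [("A", 10.0), ("A", 12.0), ("A", None), ("B", 30.0), ("B", None), ("B", 34.0)]

def conditional_impute(data, center=mean):
    """Replace None by the chosen centre (mean/median/mode) of the respondents of its stratum."""
    by_stratum = {}
    for stratum, value in data:
        if value is not None:
            by_stratum.setdefault(stratum, []).append(value)
    return [(s, v if v is not None else center(by_stratum[s])) for s, v in data]

print(conditional_impute(records))                  # stratum means
print(conditional_impute(records, center=median))   # more robust to outliers
# For categorical data one would pass center=mode on the coded categories.
```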


3.1.1.1.2 Regression imputation
• Deterministic regression imputation – This method replaces the missing values by predicted values from a regression of the missing item on items observed for the unit. Consider $X_1, \ldots, X_{k-1}$ fully observed and $X_k$ observed for the first r observations and missing for the last n − r observations. Regression imputation computes the regression of $X_k$ on $X_1, \ldots, X_{k-1}$ based on the r complete cases and then fills in the missing values as predictions from the regression. Suppose case i has $X_{ik}$ missing and $X_{i1}, \ldots, X_{i,k-1}$ observed. The missing value is imputed using the fitted regression equation

$$\hat{X}_{ik} = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_{k-1} X_{i,k-1} \qquad (3.1)$$

where $\hat{\beta}_0$ is the intercept (which may be zero, leading to a regression through the origin) and $\hat{\beta}_1, \ldots, \hat{\beta}_{k-1}$ are respectively the regression coefficients of $X_1, \ldots, X_{k-1}$ in the regression of $X_k$ on $X_1, \ldots, X_{k-1}$ based on the r complete cases (estimated parameters or predicted values of a variable are denoted by a ^). Note that if the observed variables are dummies for a categorical variable, the predictions from regression (3.1) are respondent means within classes defined by that variable, and this method reduces to conditional mean imputation. The above regression equation has no residual (stochastic) term and therefore this method is called deterministic regression imputation.
• Stochastic regression imputation – This is a similar approach to the previous one, but a random residual is added to the right-hand side of the regression equation. Consequently, instead of imputing the mean (3.1), we impute a draw:

$$\hat{X}_{ik} = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_{k-1} X_{i,k-1} + U_{ik} \qquad (3.2)$$

where $U_{ik}$ is a normally distributed random residual with mean zero and variance $\hat{\sigma}^2$, the residual variance from the regression of $X_k$ on $X_1, \ldots, X_{k-1}$ based on the r complete cases. The addition of the random normal variable makes the imputation a draw from the predictive distribution of the missing values, rather than the mean. If the observed variables are dummies for a categorical variable, the predictions from regression (3.2) are conditional draws (instead of conditional means as in regression 3.1). A small sketch of both variants follows this list.
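As a rough illustration of equations (3.1) and (3.2) with a single regressor, the sketch below (hypothetical data) fits an ordinary least squares line on the r complete cases and imputes either the fitted value or a draw with a normal residual whose variance is the estimated residual variance.

```python
import random
from statistics import mean

# Hypothetical complete cases (x, y) and a case with y missing.
complete = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
x_missing = 3.5

# Ordinary least squares with a single regressor, fitted on the r complete cases.
xs, ys = zip(*complete)
xbar, ybar = mean(xs), mean(ys)
b1 = sum((x - xbar) * (y - ybar) for x, y in complete) / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

# Residual variance (n - 2 degrees of freedom: intercept + slope).
residuals = [y - (b0 + b1 * x) for x, y in complete]
sigma2 = sum(e ** 2 for e in residuals) / (len(complete) - 2)

deterministic = b0 + b1 * x_missing                            # equation (3.1)
stochastic = deterministic + random.gauss(0.0, sigma2 ** 0.5)  # equation (3.2)
print(f"deterministic imputation: {deterministic:.2f}")
print(f"stochastic imputation:    {stochastic:.2f}")
```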

3.1.1.2 Implicit modelling
The focus is on an algorithm, which implies an underlying model. Assumptions are implicit, but it is necessary to check whether they are reasonable.

3.1.1.2.1 Hot deck imputation
This is a common method in survey practice. Missing data are replaced by values drawn from similar respondents called “donors”, and there are several donor sampling schemes. Suppose that a sample of n out of N units is selected and r out of the n sampled values of a variable X are recorded. The mean of X may then be estimated as the mean of the responding and the imputed units:

$$\bar{X}_{HD} = \frac{r\,\bar{X}_R + (n-r)\,\bar{X}_{NR}}{n} \qquad (3.3)$$

where $\bar{X}_R$ is the mean of the respondent units and

$$\bar{X}_{NR} = \frac{1}{n-r}\sum_{i=1}^{r} H_i X_i \qquad (3.4)$$

where $H_i$ is the number of times $X_i$ is used as a substitute for a missing value of X, with $\sum_{i=1}^{r} H_i = n-r$, the number of missing units. The imputation procedure brings in a new source

of uncertainty in the estimation and consequently an increase in the variance of the estimator of the mean. The advantage of the hot-deck method is that imputed values do not distort the distribution of the sampled values of X the way mean imputation does. We next describe some of the commonly used donor sampling schemes. • Hot deck by simple random sampling with replacement – The numbers Hi are obtained by

random sampling with replacement from the recorded values of X. The increase caused by imputation in the variance of the estimator of the mean is not negligible.

• Hot deck within adjustment classes – Adjustment classes (strata) may be formed and missing values within each class (stratum) are replaced by recorded values sampled at random in the same class. That is, the missing items for the non-respondent are replaced by the respondents’ values belonging to the same class.

• Nearest neighbour hot deck – This method is based on the definition of a metric to measure the distance between units, based on the values of covariates. To impute a missing value, we choose the donor that is closest to the unit with that missing value (the “nearest neighbour”). For example, let $Y_i = (Y_{i1}, \ldots, Y_{iK})^T$ be the values of K appropriately scaled covariates for a unit i for which $X_i$ is missing. If these variables are used to form adjustment classes, the metric

$$d(i,j) = \begin{cases} 0 & \text{if } i \text{ and } j \text{ are in the same class} \\ 1 & \text{if } i \text{ and } j \text{ are in different classes} \end{cases}$$

leads to the previous method. Other possible metrics are the maximum deviation, $d(i,j) = \max_k |Y_{ik} - Y_{jk}|$, or the Mahalanobis metric, $d(i,j) = (Y_i - Y_j)^T S_{yy}^{-1} (Y_i - Y_j)$, where $S_{yy}$ is an estimate of the covariance matrix of $Y_i$. The metric need not be full rank in the sense of only giving zero distance when $(i,j)$ is such that $Y_i = Y_j$.

• Sequential hot deck – Responding and non-responding units are treated in a sequence and a missing value is replaced by the previous responding value in the sequence. For example, if n = 6, r = 3, and $X_1$, $X_4$ and $X_5$ are present while $X_2$, $X_3$ and $X_6$ are missing, then $X_2$ and $X_3$ are replaced by $X_1$ and $X_6$ is replaced by $X_5$. If $X_1$ is missing, a starting value is necessary, possibly chosen from records in a previous survey. A small sketch of the nearest-neighbour and sequential variants follows this list.
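A minimal sketch of two of the donor schemes above, using hypothetical units and a single scaled covariate: nearest-neighbour selection within an adjustment class, and sequential hot deck along the file order.

```python
# Hypothetical units: (adjustment class, covariate y, target x; None = missing).
units = [("small", 1.0, 10.0), ("small", 1.2, 11.0), ("small", 1.1, None),
         ("large", 9.0, 80.0), ("large", 8.5, None)]

def nearest_neighbour_hot_deck(data):
    """Donor = respondent in the same class minimising |y_i - y_donor|."""
    imputed = []
    for cls, y, x in data:
        if x is None:
            donors = [(abs(y - yd), xd) for c, yd, xd in data if c == cls and xd is not None]
            x = min(donors)[1]          # recorded value of the closest donor
        imputed.append((cls, y, x))
    return imputed

def sequential_hot_deck(values, start=0.0):
    """Replace a missing value by the previous responding value in the sequence."""
    out, last = [], start
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

print(nearest_neighbour_hot_deck(units))
print(sequential_hot_deck([10.0, None, None, 12.0, None, 13.0]))
```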

3.1.1.2.2 Substitution This is a method for dealing with unit non-response at the fieldwork stage of a survey. It replaces non-responding units with alternative units not selected for the sample. For example, if a household cannot be contacted, it may be replaced by a previously non-selected household in the same housing block. The main problem with this procedure is that the units used for the replacement may differ systematically from non-respondents. Therefore, at the analysis stage, substituted values should be regarded as imputed values. 3.1.1.2.3 Cold deck imputation A missing value is replaced by a constant value from an external source, such as a value from a previous realization of the same survey.


3.1.1.2.4 Composite methods
It is also possible to combine procedures from different methods. For example, hot deck and regression imputation can be combined by calculating predicted means from a regression and then adding a residual randomly chosen from the empirical residuals to the predicted value when forming values for imputation. Another possible combination is hot deck within adjustment classes with nearest neighbour hot deck or sequential hot deck, where the donor is chosen within the previously defined classes.

3.1.2 Multiple imputation methods
In multiple imputation methods, each missing value is replaced by several imputed values, leading to the creation of the same number of complete data sets, i.e., replacing each missing value by the first imputed value creates the first complete data set, replacing each missing value by the second imputed value creates the second complete data set, and so forth. Then, standard complete-data methods are used to analyse each data set, and the several inferences obtained can be combined to form one inference that properly reflects uncertainty caused by non-response under the model assumed for imputation. Imputing a single value, as in single imputation methods, treats that value as known; thus, without special adjustments, single imputation cannot reflect sampling variability under a model for non-response or uncertainty about the correct model for non-response. Multiple imputation shares the advantages of single imputation and corrects both disadvantages. The only disadvantage is that it is computationally more demanding and implies more work to create the imputations and analyse the results. This extra work, however, is modest for the computing resources available today, since it involves executing the same task several times instead of only once. The analysis of data sets with multiple imputation is direct. In fact, each data set completed by imputation is analysed with the same complete-data methods that would be used in the absence of non-response. Let $\hat{\theta}_t$, $t = 1, \ldots, T$, be the T complete-data estimates of a parameter $\theta$ calculated from T repeated imputations for each missing value. The combined estimate is

$$\bar{\theta} = \frac{1}{T}\sum_{t=1}^{T} \hat{\theta}_t. \qquad (3.5)$$

Averaging over T imputed data sets in (3.5) increases the efficiency (decreases the variance) of the estimate relative to single imputation methods.

3.2 Some general procedures applied in Countries
Imputation methods depend on the specific survey and type of data. However, in many surveys there is no imputation of non-response, i.e., missing values are discarded from the data and from the analysis, or their use is deferred until a response is received. Such surveys are not discussed here; we focus on those where imputation is performed. Subjective imputation is exceptionally allowed when it is possible to estimate the missing value from other sources, for example estimating a company's product output from its exports of the same product. We start by describing some of the methods used in different countries.

3.2.1 Foreign Trade
A possible approach is based on the assumption that the behaviour of non-respondents and respondents is similar, so the growth rate of the latter is applied to the whole universe. In practice, the respondents' growth rate relative to the same period of the previous year is

computed and applied to the total data of that period (the same period of the previous year). This procedure implicitly corresponds to using deterministic regression and nearest neighbour hot-deck imputation, even though no specific metric is applied to select the donors. The selection is based on the assumption above, which means that the donors are “close” or “similar” respondent units (companies). In the following periods, imputed values are replaced by the actual data received, decreasing or eliminating the missing values.

3.2.2 Industrial Output
• Infra-annual data – A similar, deterministic-regression type of method is used. The missing value of a given product i in unit (company) j at time t is replaced by the imputed value $\hat{X}_{i,j,t}$ such that

$$\hat{X}_{i,j,t} = X_{i,j,t-m} \times \frac{X_{i,j,t-1}}{X_{i,j,t-(m+1)}}$$

where m is the annual frequency of the data, i.e., m = 12 for monthly data, m = 4 for quarterly data, etc. Therefore, at time t the growth rate of time t−1 relative to the same period of the previous year, t−(m+1), is computed – for example, for monthly data it is the rate from month t−13 to month t−1; for quarterly data it is the rate from quarter t−5 to quarter t−1, etc. The imputed value for time t is then calculated by applying that rate to the value of the same period of the previous year, i.e., to the value at time t−m.

• Annual data – This method is of the same type as the previous two. Like the foreign trade procedure, it combines the nearest neighbour and the deterministic regression methods. In fact, the missing value of a given product i in unit (company) j in year t is replaced by the imputed value $\hat{X}_{i,j,t}$ such that

$$\hat{X}_{i,j,t} = X_{i,j,t-1} \times \frac{\sum_{j \in R} X_{i,j,t}}{\sum_{j \in R} X_{i,j,t-1}} \qquad (3.6)$$

where R denotes the set of the respondents in both years involved (t and t−1). Note that the fraction in (3.6) is the annual growth rate of the total (respondent) output of product i, i.e., the output of all the respondent companies that manufacture product i. Therefore, the imputed value in year t is calculated by applying this rate to the previous year's value of the missing unit. For the calculation of this growth rate, the quartiles of turnover are computed and (four) classes are formed by the companies lying between two consecutive quartiles (assuming that the turnover values are available). Expression (3.6) is then applied within each class, i.e., the growth rate used in the imputation procedure is based on the respondent units belonging to the same class as the non-respondent. This provides a more homogeneous and reliable estimation, given the different unit sizes. A small sketch of this growth-rate imputation follows.
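The sketch below applies the growth-rate logic of expression (3.6) to hypothetical company-level figures; for simplicity it uses a single class, whereas in practice the rate would be computed within turnover classes.

```python
# Hypothetical output of one product by company, for years t-1 and t (None = no response in t).
previous = {"c1": 100.0, "c2": 250.0, "c3": 60.0, "c4": 90.0}
current  = {"c1": 105.0, "c2": 265.0, "c3": None,  "c4": None}

# Respondents in both years define the growth rate used in (3.6).
respondents = [c for c in current if current[c] is not None and c in previous]
growth = sum(current[c] for c in respondents) / sum(previous[c] for c in respondents)

# Apply the respondents' growth rate to the previous-year value of each non-respondent.
imputed = {c: (v if v is not None else previous[c] * growth) for c, v in current.items()}
print(f"growth rate: {growth:.3f}")
print(imputed)
```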

3.2.3 Commercial Companies Survey Imputation of non-responses is based on a combination of the hot deck within adjustment classes and the nearest neighbour hot deck methods, where the donor company is selected among those with the same activity (sector) and number of employees which forms the classes. It is also possible to use the cold deck method where imputation is based on the information provided by the non-respondent in the previous survey or from other sources. This is the procedure adopted for larger companies.


3.2.4 Employment Survey
Since this is a predominantly categorical (qualitative) survey, unit non-response is dealt with by reweighting, i.e., by applying a correction factor for non-response to the initial weight of each unit. This correction should be such that the missing units are represented by others with the same, or at least very similar, characteristics. Missing data caused by item non-response is usually not imputed, since most of the time “No answer” or “Unknown” are options in the survey. However, response is usually mandatory for most of the variables inquired.

3.2.5 Annual Survey of Hours and Earnings
Missing data on employees are imputed using hot deck within adjustment classes, where the donors are other employees whose characteristics are similar, with respect to the variable(s) to be imputed, to those of the employees with missing information; these groups form ‘imputation classes’. The choice of imputation classes is based partly on the results of the analyses completed to determine optimal stratification, supplemented with variables that are relevant to pay. The resulting imputation classes are determined by variables such as occupation class, region, gender and age group. In other surveys of this type, like the Private Sector Statistics on Earnings, no data imputation is done.

3.2.6 Survey of Business Owners Missing data in several variables such as gender, origin and race are imputed from donor respondents with similar characteristics (location, industry, employment status, size, and sampling frame), i.e., the imputation method used is a combination of hot deck within adjustment classes and nearest neighbour hot deck. 3.2.7 Building Permits Survey • Unit (total) non-response – Based on the nearest neighbour hot deck method, missing data

are imputed for municipalities that fail to send in their reports for the current period. The data are calculated automatically, subject to certain constraints, by applying the month-to-month and year-to-year variations in similar values of responding municipalities and the historical pattern of the missing municipalities to the previously used values. At the end of the year, the imputed values are replaced with actual data received from late-reporting municipalities and final estimates are produced. If the actual data are not received, current values that have been imputed are assigned a value of 0 to replace the imputed data.

• Item (partial) non-response – When partial data are received (for example, the value of a project is missing), the missing characteristics are imputed on the basis of the average values for similar projects in the municipality's area, i.e., the conditional mean imputation method is used where the classes are formed from similar projects in the municipality’s area.

• No imputation is done for some variables, such as permit undervaluation or failure to apply for a permit for construction work.

3.2.8 Housing Rents Survey • Unit (total) non-response – There is no imputation for unit non-response in any period of

the year (such as a month or a quarter), i.e., the missing value is not estimated and is discarded. If the non-response occurs more than two consecutive times in a year, the

method used is substitution where non-respondent units are replaced by others in the same stratum.

• Item (partial) non-response – The method applied is conditional mean imputation, since missing values are imputed with the mean of their stratum.

3.2.9 Basic Monthly Survey Imputation is performed using a combination of the hot deck within adjustment classes and nearest donor hot deck methods, whereby a response from another sample person with similar demographic and economic characteristics is used to impute the non-response. The imputation procedure is performed one item at a time.

3.3 Some general procedures applied in Eurostat Imputation is performed in most statistical data provided by Member States and also in some surveys conducted by Eurostat such as the Community Innovation Survey (CIS2), the Continuing Vocational Training Survey (CVTS) and the European Community Household Panel (ECHP). We describe the main procedures adopted in the latter surveys (Eurostat Internal Document, 2000) as good examples of imputation procedures applied by Eurostat. 3.3.1 Community Innovation Survey A strong effort is recommended in order to minimize non-response. Imputation of missing data is based on the information from related variables. • Quantitative variables – The method applied is conditional mean imputation, since missing

values are imputed with the mean of their stratum. In fact, the method is composed of two steps. In the first step, outliers are discarded from the sample used to calculate the mean. Then, in the second step, imputation is performed only if the proportion of missing values in the stratum is smaller than 50% (computed after removing the outliers from the sample); otherwise the stratum is merged with a neighbouring size class in the same NACE class, and this procedure is repeated until the condition is satisfied. Finally, the mean of the stratum or of the regrouping chosen is imputed to the missing data. This procedure is first implemented within NACE divisions. If the proportion of missing values is still higher than 50% when all size classes have been grouped together within one or more NACE divisions, imputation is not performed at that level and the method is instead implemented within a subsection of NACE. If the condition is still not satisfied, missing data are estimated by implementing the method on the whole population. A rough sketch of this stratum-merging rule follows this list.

• Qualitative (ordinal and nominal) and percentage variables – A combination of the hot deck within adjustment classes and the sequential hot deck methods is used for imputation. First, the unit response is partitioned into classes of similar elements based on auxiliary variables (economic activity and number of employees). A missing value is replaced by the nearest preceding responding value to that item in the same class. More specific methods are also considered according to the nature of the variable with the missing observation. – Ordinal variables – the method considered here is nearest neighbour hot deck where

imputation is based on quantitative variables which are first classified in interval classes, i.e., turned into ordinal variables. When the observation of an ordinal variable is missing, a donor is found for imputation by looking for the nearest neighbour based on the classes of the quantitative variables and the imputed value is the donor’s value of that ordinal variable.


– Nominal variables – the same procedure is used, but both quantitative and ordinal variables are used to construct imputation classes.

– Percentage variables – it is again the same procedure, but quantitative, ordinal and nominal variables are used to construct imputation classes.
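As a rough, simplified sketch of the stratum-merging rule described for quantitative variables (hypothetical strata ordered by size class within one NACE class; the outlier-removal step is omitted and merging is only done with the next size class), the procedure below merges strata until fewer than 50% of the values are missing and imputes the resulting mean.

```python
from statistics import mean

# Hypothetical strata (size classes within one NACE class), ordered by size.
# Each stratum is a list of observed values; None marks a missing value.
strata = [
    [None, None, 12.0],          # 2/3 missing -> must be merged with a neighbour
    [14.0, None, 15.0, 16.0],    # neighbouring size class
    [30.0, 31.0, None, 29.0],
]

def impute_with_merging(strata, max_missing=0.5):
    """Merge a stratum with its next size class until the missing share drops below the threshold."""
    imputed = []
    for i, stratum in enumerate(strata):
        pool, j = list(stratum), i
        while sum(v is None for v in pool) / len(pool) >= max_missing and j + 1 < len(strata):
            j += 1
            pool += strata[j]                       # merge with the next size class
        fill = mean(v for v in pool if v is not None)
        imputed.append([v if v is not None else fill for v in stratum])
    return imputed

print(impute_with_merging(strata))
```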

3.3.2 Continuing Vocational Training Survey Imputation is allowed only when some conditions are satisfied in terms of the percentage of responses. The methods adopted are different according to the nature of the missing data. • Quantitative variables – A combination of the hot deck within adjustment classes and the

nearest neighbour hot deck methods is used, since data from other companies in the same sector and size group is used for imputation.

• Qualitative variables – Similarly to quantitative variables, companies are grouped into adjustment classes first. Then, a combination of the hot deck within adjustment classes and the hot deck by simple random sampling or the sequential hot deck methods is considered.

3.3.3 European Community Household Panel Imputation is only performed for the most important variables and the method depends on their type. • Discrete variables – The imputation method is hot deck within adjustment classes. • Continuous variables – stochastic regression imputation or hot deck within adjustment

classes. 3.4 Guidance on data imputation 3.4.1 Stages of imputation of missing data Imputation of missing data may have to be performed at each of the three validation stages displayed in table 1 where it is also shown who is responsible for it. • The first stage concerns microdata and should be performed by Member States after the

collection stage, since they are responsible for conducting the survey, even when Eurostat receives this type of data. For example, if a company, a household, a person or a municipality does not respond to a question (item non-response) or does not respond to the survey (unit non-response), substitution or imputation has to be performed by the country's statistical agency. Prompt action will speed up the process and improve the quality of the adopted procedures.

• The second stage concerns country data, i.e., the country aggregates. Failure by a Member State to send a data set to Eurostat is equivalent to missing data, and imputation has to be performed by the latter. A country's non-response may be flagged as “Non-available”, but such a procedure would prevent the computation of some (Eurostat) aggregates. Therefore, imputation is necessary here.

• The third stage concerns aggregate (Eurostat) data before their dissemination. Normally, no imputation will be necessary here, but it is possible that some inconsistency in country data is found only at this stage requiring its replacement by some imputed value. The country involved will hardly be able to provide corrected data or an imputed value on time for dissemination, implying that imputation has to be performed by Eurostat. This final check before the data are made public may show some problems previously undiscovered.

The imputation methods used have to take into account the data type and consequently the stage of their application. We discuss the main procedures that can be applied in each stage.


3.4.2 Micro data
At the data collection stage, imputation of missing values should be performed by Member States, even when Eurostat receives the micro data. For a particular survey, the same or similar imputation methods should be used by all Member States, and Eurostat can play an important role in achieving harmonization of procedures among the different countries. As discussed above, several imputation methods may be applied to any given situation where missing data arise. The choice of the most appropriate method depends on the particular survey and type of missing data. Therefore, only general guidelines can be given:
• Substitution of non-respondents is a possibility. However, the sampling of the replacements should be done carefully and it can only be done by the statistical organization conducting the survey. A high substitution rate is a potential source of bias.
• Mean imputation has the drawback that it distorts the empirical distribution of the sampled values, since it imputes values at the centre of that distribution, although the impact depends on the type of estimates.
• Stochastic regression imputation is a common procedure, but it depends on the availability of other information to run the regression. Moreover, stochastic regression can avoid the distortions, caused by imputing the mean of the predictive distribution, that arise from the deterministic approach.
• Hot deck methods are also frequently encountered; their simplest implementation is the nearest neighbour method.
• These methods are sometimes applied within adjustment classes, which allows intra-class homogeneity to be taken into account.
• Multiple imputation is more complex to implement, but it may allow for more correct inference under the imputation mechanism if the distributional assumptions are met.
• Cold deck imputation is also a possible approach, for instance using information from previous realizations of the same survey. However, some care should be taken to account for changes over time.
• Finally, the combination of procedures from different methods is usually required, i.e., composite methods are usually necessary. The best choice has again to be made by the entity conducting the survey, although common procedures and rules should be adopted for the same survey. Eurostat should play an important role here.

3.4.3 Country data
When Member States send their data sets to Eurostat, it is possible that part of, or the whole, data set from a Member State is missing, thus requiring imputation. The former situation occurs, for example, when a Member State sends the country aggregates of foreign trade statistics but fails to send the data on bilateral trade with a given partner. The latter situation means that the country involved did not send its data to Eurostat at all. In both cases, Eurostat has to impute the missing data or, alternatively, flag them as “Non-available”, which is not convenient and may even be inadequate because it may prevent the calculation of some European aggregates. Therefore, imputation methods have to be applied here and, again, the appropriate method for any given situation depends on the statistical project. Some general suggestions are as follows.
• Obviously, substitution of non-respondents is not an option here.


• Mean imputation should also be avoided. Replacing a country’s missing value with the mean of the respondent countries may lead to serious distortions. Foreign trade or Consumer prices are good examples where this procedure can be extremely incorrect. The same remarks apply to imputation of the missing part of a country’s data set.

• Stochastic regression imputation might be difficult to apply. If a country's data are missing (the country fails to respond), it is possible to impute the value based on a regression with data from other countries or from other projects of the same country, but the predictions generated from the fitted model have to be carefully checked against historical data before this procedure can be applied. In fact, it is easy to obtain wrong predictions if the selected regressors are not suited for this purpose. If only part of the country's data set is missing, this procedure may be easier to apply, because it may be easier to find appropriate variables for the regression model, but caution is still required here. In both cases, time series or time regression models may be a good alternative (discussion of these methods is beyond our scope and can be found in the literature).

• Hot deck methods are more straightforward and easier to apply in this situation. When the total country data is missing, sequential or random sampling hot deck methods make no sense, since it is not correct to pick a country at random or to choose the preceding responding country to replace the missing value of another. This “blind” procedure may easily lead to the imputation of a totally inadequate value – clear examples are foreign trade or industrial statistics, where replacing the (missing) value of a small country by the value of a much larger country makes no sense. Therefore, the nearest neighbour method seems in fact to be the best method. However, one has to be careful with its application, since the “nearest” neighbour may still provide values far from the missing data. Checking this procedure with historical data is required. On the contrary, there is a situation where this is the most correct method and should be applied – it is the case of foreign trade statistics, where the flows from or to a given country can be computed from the bilateral flows with its partners. This procedure is simple and correct and should be applied in situations of this kind (we have already mentioned this procedure in the previous section). If only part of the data set from a country is missing, nearest neighbour hot decking is also recommended, either from other data from the same country or from another country. The above example with foreign trade data is also valid here.

• As for micro data, these methods should be applied within adjustment classes. This is particularly important here since we are dealing with data from several countries and their different sizes may lead to totally distorted imputations if the donor is not carefully selected. That is, only countries of the same type (class) should be considered as donors and this issue depends crucially on the statistical project at stake. When only part of a country’s data is missing, definition of adjustment classes for imputation purposes is also important and it may be done either based on data from the same country or on data from other countries (or even a combination of both).

• Stochastic methods are usually better than deterministic ones, but the distribution of the stochastic residual used for imputation has to be carefully selected, namely its variance. In fact, this parameter should be estimated from data coming from similar countries, i.e., using adjustment classes is crucial here. Otherwise, the imputation results will most likely be severely distorted. In some situations, adding a stochastic residual makes no sense – it is the case of imputing the value of a flow of a country’s foreign trade with the bilateral trade with the partner involved. In such situations, deterministic imputation is obviously the right choice.


• Multiple imputation is a good choice, although it may not be very easy to implement for country data. In fact, finding several appropriate donors may not be easy, since the number of countries involved is not very large and the donor countries' characteristics have to be very similar to those of the non-respondent country. Nevertheless, when it is possible, even if only a few donors are found, this approach may be adequate and improve the quality of imputation. If part of a country's data set is missing, multiple imputation may be easier to apply, either based on other data from the same country or on data from another one (or a combination of both). For example, several partners can be selected for imputation, or the output of several other industrial sectors.

• In some situations, the cold deck method can also be used, where information from other sources or from previous realizations of the same survey are used for imputation. This approach has to be carefully applied, since those “other sources” may not be adequate for the purpose of imputation (another country, for example) or, if that information is obtained from previous realizations of the same survey, it is important to take into account that variables evolve over time and the value selected for imputation may be very different from the missing value.

• As for micro data, it is usually necessary to combine several of the methods available (composite methods). For example, in order to impute country data, similar countries have to be used as donors, i.e., application of any method has to be done within adjustment classes because mixing together countries with different characteristics may easily distort the imputation performed. This very important problem applies to almost every statistical project and has been mentioned above several times.

3.4.4 Aggregate (Eurostat) data
Before dissemination, data sets have to be complete, i.e., free of missing data. Therefore, the necessary imputation of missing values has to be fully performed and no more corrections should be needed at this stage. However, inconsistencies, discrepancies or other problems may become apparent only when country data are aggregated together (growth rates or flows to or from other geographical zones, for example). This issue has a strong connection with data checking and editing (previous section). When these problems are detected, their source has to be identified by Eurostat, and it is possible that, in order to correct them, the data sent by a country (or by several) have to be discarded, preventing the computation of Eurostat aggregates and their dissemination. Therefore, in order for the data to be made available, adjustment/imputation is required. This is the final stage where such a procedure can be put into practice by Eurostat, and the same methods described above for country data apply here. After this final adjustment is complete, Eurostat aggregates can be computed.

3.4.5 Concluding remarks
Missing data is a problem that can arise at any of the three stages of data production and levels of aggregation: the micro-data level, the country data level and the aggregate (Eurostat) data level. In order to correct it and provide a complete data set, imputation is required and it should be performed at the earliest stage possible, i.e., the more thorough and rigorous imputation is at any stage, the fewer problems will be left to be solved at later stages. Moreover, the earlier the stage, the better and more accurate imputation can be. So, at the first stage (micro-data level), Member States are responsible for implementing it; at the second (country data level) and third (Eurostat data level) stages, it should be carried out by Eurostat. Several imputation methods are available, but their application depends on the stage and the data set where missing values are found. After these values are estimated, a complete data set is available for estimation and inference, which is far better than using only the available data. The higher the quality and accuracy of the imputation methods, the better those inferences will be. The assessment of this quality can be achieved through simulation studies. Finally, time series techniques, not covered in this document, can also be considered for infra-annual statistics if enough good-quality historical data are available.


4. ADVANCED VALIDATION
The previous sections discussed several important aspects of validation, namely how to detect and correct several kinds of problems in the data before dissemination. Other techniques, like outlier detection and exploratory data analysis, can complement the previous approaches to improve data quality. They are briefly reviewed in the following sections.

4.1 Literature review
The literature on outlier detection is vast, reflecting the very large number of methods that have been proposed. The main reference is Barnett and Lewis (1995), which we will follow closely. Other references will also be used depending on the methods discussed. An outlier is an extreme observation in the sense that it is surprisingly different from the other observations, leading one to think that it may have been generated by errors due to measurement, collection, coding, recording, transcription, processing … or to the model. Outliers can distort data analysis and inference. In fact, as shown by Barnett and Lewis (1995), their presence can cause bias and a strong loss of efficiency in several estimators, even asymptotically; they can also seriously affect confidence intervals, whose confidence level may differ substantially from the nominal level, which is especially undesirable if the former is smaller than the latter; likewise, the size and power of hypothesis tests may also be severely distorted.

4.1.1 Strategies for handling outliers
The first approach is to test whether a “suspicious” observation is in fact an outlier, which can be done with discordancy tests. The aim here is to reject (or replace) it from the data set or to identify it as a feature of special interest, i.e., as a manifestation of unsuspected factors of practical importance. From the point of view of production and dissemination of statistical information this is a crucial issue, since the objective is to free the data from any errors and to provide “clean” data sets. Therefore, outliers are usually considered a nuisance and their detection can lead to the rejection of such observations. The second approach is the accommodation of outliers, where statistical methods are designed to draw valid inferences about the population from the sample without being seriously distorted by the presence of such observations. They “accommodate” the outliers at no serious inconvenience, i.e., they are robust against their presence. This approach mainly concerns the data analyst, not the data producer, since accommodation is a strategy for data analysis and inference. The role of any statistical agency comes at an earlier stage and is mostly concerned with the detection of outliers and their rejection and consequent correction whenever necessary. We now briefly review the main methods of both approaches.

4.1.2 Testing for discordancy
The objective of these procedures is the detection of outliers, but they are also useful to reveal several aspects of, and eventual problems in, the data that may remain undiscovered by other analyses and methods. Therefore, any statistical method that may help is valid for this purpose and, consequently, we can only describe a few. We will thus focus on the most commonly used, always taking their simplicity into account.

4.1.2.1 Exploratory data analysis
Every tool commonly used in exploratory data analysis can be very helpful here, since these tools are designed to show and measure the main characteristics of the data set, which can uncover relevant differences and problems. Graphical displays (dot plot, stem-and-leaf,


barplot, pie chart, histogram, boxplot, qq-plot, scatter plot, etc.) and numerical coefficients (mean, median, mode, quartiles, variance, standard deviation, coefficient of variation, interquartile range, skewness and kurtosis coefficients) are extremely useful and easy to use, since they are extensively discussed in the literature and are included in any statistical software (useful references are Heiberger and Holland, 2004, and Wilkinson, 2005). Among all the different measures and graphs, the boxplot deserves a special reference, because it is particularly good at showing the main characteristics of the data and the existence of outliers. In fact, it was designed to do so. Letting $Q_1$ and $Q_3$ denote the first and the third quartiles respectively and $IQR = Q_3 - Q_1$ the interquartile range, the commonly used rule is the following:
• A data value $X_i$ is considered a moderate outlier if $X_i < Q_1 - 1.5\,IQR$ or if $X_i > Q_3 + 1.5\,IQR$.
• A data value $X_i$ is considered a severe outlier if $X_i < Q_1 - 3\,IQR$ or if $X_i > Q_3 + 3\,IQR$.
The values $Q_1 - 3\,IQR$ and $Q_3 + 3\,IQR$ are called the lower and upper outer fences respectively, and the values $Q_1 - 1.5\,IQR$ and $Q_3 + 1.5\,IQR$ are called the lower and upper inner fences respectively. The boxplot clearly marks the eventual outliers, often using different symbols for moderate (for example, an asterisk) and for severe (for example, a circle) outliers. Moreover, the value taken for the lower whisker is the lowest observation below $Q_1$ that does not cross the lower inner fence, and the value taken for the upper whisker is the highest observation above $Q_3$ that does not cross the upper inner fence, i.e.:
• Lower whisker $= \min\{X_i : Q_1 - 1.5\,IQR \le X_i \le Q_1\}$.
• Upper whisker $= \max\{X_i : Q_3 \le X_i \le Q_3 + 1.5\,IQR\}$.
A minimal computational sketch of this fence rule follows.
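A minimal sketch of the fence rule just described, on a hypothetical sample; note that quartile conventions differ slightly between statistical packages, so the flagged set may vary marginally.

```python
from statistics import quantiles

# Hypothetical sample with one suspicious value.
x = [9.8, 10.1, 10.4, 10.2, 9.9, 10.0, 10.3, 10.1, 17.5]

q1, _, q3 = quantiles(x, n=4)              # quartiles (convention varies by package)
iqr = q3 - q1
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences

for value in x:
    if value < outer[0] or value > outer[1]:
        print(f"{value}: severe outlier")
    elif value < inner[0] or value > inner[1]:
        print(f"{value}: moderate outlier")
```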

This choice of whiskers makes outlier detection easier. This graphical representation can also be used for bivariate data by constructing conditional boxplots. For each category of a qualitative variable or each interval of a quantitative variable, a boxplot of the other (quantitative) variable is represented, thus showing the pattern of the latter according to the categories or classes of the former. Data analysis, and in particular outlier identification, becomes clearer, showing characteristics that would not be visible otherwise. Bivariate boxplots are also an alternative, but their interpretation is not very simple and clear, so conditional boxplots are preferable. In a multivariate setting with more than two variables, conditional boxplots may be used for each pair of variables. In conclusion, every plot and numerical measure should be used for improving quality and discovering problems in the data. Among graphical displays, boxplots stand out and are particularly useful, especially for outlier detection. These plots should always be analysed in any data set.

4.1.2.2 Statistical testing for outliers
Many test statistics have been proposed for outlier testing, often based on intuitive grounds. We first discuss tests for a single outlier in the sample and consider the multiple outlier problem next. We assume that the data are a sample of size n, $X_1, \ldots, X_n$, and the order statistics will be denoted by $X_{(i)}$, $i = 1, \ldots, n$, i.e., $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$.


4.1.2.2.1 Single outlier tests We first assume that there may be only one outlier in the sample. The test statistics that have been proposed to this end may be grouped in the six main classes described below. • Excess/spread statistics – These are ratios of differences between an outlier and its nearest

or next-nearest neighbour to the range, or some other measure of dispersion of the sample (possibly omitting the outlier and other extreme observations). Assuming we are testing for an upper outlier $X_{(n)}$, some examples of these statistics are:

$$\frac{X_{(n)} - X_{(n-1)}}{X_{(n)} - X_{(1)}} \quad \text{or} \quad \frac{X_{(n)} - X_{(n-1)}}{X_{(n)} - X_{(2)}}, \quad \text{or} \quad \frac{X_{(n)} - X_{(n-1)}}{\sigma}. \qquad (4.1)$$

where σ is the standard deviation of the basic model (normal population, for instance) which can be replaced by an estimate based on the sample, possibly without the observations that may be outliers or other extremes.

• Range/spread statistics – The numerator is replaced with the sample range, such as

$$\frac{X_{(n)} - X_{(1)}}{s}$$

where s, the sample standard deviation, may be based on the sample without the outliers or other extreme observations. If the population standard deviation is known, it should be used instead of s. Using the range has the disadvantage that a significant result can represent discordancy of an upper outlier, a lower outlier or both.

• Deviation/spread statistics – In the numerator, the distance between an outlier and some sample measure of location is used. An example for an upper outlier is

$$\frac{X_{(n)} - \bar{X}}{s}$$

and, for a lower outlier,

$$\frac{\bar{X} - X_{(1)}}{s}.$$

Like s, $\bar{X}$ may also be based on the restricted sample or replaced by the population value, if known. A modification uses the maximum deviation in the numerator, such as

$$\frac{\max_i |X_i - \bar{X}|}{s}.$$

• Sums of squares statistics – These are ratios of sums of squares for the restricted and total samples, such as

$$\frac{\sum_{i=1}^{n-2}\left(X_{(i)} - \bar{X}_{n,n-1}\right)^2}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} \qquad (4.2)$$

where $\bar{X}_{n,n-1} = \dfrac{\sum_{i=1}^{n-2} X_{(i)}}{n-2}$; this statistic was proposed for testing two upper outliers $X_{(n-1)}$ and $X_{(n)}$ in a normal sample.


• Extreme/location statistics – These are ratios of extreme values to measures of location, such as

$$\frac{X_{(n)}}{\bar{X}} \qquad (4.3)$$

to test for an upper outlier. • Higher-order moment statistics – Statistics such as measures of skewness and kurtosis, not

specifically designed for outlier testing, can also be useful for our purpose. Examples:

$$\frac{\sqrt{n}\,\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^3}{\left[\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2\right]^{3/2}} \quad \text{or} \quad \frac{n\,\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^4}{\left[\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2\right]^{2}}.$$

A small computational sketch of some of the statistics in this list follows.
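To make some of these single-outlier statistics concrete, the sketch below computes the maximum deviation/spread statistic and the skewness- and kurtosis-type measures for a hypothetical sample; the critical values against which they would be compared depend on the assumed distribution and are tabulated in Barnett and Lewis (1995).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample with a suspicious upper value.
x = [5.1, 4.8, 5.0, 5.3, 4.9, 5.2, 5.0, 8.9]
n, xbar, s = len(x), mean(x), stdev(x)

# Deviation/spread statistic: max_i |X_i - mean| / s.
max_dev = max(abs(v - xbar) for v in x) / s

# Higher-order moment statistics (skewness- and kurtosis-type measures).
m2 = sum((v - xbar) ** 2 for v in x)
skew = sqrt(n) * sum((v - xbar) ** 3 for v in x) / m2 ** 1.5
kurt = n * sum((v - xbar) ** 4 for v in x) / m2 ** 2

print(f"max deviation/spread: {max_dev:.2f}, skewness: {skew:.2f}, kurtosis: {kurt:.2f}")
```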

Other classes of tests have been proposed, namely based on the maximum likelihood ratio principle and the principle of local optimality. Barnett and Lewis (1995) derive the test statistics and their properties for several distributions: normal, gamma (including exponential), log-normal, uniform, Poisson and binomial, among others. However, since these tests rest on strong assumptions about the distributional properties of the population, namely the knowledge (or the assumption) of that distribution (as in the tests derived by Barnett and Lewis mentioned in the previous paragraph), these approaches are not very useful for our purposes. In fact, in the data production process that distribution is unknown and it is not possible to establish any realistic assumptions about it.

4.1.2.2.2 Multiple outlier tests
It is possible to have more than a single extreme observation in the sample that has to be tested for discordancy, a situation we have already mentioned above. For example, consider a sample where $X_{(n-1)}$ and $X_{(n)}$ are unusually far to the right of the other $(n-2)$ observations, or a sample where $X_{(1)}$ and $X_{(2)}$ are unusually far to the left of the other $(n-2)$ observations. These two pairs need to be tested for discordancy, because the first may be a pair of upper outliers and the second may be a pair of lower outliers. In these multiple outlier situations there may be several outlying observations, and there are two types of procedures to test them for discordancy: block procedures and consecutive procedures. The considerations about single-outlier tests remain valid in multiple outlier situations. In block procedures, possible outliers are tested jointly as a whole and not separately, i.e., we either accept them both as consistent with the rest of the sample (no outliers), or conclude that they are both discordant. As an example, consider the situation of testing a pair of upper outliers $X_{(n-1)}$ and $X_{(n)}$. Generalizing single outlier tests, namely tests of the type of (4.1) and (4.3) above, intuitive test statistics may be, respectively,

$$\frac{X_{(n)} - X_{(n-2)}}{X_{(n)} - X_{(1)}} \quad \text{and} \quad \frac{X_{(n-1)} + X_{(n)}}{2\bar{X}}.$$

Another example of a block procedure statistic is (4.2) above. Consecutive procedures simply involve the repeated use of single outlier tests. Suppose again that we are testing two upper outliers. The first step is to test whether $X_{(n-1)}$ is discordant, omitting $X_{(n)}$ from the sample. If $X_{(n-1)}$ is in fact considered to be an outlier, then $X_{(n)}$ is also automatically considered to be one, since it is even more extreme. If, on the contrary, $X_{(n-1)}$ is not considered an outlier, we proceed to test $X_{(n)}$ based on the whole sample, and it

may be considered discordant or not. This strategy, called the outward consecutive procedure, can be easily generalized to any number of outliers. There is no general guidance about which procedure to choose in any particular situation. When there is some prior information leading the analyst to focus on some particular subset of the data, then a block procedure may be more suitable.

4.1.2.3 Multivariate data
The task begins with the definition of an outlier, because a multivariate outlier no longer has a simple manifestation as an observation that stands out at one of the extremes of the sample. In fact, the sample has no “extreme” and a distance measure is necessary for this purpose. As a result, the discordancy tests available were proposed for specific multivariate distributions and are beyond our scope (Barnett and Lewis, 1995, discuss several procedures for multivariate normal, exponential and Pareto samples). Informal procedures are scarce and not easy to apply. Nevertheless, we will focus on cluster analysis, a well-known procedure for multivariate data that may help in identifying outliers. The goal of this analysis is to find natural groups (clusters) of items or variables such that the elements of each group are more similar to each other than to the elements belonging to the other groups. Grouping the data is based on a distance measure and it allows the analysis of observations with several dimensions and, as a by-product, the identification of possible outliers. There are several procedures to arrive at the final classification of the observations and, consequently, at the definition of clusters. It is recommended to apply several procedures with several distance measures. If the outcomes of the different methods are consistent, it is possible to conclude that there is a natural grouping of the data. The result of the application of these methods may be displayed as a dendrogram, which is a bivariate plot that shows the partitioning that was made along the way. The plot has the shape of a tree whose branches represent the clusters. Outliers may lead to the formation of one or more clusters showing the presence of that (those) group(s) of observations with particular characteristics, very different from the rest of the sample. Another possible way of identifying outliers is that their presence will most likely lead to at least one cluster with very scattered items. This strong variability inside that cluster suggests the existence of extreme observations. A detailed discussion of cluster analysis is beyond our scope (Everitt, Landau and Leese, 2001), but this is undoubtedly a powerful method for outlier identification in multivariate data.

4.1.3 Methods of accommodation
The concern here is the development and application of statistical methods designed to draw valid inferences about the population from which the sample was selected and that will not be seriously distorted by the presence of outliers. These procedures “accommodate” the outliers at no serious inconvenience, i.e., they are “robust” against the presence of outliers. So, the aim here is to present methods that are robust to the presence of outliers for inferential purposes. Therefore, we will only describe some commonly used methods, since drawing inferences is the analyst's task and not the main goal of a statistical organization. In fact, the mission of the latter is to produce the data that will be used later by the former for those inferences.
We will present some of the general methods that exist for constructing robust estimators, tests and confidence intervals. 4.1.3.1 Estimation of location


Suppose we want to estimate a location parameter, namely the population mean. The first robust estimator is the trimmed mean, whose objective is to control the variability due to the r lowest sample values $X_{(1)}, \ldots, X_{(r)}$ and to the s highest ones $X_{(n-s+1)}, \ldots, X_{(n)}$. If these r + s observations are omitted, the mean of the remaining observations is

$$\bar{X}^{T}_{r,s} = \frac{X_{(r+1)} + \cdots + X_{(n-s)}}{n - r - s},$$

which is the (r,s)-fold trimmed mean. If each of the r lowest values is replaced by the value of the nearest observation to remain unchanged, i.e., $X_{(r+1)}$, and likewise each of the s highest by $X_{(n-s)}$, so that we work with a transformed sample of size n, we obtain the (r,s)-fold Winsorized mean

$$\bar{X}^{W}_{r,s} = \frac{r\,X_{(r+1)} + X_{(r+1)} + \cdots + X_{(n-s)} + s\,X_{(n-s)}}{n}.$$

If r = s, they are called the r-fold symmetrically trimmed and Winsorized means respectively. Equivalently, it is possible to define the symmetric α-trimmed mean, where the amount of trimming is specified by the proportion 2α of the sample omitted (a proportion of α at each end), i.e., the mean is calculated with the central $n(1 - 2\alpha)$ observations. Similarly, we can also define the α-Winsorized mean. Clearly, the 0-trimmed and the 0-Winsorized means correspond to the usual sample mean, the (½)-Winsorized mean is the same as the sample median me, and the (¼)-trimmed mean is called the mid-mean. A possible generalization is given by the class of L-estimators, i.e., estimators having the form of linear combinations of the ordered sample values:

$$\sum_i c_i X_{(i)} \qquad (4.4)$$

where the weights $c_i$ are lower in the extremes than in the body of the data set. The sample median is a good example of this class of estimators, where $c_i = 0$ for all but the middle (or two middle, for even sample size) ordered observations. In fact, the sample median usually represents a good improvement in the presence of outliers, i.e., it is a very robust measure of location. Other classes have been proposed, such as M-estimators, where the estimate is obtained by solving an equation, or R-estimators, where the estimator is obtained from certain rank test procedures, such as the Wilcoxon test.

Estimation of dispersion
A natural approach to the robust estimation of dispersion is to use a robust location estimator. It is then possible to define the Winsorized variance. Another measure that has been proposed is the median deviation,

$$s_m = \mathrm{median}\{|X_1 - me|, \ldots, |X_n - me|\}$$

where me is the sample median. L-estimation can also be used to measure dispersion, in a similar fashion to the location estimation given in (4.4). For example, the semi-interquartile range is a useful L-estimator of dispersion:

$$SIQR = \frac{Q_3 - Q_1}{2}$$

where Q1 and Q3 are the first and the third quartiles respectively.
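The following sketch computes, for a hypothetical contaminated sample, a symmetrically trimmed mean, the median deviation and the semi-interquartile range as robust alternatives to the usual mean and standard deviation.

```python
from statistics import mean, median, quantiles

# Hypothetical sample contaminated by one extreme value.
x = [10.2, 9.8, 10.0, 10.1, 9.9, 10.3, 10.0, 25.0]

def trimmed_mean(data, r):
    """r-fold symmetrically trimmed mean: drop the r lowest and r highest values."""
    s = sorted(data)
    return mean(s[r:len(s) - r])

me = median(x)
median_deviation = median(abs(v - me) for v in x)   # s_m
q1, _, q3 = quantiles(x, n=4)
siqr = (q3 - q1) / 2                                 # semi-interquartile range

print(f"mean: {mean(x):.2f}  1-fold trimmed mean: {trimmed_mean(x, 1):.2f}")
print(f"median deviation: {median_deviation:.2f}  SIQR: {siqr:.2f}")
```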


Finally, robust methods have also been proposed for interval estimation and hypothesis testing. Basically, the usual estimators are replaced by robust equivalents in such a way that the distribution of the statistic used remains the same (Barnett and Lewis, 1995).

4.1.4 Time series analysis
Time series analysis is a powerful tool for handling outliers and uncovering other problems in samples of observations recorded sequentially in time. The only drawback is that it usually requires moderate or large samples. Therefore, we will not discuss this approach in great detail and will only mention some of its main features and advantages. Among many others, recent references are Pena, Tiao and Tsay (2001) and Wei (2006).
• Outlier detection and estimation – The study of outliers in time series has been approached

from two points of view. The first is the diagnostic approach, in which diagnostic methods are applied to the residuals of the estimated model to identify possible outliers, which are tested afterwards (this is the testing for discordancy approach discussed above). Once the outliers are identified, a model that incorporates them is proposed and the outlier effects and the model parameters are estimated jointly. The second is the application of a robust method, in which the estimation method is modified so that the parameter estimates are not contaminated by the presence of outliers (this is the accommodation approach discussed above). The existence of outliers makes the estimates of the parameters seriously biased and severely affects the correlation properties of the time series.

There are three main types of outliers: additive outliers, innovational outliers and level shifts. Let $X_t$ be the observed series and $Y_t$ be the outlier-free series, which we assume follows the ARMA model $\phi(B) Y_t = \theta(B) a_t$, where $a_t$ is a white noise process. An additive outlier is defined as

$X_t = Y_t + \omega\, I_t^{(T)}$

where $I_t^{(T)}$ is the indicator variable representing the absence ($I_t^{(T)} = 0$ for $t \neq T$) or the presence ($I_t^{(T)} = 1$ for $t = T$) of the outlier at time T. An innovational or innovative outlier is such that

$X_t = Y_t + \dfrac{\theta(B)}{\phi(B)}\, \omega\, I_t^{(T)}.$

Hence, an additive outlier affects only the level of the Tth observation, whereas an innovational outlier affects every observation from $X_T$ on. Level shifts correspond to a modification of the local mean or level of the time series from a specific point on. For a stationary series, a level shift implies a change in its mean after some point, turning it into a non-stationary series. The model for a level shift is

$X_t = Y_t + \omega\, S_t^{(T)}$

where $S_t^{(T)}$ is a step function that takes the value 0 before time T and the value 1 for $t \geq T$. Thus, a level shift can be seen as a sequence of additive outliers of the same size starting at some time point. More generally, a time series may contain several (additive or innovational) outliers or level shifts. The first step is to detect their presence, and an iterative procedure has been proposed for this purpose, starting with the assumption that there are none. It is also possible to estimate their effect on the values of the time series $X_t$ (a small simulation sketch of additive outlier detection is given after this list).

• Exploratory analysis – Prior to outlier testing and estimation, exploratory analysis of a time series can be extremely useful in detecting problems, including the presence of extreme observations. Such analysis is in fact an important step in (Box-Jenkins) ARIMA time series modelling. In particular, the correlogram of the series may show non-stationarity (in the mean or in the variance), outliers, level shifts and structural breaks or deviations from the trend. The boxplot of the time series can also reveal some of these problems, notably the presence of outliers.


• Decomposition models – Another helpful approach is the class of decomposition models, where a time series is expressed as the combination (sum or product) of four unobserved components, namely the trend, the cycle, seasonality and the residual component (a detailed account of these models may be found in Makridakis, Wheelwright and Hyndman, 1998). The components are estimated from the data, so the estimates can reveal any significant departures from the assumed model, in particular large deviations from the trend, changes in seasonality or high residual values. These departures can result from the occurrence of outliers or other problems, and their cause has to be investigated.
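As an illustration of the diagnostic approach described in the first bullet above, the following sketch simulates an AR(1) series contaminated with one additive outlier, fits the autoregressive coefficient by least squares and flags observations whose residuals are large relative to a robust scale estimate. It is a minimal, self-contained Python/NumPy example; the AR(1) model, the contamination size and the 3.5 threshold are illustrative assumptions, not values prescribed by the handbook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an outlier-free AR(1) series Y_t = 0.7 Y_{t-1} + a_t ...
n, phi = 200, 0.7
a = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + a[t]

# ... and contaminate it with an additive outlier at time T = 120.
x = y.copy()
x[120] += 6.0

# Diagnostic approach: estimate the AR(1) coefficient by least squares ...
phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
resid = x[1:] - phi_hat * x[:-1]

# ... and flag residuals that are large relative to a robust scale
# (the median deviation rescaled to be consistent with the standard
# deviation under normality).
scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
flagged = np.where(np.abs(resid) > 3.5 * scale)[0] + 1   # resid[0] is time 1

print("estimated phi:", round(phi_hat, 2))
print("flagged time points:", flagged)
```

Note that an additive outlier typically produces large residuals both at time T and at the following observation, which is one way it can be distinguished from an innovational outlier, whose effect appears in a single residual.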

4.2 Some general procedures applied in Countries

Statistical analysis of the collected data, and in particular explicit use of outlier detection and correction, is not a common procedure in surveys conducted by Member States. Therefore, implementation of such methods should be pursued at the micro-data level (by Member States) in order to identify and solve many of the problems that may arise. Dealing with those problems at this early stage of the production process will avoid worse complications and errors in subsequent stages, increasing the quality and expediting the dissemination of the data, two fundamental goals of any statistical organization. Nevertheless, several checks and controls carried out in data editing and correction aim at, and can in fact be efficient at, detecting outliers or other problems. Advanced validation is therefore intimately related to the data validation checks discussed above. This relationship is illustrated by the examples described below.

4.2.1 Foreign Trade

• Admissibility intervals – Being part of the data checking procedures, detection of non-admissible values is an example of outlier identification similar to a discordancy test, although no formal statistical method is applied. A good example of such procedures is the computation of the admissibility intervals in Foreign Trade statistics. They are calculated at the highest disaggregation level (8 digits) every month and are based on information collected in the previous year, according to the following steps (a code sketch of this procedure is given after this list):
  – Calculation of the mean unit value for all the observations available in the previous year.
  – The value with the largest deviation from the mean is discarded.
  – Calculation of the mean value of the remaining observations.
  – The value with the largest deviation from the new mean is again discarded.
  – Repetition of the above steps until all the observations are within an interval of width ±20% from the mean or until 20% of the observations are discarded.
  – The limits of the admissibility interval are the minimum and the maximum of the observations remaining at the end of this procedure.
  Note also that the mean calculated in the above steps is similar to a trimmed mean.

• Detection of large differences between the current period and historical data – This is similar to outlier detection, trend analysis or decomposition modelling in time series analysis, although no formal statistical test is performed. In fact, the “large” differences seem to be identified on the basis of the analyst’s judgement of the past behaviour of the data, without the application of any standardized statistical testing procedure. There is undoubtedly room for improvement here, since subjective judgement of the evolution of a time series should be avoided.
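The iterative trimming procedure described in the admissibility-intervals bullet above can be sketched as follows. This is a minimal Python/NumPy illustration of the steps as described, not Eurostat's production implementation; in particular, the interpretation of "±20% from the mean" as a band relative to the current mean is an assumption.

```python
import numpy as np

def admissibility_interval(unit_values, width=0.20, max_discard=0.20):
    """Iteratively discard the value furthest from the mean until all
    remaining values lie within +/- width of the current mean, or until
    a share max_discard of the observations has been discarded.  The
    interval limits are the min and max of the surviving observations."""
    v = np.sort(np.asarray(unit_values, dtype=float))
    n = len(v)
    while len(v) > (1 - max_discard) * n:
        m = v.mean()
        if np.all(np.abs(v - m) <= width * m):
            break
        v = np.delete(v, np.argmax(np.abs(v - m)))   # drop the worst value
    return v.min(), v.max()

# Example: unit values for one 8-digit product code in the previous year.
prices = [10.2, 9.8, 10.5, 11.0, 9.9, 10.1, 48.0, 10.4, 0.9, 10.0]
print(admissibility_interval(prices))
```

With these example data, the two gross values (48.0 and 0.9) are discarded and the admissibility interval is set by the minimum and maximum of the remaining observations.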


Fortunately, the validation rules of every survey conducted by Member States include error-checking procedures similar to this. For example, another procedure also used in several surveys starts by estimating the population mean and variance with previously collected data. Then an interval of width ±3 times the estimated standard deviation, centred on the mean (an approximate 99% confidence interval), is calculated, and values outside this interval generate a warning message. This empirical rule is based on the central limit theorem and is valid for aggregate data. The reliance on all these empirical rules and procedures means that there is much to be done in outlier detection and correction at the micro-data level.

4.2.2 Consumer Price Index

Outliers and large changes in prices are detected by the application of Tukey’s algorithm, according to the following steps (a code sketch follows the list):
• The ratio of the current price to the previous valid price (the growth factor) is calculated for each price (for items tested by price level rather than price change, this stage is omitted).
• For each item, the set of all such ratios is sorted into ascending order and ratios of 1 (unchanged prices) are excluded (for items tested by price level rather than price change, the prices themselves are sorted).
• The top and bottom 5% of the list are removed.
• Calculation of the mean of what is left, i.e., the 5%-trimmed mean, here called the mid-mean (in fact, this is a generalization of the mid-mean, which is the 25%-trimmed mean).
• Calculation of the upper and lower semi-mid-means, which are the mid-means of all the observations above or below the median respectively.
• The upper (lower) Tukey limit is the mid-mean plus (minus) 2.5 times the difference between the mid-mean and the upper (lower) semi-mid-mean. The upper (lower) limit is increased (decreased), as necessary, to ensure that all unchanged prices fall within the Tukey limits.
• Price growth factors, or price levels, outside the Tukey limits are flagged as unacceptable, i.e., they are considered outliers or discordant observations.
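The steps above translate directly into code. Below is a minimal Python/NumPy sketch of the Tukey limits for a set of price growth factors; the trimming proportion (5%) and the multiplier (2.5) follow the description above, while the handling of ties and of very short lists is an illustrative simplification.

```python
import numpy as np

def tukey_limits(ratios, trim=0.05, k=2.5):
    """Compute Tukey limits for a list of price growth factors."""
    r = np.sort(np.asarray(ratios, dtype=float))
    r = r[r != 1.0]                          # exclude unchanged prices
    cut = int(len(r) * trim)
    body = r[cut:len(r) - cut]               # drop top and bottom 5%
    mid_mean = body.mean()
    med = np.median(body)
    upper_semi = body[body > med].mean()     # mid-mean of values above the median
    lower_semi = body[body < med].mean()     # mid-mean of values below the median
    upper = mid_mean + k * (upper_semi - mid_mean)
    lower = mid_mean - k * (mid_mean - lower_semi)
    # Widen, if necessary, so that unchanged prices (ratio 1) are acceptable.
    return min(lower, 1.0), max(upper, 1.0)

ratios = [0.98, 1.02, 1.03, 0.99, 1.05, 1.01, 0.97, 1.04, 1.60, 1.02, 0.96]
lo, hi = tukey_limits(ratios)
flagged = [x for x in ratios if not lo <= x <= hi]
print(lo, hi, flagged)
```

With these example ratios, only the 60% price increase falls outside the Tukey limits and is flagged for checking.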

Tukey’s algorithm produces limits that are intuitively reasonable, consistent from month to month, robust in the presence of outliers (in other words, adding one or two rogue observations does not affect the limits set by the algorithm very much) and robust as data volume changes (i.e., limits calculated from a subset of the data do not vary much from those calculated on the full data set). This is a more formal statistical procedure for outlier detection and seems to be adequate. It is easy to apply and may be used in several other statistical projects.

4.3 Some general procedures applied in Eurostat

As in Member States, statistical analysis for the identification of problems, and in particular for outlier detection, is not usual practice in Eurostat. In fact, just as in MS, empirical (numerical) error detection procedures are used for data editing, namely the construction of admissibility intervals (for quantitative variables). When a given value falls outside such an interval, a warning is issued, leading to an appropriate measure such as deletion or correction. Moreover, some statistics are also used to detect errors in the data, such as the mean, the variance or the coefficient of variation. For example, a large variance or coefficient of variation may be caused by outliers or, at least, by extreme values that have to be checked and,


if necessary, corrected. Labour Force Statistics and Foreign Trade Statistics are examples of the application of these procedures, in a similar fashion to what is done in Member States.

4.3.1 Community Innovation Survey

In this survey, Eurostat receives the micro-data from MS and outlier detection is based on the (statistical) definition of a moderate outlier shown above, i.e., values lower than $Q_1 - 1.5\,IQR$ or higher than $Q_3 + 1.5\,IQR$ are considered outliers (Q1 and Q3 are the first and the third quartiles respectively and $IQR = Q_3 - Q_1$ is the interquartile range).
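As an illustration, the moderate-outlier rule just described, together with the severe-outlier variant using a multiplier of 3 mentioned in section 4.4.2.1 below, can be sketched as follows. This is a generic Python/NumPy illustration with invented values, not the production routine of the Community Innovation Survey.

```python
import numpy as np

def classify_outliers(x, moderate=1.5, severe=3.0):
    """Flag values outside Q1 - k*IQR and Q3 + k*IQR for k = 1.5 (moderate)
    and k = 3 (severe)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mod = (x < q1 - moderate * iqr) | (x > q3 + moderate * iqr)
    sev = (x < q1 - severe * iqr) | (x > q3 + severe * iqr)
    return mod & ~sev, sev   # moderate-only flags, severe flags

values = np.array([3.1, 2.9, 3.4, 3.0, 3.2, 5.1, 9.7, 3.3])
moderate_flags, severe_flags = classify_outliers(values)
print(values[moderate_flags], values[severe_flags])
```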

4.4 Guidance on advanced validation

4.4.1 Stages of advanced validation

As for the previous two components of validation, there are also three stages for advanced validation.

• The first stage concerns micro data and should be performed by Member States at the end of the collection stage, because they are responsible for conducting the surveys, even when Eurostat receives this type of data. It is absolutely crucial that statistical data analysis, and in particular outlier detection and correction, is carried out at this early stage of the production process. In fact, it is still possible at this stage, and only at this stage, to correct possible problems through contact with the respondents without jeopardizing the deadlines leading up to dissemination. Respondents can confirm or change their information, and this is in many cases the best way to solve problems that may arise.

• The second stage concerns country data, i.e., the micro-data country aggregates. If a Member State leaves problems in these data undetected, in particular outliers, Eurostat has to filter and clean up the information received from MS.

• The third stage concerns aggregate (Eurostat) data before their dissemination. The country data sets used for aggregation should ideally be complete and free from errors, but there might still be a need for adjustment, and therefore a final examination of the aggregates, in particular concerning outlier detection, has to be performed before the information is sent for dissemination.

The statistical methods and tests discussed above can be used in any of the three validation stages, i.e., the methods are valid for any data type. Nevertheless, some methods are better suited than others, depending on the type of data and the level of aggregation.

4.4.2 Micro data

At the data collection stage, the detection and correction of problems should be carried out by Member States, even when Eurostat receives the micro data. Advanced validation should be run according to the following main procedures and tests, whose results must always lead to a decision about what steps to take to correct the problems found.

4.4.2.1 Advanced detection of problems

Being more elaborate and careful, advanced detection can uncover problems that are left undetected by other methods and procedures. We next discuss the main steps to carefully scrutinize the data.


• Data mining – Examination of the main characteristics of the data, based on graphical displays and on numerical measures and coefficients, will most likely uncover the majority of the problems the data may have. With any statistical software, a standardized output can easily be constructed so that it is only necessary to input the data (by typing or transferring them) and the set of displays, numerical measures and flags for extreme values is automatically produced for analysis (a minimal sketch of such a standardized screening output is given after this list). This is a very simple but extremely powerful approach and it is highly recommended. It should be used with every variable included in any survey. For example, in foreign trade it should be used with the data on exports and imports, on their country and product breakdown, or on unit prices; for industrial output, it should be applied to the total output and to its breakdown by sector or by product.

• Detection of outliers – Empirical rules such as those mentioned in the above examples and actually used by MS in data checking should be avoided and replaced by more sophisticated and accurate statistical methods, namely:
  – Classification of outliers – A simple procedure is the classification of moderate or severe outliers described above, possibly changing the values of 1.5 or 3 that multiply the IQR according to the specific data at hand.

– Tukey’s algorithm – This is a similar and more elaborate approach, combining robust estimation methods with discordancy testing. Some of the parameters used may be changed according to the data. This is definitely a good procedure and very easy to use.

– Statistical tests – The tests described above should always be performed even if other procedures are also applied because they can be more powerful. Nevertheless, they are also very easy and simple to apply. More than one test may be used and it is likely that some tests may be more adequate for a given data set than others. Pre-testing and simulation experiments should be conducted for any given project. Eurostat should play an important role in the harmonization, development and implementation of a common battery of tests in Member States.

– Cluster analysis – As mentioned above, this is a powerful tool for outlier identification in multivariate data. It is also simple to use and is included in most statistical software. Data classification can also reveal several other features, possibly exposing other problems.

• Time series – A different perspective is adopted here. In fact, the analysis of the behaviour of the data through time may show some problems more clearly (outlier detection is one of them) or even show new problems. The only drawback is that usually time series models and methods are not automatic and not so simple to apply, requiring specific knowledge and the analyst’s intervention. This may cause some difficulties in our context, because of the large number of data sets and variables, i.e., the large number of time series, and the tight deadlines imposed. Thus, it is recommended that this approach is left for later stages of the validation process and is only used at the micro level in some special cases where strictly necessary.
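As referred to in the data mining bullet above, a standardized screening output can be produced automatically for every variable of a survey. The sketch below uses Python with pandas (an assumption of this illustration; any statistical package offers equivalent facilities) to print summary measures, missing-value counts and the number of values flagged by the moderate-outlier rule for each numeric variable; the variable names and figures are hypothetical.

```python
import pandas as pd

def screening_report(df):
    """Standardized per-variable screening output: summary statistics,
    missing counts and counts of values flagged by the 1.5*IQR rule."""
    rows = []
    for col in df.select_dtypes("number").columns:
        s = df[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        flagged = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
        rows.append({"variable": col, "n": s.count(), "missing": s.isna().sum(),
                     "mean": s.mean(), "std": s.std(), "min": s.min(),
                     "max": s.max(), "cv": s.std() / s.mean(),
                     "flagged": flagged})
    return pd.DataFrame(rows)

# Hypothetical micro-data with turnover and employment per respondent.
df = pd.DataFrame({"turnover":  [120, 135, 128, 5400, 131, None, 127],
                   "employees": [10, 12, 11, 13, 400, 12, 11]})
print(screening_report(df))
```

A report of this kind, produced routinely for every incoming data set, immediately draws attention to variables with missing values, implausible ranges or large coefficients of variation, which can then be examined with the more formal methods listed above.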

4.4.2.2 Error correction

When errors or other problems are detected in the micro data, they have to be corrected by Member States, even when Eurostat receives such data. It is very important that the correction is made at this early stage, i.e., at the micro-data level and as soon as possible after data collection. The main steps that should be adopted for this purpose are the following.

• Correction of data entry errors


• Contact with the respondents – The best way to solve the detected problems is clearly to contact the respondents to correct or confirm the values provided. Therefore, this is the first step and great effort should be put into it. Alternative approaches are second-best solutions and should only be considered when this fails.

• Imputation of missing values – If the previous step fails, or does not lead to a solution in time, the values requiring correction have to be discarded from the data set, generating missing values. For example, it is not possible to leave in the data set an observation that has previously been considered an outlier. Consequently, those missing values will have to be imputed with the methods of section 3.

It is then obvious why Member States should be in charge of advanced validation of micro data: contacting the respondents is clearly their task, and they can also perform imputation at this level more efficiently and with better quality. It is also more expedient, which is very important because of the dissemination deadlines.

4.4.3 Country data

Country aggregates received by Eurostat should already have been validated at the micro level by the national statistical organizations. Nevertheless, some errors or problems can only be detected when data from the different countries are combined, and therefore Eurostat should filter and solve those problems, consulting the country involved when possible.

The methods of advanced validation are the same as for micro data and therefore will not be repeated here, but it is important to note that they are easier to apply because the number of data points is much smaller: for each variable, the number of observations is the number of countries, while for micro data it is the number of respondents. Statistical analysis is therefore an easier task, which also means that it can be even more careful and pay attention to aspects that may have been ignored or overlooked in the previous stage because of the size of the data sets. For example, several of the plots mentioned above are easier to analyse and the number of possible outliers is much smaller. Moreover, time series analysis is now more manageable and consequently becomes a very powerful tool for data analysis and problem detection. In particular, ARIMA modelling, fitting decomposition models, and outlier testing and accommodation are easier and should in fact be tried.

When errors such as outliers are found in country data sets, Eurostat has to correct them, possibly after discussion with the national statistical organization involved, always keeping in mind the deadlines for dissemination. Discarding country data is not adequate because it would generate non-available values, providing no information on the country (or countries) concerned and preventing the computation of Eurostat aggregates. Consequently, imputation of those values is required, either with the methods of section 3 or with time series modelling, which can be extremely powerful and useful for this purpose because it can predict the missing observations with good accuracy. This approach is strongly recommended and should be applied (a minimal sketch is given below).
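The time-series imputation of a missing country value can be as simple as extrapolating the country's own recent history. The sketch below fits a linear trend to past annual values with NumPy and uses the one-step-ahead prediction as the imputed value; the figures and the choice of a linear trend are illustrative assumptions, and in practice an ARIMA or decomposition model may be preferred.

```python
import numpy as np

def impute_next(history):
    """Impute the next (missing) value of a short country time series
    by extrapolating a fitted linear trend one step ahead."""
    y = np.asarray(history, dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)   # least-squares linear trend
    return intercept + slope * len(y)

# Hypothetical annual country aggregate with the latest year missing.
past_values = [101.2, 103.5, 104.9, 107.1, 108.8]
print(round(impute_next(past_values), 1))
```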


4.4.4 Aggregate (Eurostat) data

Some problems may become apparent only when the national data sets are aggregated, although the previous two stages of advanced validation should leave no or very few errors uncorrected. For example, an extreme value may be obtained for the aggregate of a given geographical zone resulting from the combination of values from several countries that are very high (or low) but not clearly extreme, and which therefore passed unnoticed at the two previous validation stages, particularly at the country level. The errors found have to be corrected at the country level, so we are back at the previous stage. In particular, imputation may be required if correction is not possible in time for dissemination. After this final validation stage is complete, country and Eurostat (aggregated) data are ready for dissemination.

4.4.5 Concluding remarks

The objective of advanced validation is to detect problems and errors left undiscovered by other procedures and checks. The statistical methods included here are more elaborate, have good properties and have shown satisfactory performance in applied analysis. These methods can be used at any of the three validation stages: micro-data level, country level and aggregate (Eurostat) level. However, time series analysis may not be manageable at the first stage and is thus recommended for the other two, where it is a very powerful tool, although it requires moderate or large sample sizes. Moreover, the first stage should be carried out by Member States and the other two by Eurostat.

It is very important that error detection and correction (particularly concerning outliers) is performed at the earliest possible stage; otherwise those problems may be amplified at later stages and have a serious negative impact on the quality of the data. The earlier the stage, the more accurate and efficient the correction can be, bringing substantial advantages for the timely dissemination of the data. When this process is complete, the data are hopefully error-free, especially free of outliers, with a significant improvement in the quality of the published statistical information.

The performance of advanced validation methods can be assessed by comparing the corrected values with the corresponding revised data obtained later; the detection and correction of outliers is especially relevant here. To this end, accuracy measures such as the mean squared error may be calculated. It is also important to keep a record of the errors, and in particular of the outliers detected, in order to identify their sources and prevent the problems causing them in the future.
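For example, the comparison between the values imputed or corrected at validation time and the revised figures published later can be summarized by the mean squared error. A minimal sketch follows; the figures are invented for illustration.

```python
import numpy as np

# Values imputed/corrected during validation vs. the revised values
# that became available later (hypothetical figures).
corrected = np.array([104.2, 98.7, 110.5, 101.3])
revised   = np.array([103.8, 99.1, 112.0, 101.0])

mse = np.mean((corrected - revised) ** 2)
print(round(mse, 3))   # smaller values indicate better validation accuracy
```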


REFERENCES

Barnett, V. and Lewis, T. (1995). Outliers in Statistical Data. 3rd ed., John Wiley and Sons, Chichester, West Sussex.

Eurostat Internal Document (2000). Imputation – Overview of Methods with Examples of Procedures Used in Eurostat. Unit A4 (Research and Development, Methodology and Data Analysis), Eurostat, Luxembourg.

Everitt, B.S., Landau, S. and Leese, M. (2001). Cluster Analysis. Arnold, London.

Fellegi, I.P. and Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.

Heiberger, R.M. and Holland, B. (2004). Statistical Analysis and Data Display. Springer, New York.

Lehtonen, R. and Pahkinen, E. (2004). Practical Methods for Design and Analysis of Complex Surveys. 2nd ed., John Wiley and Sons, Chichester, West Sussex.

Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. 2nd ed., John Wiley and Sons, Hoboken, NJ.

Makridakis, S.G., Wheelwright, S.C. and Hyndman, R.J. (1998). Forecasting: Methods and Applications. 2nd ed., John Wiley and Sons, New York.

Pena, D., Tiao, G.C. and Tsay, R.S. (2001). A Course in Time Series Analysis. John Wiley and Sons, New York.

Rubin, D.B. (2004). Multiple Imputation for Nonresponse in Surveys. 2nd ed., John Wiley and Sons, Hoboken, NJ.

Wei, W.W.S. (2006). Time Series Analysis – Univariate and Multivariate Methods. 2nd ed., Addison-Wesley, New York.

Wilkinson, L. (2005). The Grammar of Graphics. Springer, New York.