
  • PRACTICAL GUIDE TO DATA VALIDATION IN EUROSTAT

    EUROSTAT
    Statistical Office of the European Communities

  • Handbook on Data Validation in Eurostat

    TABLE OF CONTENTS

    1. Introduction
    2. Data editing
      2.1 Literature review
      2.2 Main general procedures adopted in Member States
        2.2.1 Foreign Trade
        2.2.2 Industrial Output
        2.2.3 Commercial Companies Survey
        2.2.4 Employment Survey
        2.2.5 Private Sector Statistics on Earnings
        2.2.6 Survey of Business Owners
        2.2.7 Building Permits Survey
      2.3 Main general procedures adopted in Eurostat
        2.3.1 Harmonization of national data
        2.3.2 Corrections using data from the same Member State
        2.3.3 Corrections using data from other Member States
        2.3.4 Foreign Trade
        2.3.5 Transport Statistics
        2.3.6 Labour Force Survey
        2.3.7 Eurofarm
      2.4 Guidelines for data editing
        2.4.1 Stages of data editing
        2.4.2 Micro data
          2.4.2.1 Error detection
          2.4.2.2 Error correction
        2.4.3 Country data
          2.4.3.1 Error detection
          2.4.3.2 Error correction
        2.4.4 Aggregate (Eurostat) data
        2.4.5 Concluding remarks
    3. Missing data and imputation
      3.1 Literature review
        3.1.1 Single imputation methods
          3.1.1.1 Explicit modelling
            3.1.1.1.1 Mean imputation
            3.1.1.1.2 Regression imputation
          3.1.1.2 Implicit modelling
            3.1.1.2.1 Hot deck imputation
            3.1.1.2.2 Substitution
            3.1.1.2.3 Cold deck imputation
            3.1.1.2.4 Composite methods
        3.1.2 Multiple imputation methods
      3.2 Main general procedures adopted in Member States
        3.2.1 Foreign Trade
        3.2.2 Industrial Output
        3.2.3 Commercial Companies Survey
        3.2.4 Employment Survey
        3.2.5 Annual Survey of Hours and Earnings
        3.2.6 Survey of Business Owners
        3.2.7 Building Permits Survey
        3.2.8 Housing Rents Survey
        3.2.9 Basic Monthly Survey
      3.3 Main general procedures adopted in Eurostat
        3.3.1 Community Innovation Survey
        3.3.2 Continuing Vocational Training Survey
        3.3.3 European Community Household Panel
      3.4 Guidelines for data imputation
        3.4.1 Stages of imputation of missing data
        3.4.2 Micro data
        3.4.3 Country data
        3.4.4 Aggregate (Eurostat) data
        3.4.5 Concluding remarks
    4. Advanced validation
      4.1 Literature review
        4.1.1 Strategies for handling outliers
        4.1.2 Testing for discordancy
          4.1.2.1 Exploratory data analysis
          4.1.2.2 Statistical testing for outliers
            4.1.2.2.1 Single outlier tests
            4.1.2.2.2 Multiple outlier tests
          4.1.2.3 Multivariate data
        4.1.3 Methods of accommodation
          4.1.3.1 Estimation of location
          4.1.3.2 Estimation of dispersion
        4.1.4 Time series analysis
      4.2 Main general procedures adopted in Member States
        4.2.1 Foreign Trade
        4.2.2 Consumer Price Index
      4.3 Main general procedures adopted in Eurostat
        4.3.1 Community Innovation Survey
      4.4 Guidelines for advanced validation
        4.4.1 Stages of advanced validation
        4.4.2 Micro data
          4.4.2.1 Advanced detection of problems
          4.4.2.2 Error correction
        4.4.3 Country data
        4.4.4 Aggregate (Eurostat) data
        4.4.5 Concluding remarks
    References


    1. INTRODUCTION

    A main goal of any statistical organization is the dissemination of high-quality information, and this is particularly true of Eurostat. Quality means that the data available to users satisfy their needs and requirements concerning statistical information. It is defined in a multidimensional way involving six criteria: Relevance; Accuracy; Timeliness and punctuality; Accessibility and clarity; Comparability; and Coherence.

    Broadly speaking, data validation may be defined as supporting all the other steps of the data production process in order to improve the quality of statistical information. In the Handbook on improving quality by analysis of process variables (LEG on Quality project by ONS UK, Statistics Sweden, National Statistical Service of Greece, and INE PT) it is described as the method of detecting errors resulting from data collection. In short, it is designed to check the plausibility of the data and to correct possible errors. It is one of the most complex operations in the life cycle of statistical data, and includes steps and procedures of two main categories: checks (or edits) and transformations (or imputations). Its three main components are the following:

    Data editing: The application of checks that identify missing, invalid or inconsistent entries or that point to data records that are potentially in error.

    Missing data and imputation: Analysis of imputation and reweighting methods used to correct for missing data caused by non-response. Non-response can be total, when there is no information on a given respondent (unit non-response), or partial, when only part of the information on the respondent is missing (item non-response). Imputation is a procedure used to estimate and replace missing or inconsistent (unusable) data items in order to provide a complete data set.

    Advanced validation: Advanced statistical methods can be used to improve data quality. Many of them are related to outlier detection, since the conclusions and inferences obtained from a data set contaminated by outliers may be seriously biased.
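    As a rough illustration of the three components above, the following Python sketch applies a few edit checks, a simple outlier flag and a mean imputation to a toy list of records. The field names (`employees`, `turnover`), the edit rules, the median-based threshold `k` and the choice to set flagged outliers aside before imputing are all assumptions made for this example; none of them is prescribed by the handbook.

```python
from statistics import mean, median

# Toy micro data; field names and values are invented for illustration.
records = [
    {"id": 1, "employees": 10, "turnover": 120.0},
    {"id": 2, "employees": 12, "turnover": None},    # item non-response
    {"id": 3, "employees": -4, "turnover": 95.0},    # invalid entry
    {"id": 4, "employees": 0, "turnover": 110.0},    # inconsistent entry
    {"id": 5, "employees": 11, "turnover": 9000.0},  # suspiciously large
    {"id": 6, "employees": 9, "turnover": 105.0},
]


def edit_checks(rec):
    """Data editing: return the names of the (assumed) checks a record fails."""
    failed = []
    if rec["turnover"] is None:
        failed.append("missing turnover")
    if rec["employees"] < 0:
        failed.append("invalid employees")
    if rec["employees"] == 0 and (rec["turnover"] or 0) > 0:
        failed.append("inconsistent: turnover without employees")
    return failed


def flag_outliers(recs, field, k=5.0):
    """Advanced validation: flag values more than k MADs away from the median."""
    values = [r[field] for r in recs if r[field] is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    return [r["id"] for r in recs
            if r[field] is not None and mad > 0 and abs(r[field] - med) > k * mad]


def impute_mean(recs, field):
    """Imputation: replace missing values of `field` by the observed mean."""
    fill = mean(r[field] for r in recs if r[field] is not None)
    return [dict(r, **{field: fill}) if r[field] is None else dict(r) for r in recs]


errors = {r["id"]: edit_checks(r) for r in records if edit_checks(r)}
outliers = flag_outliers(records, "turnover")
# Assumed design choice: exclude flagged outliers so they do not distort the mean.
completed = impute_mean([r for r in records if r["id"] not in outliers], "turnover")
```

    In this sketch the checks only flag records; how a flagged record is then corrected is a separate decision that depends on the survey.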

    Before Eurostat dissemination, data validation has to be performed at different stages depending on who is processing the data:

    The first stage is at the end of the collection phase and concerns micro data. Member States are responsible for it, since they conduct the surveys.

    The second stage concerns country data, i.e., the country aggregates of micro data sent by Member States to Eurostat. Validation at this stage has to be performed by Eurostat.

    The third and last stage concerns aggregate (Eurostat) data before their dissemination, and it is also performed by Eurostat.
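    The three stages can be pictured as a small pipeline. The Python sketch below is only a schematic under stated assumptions: the country codes, the values, the range check, the comparison with a previous period and the completeness check are all hypothetical and stand in for whatever rules a specific survey would define.

```python
# Hypothetical micro data held by two Member States (values invented).
micro = {
    "PT": [12.0, 15.5, 9.5],
    "SE": [20.0, 18.0, -3.0],  # the negative value is treated as an error here
}


def validate_micro(values):
    """Stage 1 (Member State): keep only records passing an assumed range check."""
    return [v for v in values if v >= 0]


def validate_country(total, previous, tolerance=0.5):
    """Stage 2 (Eurostat): accept a country total close to the previous period."""
    return abs(total - previous) <= tolerance * previous


def validate_aggregate(country_totals, expected_countries):
    """Stage 3 (Eurostat): every expected Member State must be present."""
    return set(country_totals) == set(expected_countries)


# Stage 1: each Member State validates its micro data, then aggregates it.
country_totals = {c: sum(validate_micro(v)) for c, v in micro.items()}

# Stage 2: Eurostat checks the country data it receives.
previous_totals = {"PT": 35.0, "SE": 40.0}  # assumed figures from an earlier period
country_ok = {c: validate_country(t, previous_totals[c])
              for c, t in country_totals.items()}

# Stage 3: Eurostat checks the inputs to the aggregate before dissemination.
aggregate_ok = validate_aggregate(country_totals, ["PT", "SE"])
aggregate = sum(country_totals.values())
```

    Dropping the failing micro record is the simplest possible correction and is used here only to keep the example short.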

    Validation should be performed according to a set of common and specific rules, depending on the stage and on the data aggregation level. In this document, some general and common guidance is provided for each stage. More detailed rules and procedures can only be provided when looking at a specific survey, since each one has its own particular characteristics and problems. A thorough set of validation guidelines can therefore only be defined for a specific statistical project. Nevertheless, this document intends to discuss the most important issues that arise concerning validation of any statistical data set, describing its main problems and how to handle them. It lists as thoroughly as possible the different aspects that need to be analyzed for error diagnosis and checking, the most adequate methods and procedures for that purpose and finally possible ways to correct the errors found. It should be seen as an introduction to data validation and should provide references to further reading for any statistician or staff member of a statistical organization working on this matter. That is, being the general starting point for data validation, this document may be applied and adapted to any partic