Post on 20-Jul-2020


Sander Scholtus, Bart Bakker, Sam Robinson

Evaluating the quality of business survey data before and after automatic editing


Structural Business Statistics (SBS)

Automatic editing in the Netherlands’ SBS:

unedited data → [IP] → input data → [DP1] → data after DP1 → [EL/I] → data after EL/I → [DP2] → edited data

– IP: Input Processing
‐ technical checks
‐ correction of uniform 1,000-errors
– DP1: Deductive Processing 1
‐ IF-THEN rules to solve common errors with a known cause
– EL/I: Error Localisation and Imputation
‐ automatic error localisation based on Fellegi-Holt paradigm
‐ imputation of missing and discarded values
– DP2: Deductive Processing 2
‐ IF-THEN rules to resolve inconsistencies that cannot be handled in EL/I

manual editing


Data

– SBS 2012
‐ Four industries in car trade (NACE G45)
‐ Focus on total turnover
– Linked to two administrative sources
‐ Value-Added Tax declarations (VAT)
‐ Profit Declaration Register (PDR)

industry (NACE code)                                45112    45190    45200    45400
population (total)                                  18 680   1 790    6 054    1 763
population (w/o large/complex units)                18 556   1 739    6 018    1 759
SBS net sample, edited                              914      165      269      74
SBS net sample, edited and linked to admin. data    810      158      231      58


Intermittent-error model

– Extension of model by Guarnera & Varriale (2015, 2016)
‐ Multiple measurements $y_1, \dots, y_K$ of target variable $\eta$
‐ For each observed variable $y_k$ and unit $i$:

$$y_{ki} = \begin{cases} \eta_i & \text{if } z_{ki} = 0 \\ \tau_k + \lambda_k \eta_i + e_{ki} & \text{if } z_{ki} = 1 \end{cases}$$

with $P(z_{ki} = 1) = \pi_k$ and $e_{ki} \sim N(0, \sigma_k^2)$ (independent across $k$)
‐ Systematic bias in $y_k$ occurs when $\tau_k \neq 0$ and/or $\lambda_k \neq 1$
‐ Linear regression to describe variation in $\eta_i$ across units as a function of covariates:

$$\eta_i = \boldsymbol{\beta}' \mathbf{x}_i + u_i, \qquad u_i \sim N(0, \sigma^2)$$
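To make the data-generating mechanism concrete, the following Python sketch draws units from the intermittent-error model as stated on this slide: a true value from the linear regression, and for each source either an exact copy of the truth or a linearly distorted value with normal noise. All parameter values, the covariate distribution and the function name `simulate_unit` are hypothetical and chosen for illustration only; they are not estimates from the SBS application.

```python
import random

def simulate_unit(rng, x, beta, sigma, pi, tau, lam, sigma_k):
    """Draw (eta, y, z) for one unit under the intermittent-error model."""
    # True value: linear regression on covariates plus a normal disturbance.
    eta = sum(b * xi for b, xi in zip(beta, x)) + rng.gauss(0.0, sigma)
    y, z = [], []
    for k in range(len(pi)):
        z_k = 1 if rng.random() < pi[k] else 0  # error indicator, P(z_k = 1) = pi_k
        if z_k == 0:
            y_k = eta  # error-free: observed value equals the true value exactly
        else:
            y_k = tau[k] + lam[k] * eta + rng.gauss(0.0, sigma_k[k])
        y.append(y_k)
        z.append(z_k)
    return eta, y, z

# Hypothetical parameter values for K = 3 sources (standard deviations, not variances).
params = dict(
    beta=[4.0, 0.9], sigma=0.5,
    pi=[0.1, 0.9, 0.6], tau=[0.5, -0.2, 0.1],
    lam=[0.8, 1.0, 1.0], sigma_k=[1.0, 0.2, 0.3],
)
rng = random.Random(2020)
sample = [
    simulate_unit(rng, [1.0, rng.uniform(0.0, 3.0)], **params)
    for _ in range(20000)
]
# Share of erroneous values per source; should be close to pi.
error_rates = [sum(z[k] for _, _, z in sample) / len(sample) for k in range(3)]
```

Simulated data of this kind can be used to check that an estimation routine recovers known parameter values before it is applied to real data.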


Intermittent-error model

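– Quality indicators for variable $y_k$
‐ Error probability: $\pi_k = P(z_k = 1)$
‐ Intercept and slope parameters (bias): $\tau_k$ and $\lambda_k - 1$
‐ Indicator validity coefficient: absolute correlation of $y_k$ and $\eta$

• For error-prone records:

$$\mathrm{IVC}(y_k \mid z_k = 1) = \left|\lambda_k^{(s)}\right| = \cdots = \sqrt{1 - \frac{\sigma_k^2}{\lambda_k^2 \sigma_\eta^2 + \sigma_k^2}}, \qquad \sigma_\eta^2 = \boldsymbol{\beta}' \boldsymbol{\Sigma}_{xx} \boldsymbol{\beta} + \sigma^2$$

• For error-free records:

$$\mathrm{IVC}(y_k \mid z_k = 0) = 1$$

• Overall:

$$\mathrm{IVC}(y_k) = \pi_k \, \mathrm{IVC}(y_k \mid z_k = 1) + (1 - \pi_k) \, \mathrm{IVC}(y_k \mid z_k = 0)$$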
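The validity indicators can be computed directly from the model parameters. The sketch below assumes the reading that the indicator validity coefficient is the absolute correlation between $y_k$ and $\eta$, so that for error-prone records it equals $\sqrt{1 - \sigma_k^2 / (\lambda_k^2 \sigma_\eta^2 + \sigma_k^2)}$, and that the overall coefficient weights the error-prone and error-free cases by $\pi_k$ and $1 - \pi_k$. The function names and the variance values in the first call are hypothetical.

```python
import math

def ivc_given_error(lam, sigma_eta2, sigma_k2):
    """Absolute correlation between y_k and eta for records with z_k = 1."""
    return math.sqrt(1.0 - sigma_k2 / (lam ** 2 * sigma_eta2 + sigma_k2))

def ivc_overall(pi, ivc_err):
    """Mix the error-prone IVC with the error-free IVC, which equals 1."""
    return pi * ivc_err + (1.0 - pi) * 1.0

# Hypothetical variances, for illustration only.
ivc_err = ivc_given_error(lam=0.8, sigma_eta2=2.0, sigma_k2=1.0)

# Error probability 0.10 combined with an error-prone IVC of 0.70.
overall = ivc_overall(pi=0.10, ivc_err=0.70)
```

With an error probability of 0.10 and an error-prone IVC of 0.70, the values reported for SBS (input) in NACE 45112, the mixture gives 0.97, which agrees with the overall IVC shown in the results for that industry.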


Intermittent-error model

– Under the intermittent-error model…
‐ …the event $y_{ki} = \eta_i$ occurs with probability $1 - \pi_k$
‐ …the event $y_{ki} = y_{li}$ occurs with probability $(1 - \pi_k)(1 - \pi_l)$, and in that case $y_{ki} = y_{li} = \eta_i$

– Some $\eta_i$ and $z_{1i}, \dots, z_{Ki}$ can be derived from $y_{1i}, \dots, y_{Ki}$
‐ Example: possible error patterns $(z_{1i}, z_{2i}, z_{3i})$ for $K = 3$:

(0,0,0) (0,0,1) (0,1,0) (1,0,0) (0,1,1) (1,0,1) (1,1,0) (1,1,1)

id   $y_1$   $y_2$   $y_3$   $\eta$   $z_1$   $z_2$   $z_3$
1    56      56      38      56       0       0       1
2    113     97      113     113      0       1       0
3    148     251     199     ?        ?       ?       ?

Possible solutions for id 3:

$\eta$   $z_1$   $z_2$   $z_3$
148      0       1       1
251      1       0       1
199      1       1       0
?        1       1       1
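The deduction in the example table can be written as a small enumeration. The Python sketch below lists, for one unit, every error pattern that is consistent with the observed values, assuming (as the model implies) that error-free sources report exactly the true value and that an erroneous value coincides with the true value or with another erroneous value with probability zero. The function name `candidate_patterns` is hypothetical, and exact equality is only appropriate for values compared as reported, not for rounded or transformed floats.

```python
from itertools import product

def candidate_patterns(y):
    """Return (eta, z) pairs consistent with the observed values y.

    eta is None when the pattern leaves the true value undetermined.
    """
    K = len(y)
    candidates = []
    for z in product((0, 1), repeat=K):
        clean = [y[k] for k in range(K) if z[k] == 0]
        wrong = [y[k] for k in range(K) if z[k] == 1]
        if len(set(clean)) > 1:
            continue  # error-free sources must all report the same value
        eta = clean[0] if clean else None
        if len(set(wrong)) < len(wrong):
            continue  # two erroneous values are never equal (probability zero)
        if eta is not None and eta in wrong:
            continue  # an erroneous value never equals the true value
        candidates.append((eta, z))
    return candidates

unit_1 = candidate_patterns([56, 56, 38])     # one pattern: eta = 56, z = (0, 0, 1)
unit_2 = candidate_patterns([113, 97, 113])   # one pattern: eta = 113, z = (0, 1, 0)
unit_3 = candidate_patterns([148, 251, 199])  # four patterns remain
```

Units with at least two equal values are resolved completely, while units whose values all differ keep four candidate patterns, which is where a probabilistic treatment is needed.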


Intermittent-error model

– Model estimated by maximum likelihood
‐ Would be straightforward if all $\eta_i$ and $z_{1i}, \dots, z_{Ki}$ were known
– Use Expectation-Maximisation (EM) algorithm
‐ Implemented in R for $K = 3$
– For units with unknown error patterns, we also obtain:
‐ Expected true value $\hat{\eta}_i$, given the observed values $y_{1i}, y_{2i}, y_{3i}$
‐ Posterior probabilities of error patterns, given $y_{1i}, y_{2i}, y_{3i}$

id   $y_1$   $y_2$   $y_3$   $\eta$   $z_1$   $z_2$   $z_3$   $P$
3    148     251     199     148      0       1       1       0.55
                             251      1       0       1       0.01
                             199      1       1       0       0.10
                             172      1       1       1       0.34
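For a unit whose three observed values all differ, the posterior probabilities of the four remaining error patterns follow from Bayes' rule once the model parameters are given. The Python sketch below shows one way to carry out this step under the normal model with independent errors; it is not the authors' R implementation, and all parameter values, as well as the function names `log_marginal_all_errors` and `pattern_posteriors`, are hypothetical. The all-errors pattern uses the closed-form integral over the unknown true value, which also yields the conditional mean of the true value under that pattern.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_marginal_all_errors(y, mu, sigma2, tau, lam, sig2):
    """Log density of y when every source is in error, with eta integrated out.

    Also returns E(eta | y) under this pattern.
    """
    K = len(y)
    a = 1.0 / sigma2 + sum(lam[k] ** 2 / sig2[k] for k in range(K))
    b = mu / sigma2 + sum(lam[k] * (y[k] - tau[k]) / sig2[k] for k in range(K))
    c = mu ** 2 / sigma2 + sum((y[k] - tau[k]) ** 2 / sig2[k] for k in range(K))
    log_det = math.log(sigma2) + sum(math.log(v) for v in sig2) + math.log(a)
    log_dens = -0.5 * (K * math.log(2.0 * math.pi) + log_det + c - b * b / a)
    return log_dens, b / a

def pattern_posteriors(y, mu, sigma2, pi, tau, lam, sig2):
    """Posterior over error patterns for a unit whose observed values all differ."""
    K = len(y)
    rows = []  # (pattern, eta under the pattern, log weight)
    for k0 in range(K):  # patterns with exactly one error-free source k0
        lw = math.log(1.0 - pi[k0]) + normal_logpdf(y[k0], mu, sigma2)
        for k in range(K):
            if k != k0:
                mean_k = tau[k] + lam[k] * y[k0]
                lw += math.log(pi[k]) + normal_logpdf(y[k], mean_k, sig2[k])
        rows.append((tuple(int(k != k0) for k in range(K)), y[k0], lw))
    log_dens, eta_all = log_marginal_all_errors(y, mu, sigma2, tau, lam, sig2)
    rows.append(((1,) * K, eta_all, sum(math.log(p) for p in pi) + log_dens))
    top = max(lw for _, _, lw in rows)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - top) for _, _, lw in rows]
    total = sum(weights)
    return [(z, eta, w / total) for (z, eta, _), w in zip(rows, weights)]

# Hypothetical parameters on the log scale; mu stands for beta'x of this unit.
tau, lam, sig2 = [0.5, -0.2, 0.1], [0.8, 1.0, 1.0], [1.0, 0.04, 0.09]
y = [math.log(148.0), math.log(251.0), math.log(199.0)]
result = pattern_posteriors(y, mu=5.0, sigma2=1.0, pi=[0.1, 0.9, 0.6],
                            tau=tau, lam=lam, sig2=sig2)
eta_expected = sum(eta * p for _, eta, p in result)  # posterior expected true value
```

Because the parameter values are invented, the resulting probabilities are not expected to match those shown on the slide. In an EM algorithm, a computation of this kind would serve as the expectation step, with the parameters re-estimated from the weighted patterns afterwards.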


Application

– Target variable ($\eta$): total turnover
– Observed variables:
‐ SBS turnover ($y_{SBS}$)
‐ VAT turnover ($y_{VAT}$)
‐ PDR turnover ($y_{PDR}$)
– Separate models for $y_{SBS}$ before/after automatic editing
– Covariates ($\mathbf{x}$):
‐ constant term
‐ SBS number of employees (edited)
‐ SBS total operating costs (edited)
– Logarithmic transformation applied to all variables


Results

NACE 45112

parameter            SBS (input)     VAT             PDR
$\pi_k$              0.10 (0.01)     0.91 (0.01)     0.64 (0.02)
$\tau_k$             0.68 (0.68)     −0.26 (0.06)    0.04 (0.11)
$\lambda_k$          0.80 (0.11)     1.00 (0.01)     0.99 (0.01)
IVC | $z_k = 1$      0.70            0.98            0.96
IVC                  0.97            0.98            0.97

parameter            SBS (edited)    VAT             PDR
$\pi_k$              0.11 (0.02)     0.91 (0.01)     0.64 (0.02)
$\tau_k$             −0.30 (0.49)    −0.21 (0.06)    0.11 (0.12)
$\lambda_k$          0.98 (0.07)     0.99 (0.01)     0.98 (0.01)
IVC | $z_k = 1$      0.83            0.98            0.95
IVC                  0.98            0.98            0.97


Results

NACE 45200

parameter            SBS (input)     VAT             PDR
$\pi_k$              0.13 (0.02)     0.76 (0.03)     0.55 (0.03)
$\tau_k$             1.21 (1.80)     0.07 (0.10)     0.26 (0.13)
$\lambda_k$          0.40 (0.31)     0.99 (0.02)     0.97 (0.02)
IVC | $z_k = 1$      0.26            0.98            0.98
IVC                  0.91            0.98            0.99

parameter            SBS (edited)    VAT             PDR
$\pi_k$              0.14 (0.03)     0.76 (0.03)     0.54 (0.03)
$\tau_k$             0.72 (0.90)     0.08 (0.10)     0.29 (0.13)
$\lambda_k$          0.70 (0.16)     0.99 (0.02)     0.96 (0.02)
IVC | $z_k = 1$      0.68            0.98            0.97
IVC                  0.96            0.98            0.99


Discussion

– Effect of automatic editing on quality of SBS data?
‐ Results suggest: (very) limited effect on validity and bias
‐ SBS turnover values before and after editing often equal
‐ However:
• Model does not cover, e.g., consistency with edit rules
• Turnover reported with relatively high accuracy to begin with
• Results may be different for other industries

– Results suggest: remaining errors in edited data
‐ Use intermittent-error model to improve editing process?
• Imputation based on predicted true values
• Selective editing based on predicted true values and posterior error probabilities


Discussion

– Limitations of current model
‐ True values and errors not normally distributed in practice
‐ Errors between sources could be correlated
‐ Single target variable, but automatic editing is usually a multivariate procedure
‐ Maximum likelihood does not account for sampling design
‐ …?

Thank you for your attention!
