Sander Scholtus, Bart Bakker, Sam Robinson
Evaluating the quality of business survey data before and after automatic editing
Structural Business Statistics (SBS)
Automatic editing in the Netherlands’ SBS:
unedited data → IP → input data → DP1 → data after DP1 → EL/I → data after EL/I → DP2 → edited data
– IP: Input Processing
‐ technical checks
‐ correction of uniform 1,000-errors
– DP1: Deductive Processing 1
‐ IF-THEN rules to solve common errors with a known cause
– EL/I: Error Localisation and Imputation
‐ automatic error localisation based on the Fellegi-Holt paradigm
‐ imputation of missing and discarded values
– DP2: Deductive Processing 2
‐ IF-THEN rules to resolve inconsistencies that cannot be handled in EL/I
manual editing
Data
– SBS 2012
‐ Four industries in car trade (NACE G45)
‐ Focus on total turnover
– Linked to two administrative sources
‐ Value-Added Tax declarations (VAT)
‐ Profit Declaration Register (PDR)

industry (NACE code)                                 45112   45190   45200   45400
population (total)                                  18 680   1 790   6 054   1 763
population (w/o large/complex units)                18 556   1 739   6 018   1 759
SBS net sample, edited                                 914     165     269      74
SBS net sample, edited and linked to admin. data       810     158     231      58
Intermittent-error model
– Extension of model by Guarnera & Varriale (2015, 2016)
‐ Multiple measurements $y_1, \ldots, y_K$ of target variable $\eta$
‐ For each observed variable $y_k$ and unit $i$:
$$y_{ki} = \begin{cases} \eta_i & \text{if } z_{ki} = 0 \\ \tau_k + \lambda_k \eta_i + e_{ki} & \text{if } z_{ki} = 1 \end{cases}$$
with $P(z_{ki} = 1) = \pi_k$ and $e_{ki} \sim N(0, \sigma_k^2)$ (independent across $k$)
‐ Systematic bias in $y_k$ occurs when $\tau_k \neq 0$ and/or $\lambda_k \neq 1$
‐ Linear regression to describe variation in $\eta_i$ across units as a function of covariates:
$$\eta_i = \boldsymbol{\beta}' \boldsymbol{x}_i + u_i, \quad u_i \sim N(0, \sigma^2)$$
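To make the two-regime structure concrete, the following is a minimal simulation sketch of the model as stated on this slide. It is not the authors' code: all parameter values, the covariate distribution and the function names are illustrative assumptions.

```python
import random

def simulate_true_value(x, beta, sigma, rng):
    """eta_i = beta'x_i + u_i with u_i ~ N(0, sigma^2)."""
    mu = sum(b * xj for b, xj in zip(beta, x))
    return mu + rng.gauss(0.0, sigma)

def simulate_observations(eta, params, rng):
    """Draw y_k for each source k; params holds (pi_k, tau_k, lam_k, sigma_k)."""
    ys, zs = [], []
    for pi_k, tau_k, lam_k, sigma_k in params:
        z = 1 if rng.random() < pi_k else 0      # error indicator z_k
        if z == 0:
            y = eta                              # error-free: exact true value
        else:
            y = tau_k + lam_k * eta + rng.gauss(0.0, sigma_k)
        ys.append(y)
        zs.append(z)
    return ys, zs

rng = random.Random(1)
# Illustrative (made-up) parameters for K = 3 sources.
params = [(0.10, 0.7, 0.8, 0.5), (0.90, -0.3, 1.0, 0.1), (0.60, 0.0, 1.0, 0.2)]
beta = [2.0, 0.9]                                # constant term and one covariate

units = []
for _ in range(1000):
    x = [1.0, rng.gauss(5.0, 1.0)]
    eta = simulate_true_value(x, beta, 0.3, rng)
    ys, zs = simulate_observations(eta, params, rng)
    units.append((eta, ys, zs))

# Share of error-free values per source; should be close to 1 - pi_k.
share_error_free = [
    sum(1 for _, _, zs in units if zs[k] == 0) / len(units) for k in range(3)
]
```

In the simulated data an error-free value equals the true value exactly, which is the feature that distinguishes an intermittent-error model from a model in which every measurement carries noise.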
Intermittent-error model
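– Quality indicators for variable $y_k$
‐ Error probability: $\pi_k = P(z_k = 1)$
‐ Intercept and slope parameters (bias): $\tau_k$ and $\lambda_k - 1$
‐ Indicator validity coefficient (IVC): absolute correlation of $y_k$ and $\eta$
• For error-prone records:
$$\mathrm{IVC}(y_k \mid z_k = 1) = \lambda_k^{(s)} = \cdots = \sqrt{1 - \frac{\sigma_k^2}{\lambda_k^2 \sigma_\eta^2 + \sigma_k^2}}, \qquad \sigma_\eta^2 = \boldsymbol{\beta}' \boldsymbol{\Sigma}_{xx} \boldsymbol{\beta} + \sigma^2$$
• For error-free records:
$$\mathrm{IVC}(y_k \mid z_k = 0) = 1$$
• Overall:
$$\mathrm{IVC}(y_k) = \pi_k \, \mathrm{IVC}(y_k \mid z_k = 1) + (1 - \pi_k) \, \mathrm{IVC}(y_k \mid z_k = 0)$$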
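The indicators on this slide can be computed directly from the model parameters. The sketch below assumes the error-prone IVC is the absolute correlation between $\tau_k + \lambda_k \eta + e_k$ and $\eta$, which gives the square-root expression; the function names are hypothetical.

```python
import math

def sigma_eta_sq(beta, sigma_xx, sigma):
    """Variance of eta: beta' Sigma_xx beta + sigma^2."""
    quad = sum(
        beta[i] * sigma_xx[i][j] * beta[j]
        for i in range(len(beta))
        for j in range(len(beta))
    )
    return quad + sigma ** 2

def ivc_error_prone(lam_k, sigma_k, var_eta):
    """IVC(y_k | z_k = 1): absolute correlation of y_k and eta for error-prone records."""
    return math.sqrt(1.0 - sigma_k ** 2 / (lam_k ** 2 * var_eta + sigma_k ** 2))

def ivc_overall(pi_k, ivc_given_error):
    """Mixture of the error-prone IVC and the error-free IVC (which equals 1)."""
    return pi_k * ivc_given_error + (1.0 - pi_k) * 1.0
```

As a check against the reported results for SBS (input) in NACE 45112, `ivc_overall(0.10, 0.70)` gives 0.97 after rounding to two decimals, matching the overall IVC in that table.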
Intermittent-error model
– Under the intermittent-error model…
‐ …the event $y_{ki} = \eta_i$ occurs with probability $1 - \pi_k$
‐ …the event $y_{ki} = y_{li}$ occurs with probability $(1 - \pi_k)(1 - \pi_l)$, and in that case $y_{ki} = y_{li} = \eta_i$
– Some $\eta_i$ and $z_{1i}, \ldots, z_{Ki}$ can be derived from $y_{1i}, \ldots, y_{Ki}$
‐ Example: possible error patterns $(z_{1i}, z_{2i}, z_{3i})$ for $K = 3$
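(0,0,0) (0,0,1) (0,1,0) (1,0,0) (0,1,1) (1,0,1) (1,1,0) (1,1,1)

id   y1    y2    y3     η     z1   z2   z3   P
1    56    56    38     56    0    0    1
2    113   97    113    113   0    1    0
3    148   251   199    148   0    1    1    0.55
                        251   1    0    1    0.01
                        199   1    1    0    0.10
                        172   1    1    1    0.34

For units 1 and 2, two observed values agree, so $\eta$ and the error pattern can be derived. For unit 3 no two observed values agree, so the pattern cannot be derived directly. The slide lists the four candidate patterns with a probability $P$ for each; under (1,1,1) the true value is not among the observed values (shown as ? before the value 172 is filled in).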
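The derivation rule used in this example can be written down in a few lines. The following is a sketch under the assumption that observed values are compared for exact equality (reasonable for reported amounts, not for rounded floating-point values); `candidate_patterns` is a hypothetical name.

```python
def candidate_patterns(y):
    """Return the error patterns compatible with the observed values y.

    Each entry is (eta, z): the implied true value (None if it is not
    observed) and the tuple of error indicators z_k.
    """
    n = len(y)
    # If any two sources agree, both are error-free and eta is their common value.
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                eta = y[i]
                z = tuple(0 if v == eta else 1 for v in y)
                return [(eta, z)]
    # Otherwise at most one source is error-free.
    candidates = []
    for k in range(n):
        z = tuple(0 if m == k else 1 for m in range(n))
        candidates.append((y[k], z))
    candidates.append((None, tuple(1 for _ in y)))   # all sources in error
    return candidates
```

Applied to the three example units, `candidate_patterns((56, 56, 38))` returns the single pattern `(56, (0, 0, 1))`, while `candidate_patterns((148, 251, 199))` returns four candidates, the last of which has an unknown true value.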
Intermittent-error model
– Model estimated by maximum likelihood
‐ Would be straightforward if all $\eta_i$ and $z_{1i}, \ldots, z_{Ki}$ were known
– Use Expectation-Maximisation (EM) algorithm
‐ Implemented in R for $K = 3$
– For units with unknown error patterns, we also obtain:
‐ Expected true value $\hat{\eta}_i$, given the observed values $y_{1i}, y_{2i}, y_{3i}$
‐ Posterior probabilities of error patterns, given $y_{1i}, y_{2i}, y_{3i}$
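For a unit whose observed values all differ, the posterior probabilities of the candidate error patterns and the expected true value follow from the model densities. The sketch below is one way to compute them for given parameter values. It is written in Python for illustration and is not the authors' R implementation; the function names and the closed-form integration over $\eta$ for the all-error pattern are assumptions of this sketch.

```python
import math

def normal_pdf(v, mean, sd):
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posterior_patterns(y, mu, sigma, params):
    """Posterior over error patterns for a unit whose observed values all differ.

    y: observed values; mu: beta'x_i; sigma: sd of u_i;
    params: list of (pi_k, tau_k, lam_k, sigma_k).
    Returns a list of (z, posterior probability, E[eta | z, y]).
    """
    n = len(y)
    results = []
    # Exactly one error-free source k: eta equals y[k].
    for k in range(n):
        eta = y[k]
        w = normal_pdf(eta, mu, sigma) * (1.0 - params[k][0])
        for m in range(n):
            if m != k:
                pi_m, tau_m, lam_m, sigma_m = params[m]
                w *= pi_m * normal_pdf(y[m], tau_m + lam_m * eta, sigma_m)
        results.append((tuple(0 if m == k else 1 for m in range(n)), w, eta))
    # All sources in error: integrate eta out by completing the square.
    a = 1.0 / sigma ** 2
    b = mu / sigma ** 2
    c = mu ** 2 / sigma ** 2
    const = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    for (pi_k, tau_k, lam_k, sigma_k), y_k in zip(params, y):
        a += lam_k ** 2 / sigma_k ** 2
        b += lam_k * (y_k - tau_k) / sigma_k ** 2
        c += (y_k - tau_k) ** 2 / sigma_k ** 2
        const *= pi_k / (sigma_k * math.sqrt(2.0 * math.pi))
    w_all = const * math.sqrt(2.0 * math.pi / a) * math.exp(-0.5 * (c - b ** 2 / a))
    results.append((tuple(1 for _ in y), w_all, b / a))
    total = sum(w for _, w, _ in results)
    return [(z, w / total, eta) for z, w, eta in results]

def expected_true_value(posterior):
    """Posterior-weighted average of the pattern-specific true values."""
    return sum(p * eta for _, p, eta in posterior)
```

In an EM iteration these quantities would be recomputed for every unit with an unknown pattern using the current parameter estimates, and then used as weights when updating the parameters.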
Application
– Target variable ($\eta$): total turnover
– Observed variables:
‐ SBS turnover ($y_{SBS}$)
‐ VAT turnover ($y_{VAT}$)
‐ PDR turnover ($y_{PDR}$)
– Separate models for $y_{SBS}$ before/after automatic editing
– Covariates ($\boldsymbol{x}$):
‐ constant term
‐ SBS number of employees (edited)
‐ SBS total operating costs (edited)
– Logarithmic transformation applied to all variables
Results: NACE 45112

parameter           SBS (input)     VAT             PDR
$\pi_k$             0.10 (0.01)     0.91 (0.01)     0.64 (0.02)
$\tau_k$            0.68 (0.68)     -0.26 (0.06)    0.04 (0.11)
$\lambda_k$         0.80 (0.11)     1.00 (0.01)     0.99 (0.01)
IVC | $z_k = 1$     0.70            0.98            0.96
IVC                 0.97            0.98            0.97

parameter           SBS (edited)    VAT             PDR
$\pi_k$             0.11 (0.02)     0.91 (0.01)     0.64 (0.02)
$\tau_k$            -0.30 (0.49)    -0.21 (0.06)    0.11 (0.12)
$\lambda_k$         0.98 (0.07)     0.99 (0.01)     0.98 (0.01)
IVC | $z_k = 1$     0.83            0.98            0.95
IVC                 0.98            0.98            0.97
Results: NACE 45200

parameter           SBS (input)     VAT             PDR
$\pi_k$             0.13 (0.02)     0.76 (0.03)     0.55 (0.03)
$\tau_k$            1.21 (1.80)     0.07 (0.10)     0.26 (0.13)
$\lambda_k$         0.40 (0.31)     0.99 (0.02)     0.97 (0.02)
IVC | $z_k = 1$     0.26            0.98            0.98
IVC                 0.91            0.98            0.99

parameter           SBS (edited)    VAT             PDR
$\pi_k$             0.14 (0.03)     0.76 (0.03)     0.54 (0.03)
$\tau_k$            0.72 (0.90)     0.08 (0.10)     0.29 (0.13)
$\lambda_k$         0.70 (0.16)     0.99 (0.02)     0.96 (0.02)
IVC | $z_k = 1$     0.68            0.98            0.97
IVC                 0.96            0.98            0.99
Discussion
– Effect of automatic editing on quality of SBS data?
‐ Results suggest: (very) limited effect on validity and bias
‐ SBS turnover values before and after editing often equal
‐ However:
• Model does not cover, e.g., consistency with edit rules
• Turnover reported with relatively high accuracy to begin with
• Results may be different for other industries
– Results suggest: remaining errors in edited data
‐ Use intermittent-error model to improve editing process?
• Imputation based on predicted true values
• Selective editing based on predicted true values and posterior error probabilities
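As an illustration of the last idea, one possible way to combine the two model outputs into a selective-editing priority is sketched below. The score (posterior error probability times absolute deviation from the predicted true value) is a hypothetical choice made for this sketch, not a method proposed in the talk, and the record fields are invented for the example.

```python
def rank_for_editing(records):
    """Order unit ids by a hypothetical score: p_error * |y_sbs - eta_hat|.

    records: list of dicts with keys "id", "y_sbs" (reported value),
    "eta_hat" (predicted true value) and "p_error" (posterior probability
    that the SBS value is in error).
    """
    scored = [
        (r["p_error"] * abs(r["y_sbs"] - r["eta_hat"]), r["id"]) for r in records
    ]
    # Highest score first: likely errors with a large expected impact.
    return [unit_id for _, unit_id in sorted(scored, reverse=True)]

example = [
    {"id": "A", "y_sbs": 100.0, "eta_hat": 150.0, "p_error": 0.9},
    {"id": "B", "y_sbs": 100.0, "eta_hat": 101.0, "p_error": 0.1},
    {"id": "C", "y_sbs": 200.0, "eta_hat": 180.0, "p_error": 0.5},
]
priority = rank_for_editing(example)
```

Units at the top of the list would be candidates for manual review; the remainder could be left to automatic treatment.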
Discussion
– Limitations of current model
‐ True values and errors not normally distributed in practice
‐ Errors between sources could be correlated
‐ Single target variable, but automatic editing is usually a multivariate procedure
‐ Maximum likelihood does not account for sampling design
‐ …?
Thank you for your attention!