Some considerations on developing a DWH for SBS
estimates
Orietta Luzi – Mauro Masselli, Istat - Italy, March 2013
The rationale of DWH
• To make complete use of all the information (survey and administrative data) available on the entire target population;
• To build a platform that integrates data and processes (from capturing to integrating data, from checking data to estimating results and disseminating estimates);
• The advantages (elimination of sampling errors on one side, process integration and standardization on the other) exceed the disadvantages due to increasing non-sampling errors and the partial loss of control over administrative data.
Goals
• First step: to establish a common set of estimates (micro/macro) shared by SBS and NA on the observed economy
• Second step: integration of other business surveys (structural: ICT, R&D, external trade, …, and STS)
Implications
– Revision of the sampling designs of SBS surveys
– Revision of production processes
Business Register
• BR has a central role as selection list and “frame”
• The target population is identified with all the enterprises listed in the Business Register.
• For each unit, BR contains two kinds of variables:
– classification variables (NACE, legal status, splits and joins, current status, etc.)
– content variables (e.g. the total number of persons employed, subtotals for different kinds of workers, labour costs, an estimate of turnover, …).
• We assume that the classification variables, the variable “persons employed” and the implicit binary variable “existence of business” are themselves target variables and call them Z; they are kept as BR provides them and do not enter any data treatment procedure.
Target variables
The target variables can be divided into two groups:
• A set of “basic variables” X*, needed for the estimates required by the SBS EU Regulation and for NA estimates;
• The remaining variables Y*, needed only for NA, to be estimated conditionally on the first set.
Sources
• The administrative sources: tax files, balance sheets, social security workers' data, fiscal authority survey
• SBS surveys at the moment; other structural business surveys in the near future
Administrative data
• How to assess the quality? Some results from the ESSnet Admin Data
• Essentially:
– Definitions: how close they are to the SBS ones
– Data analysis
» On overlapping data sets
» To identify biases
» Analysis of distributions
» Models on relationships between data sources
Administrative data
• Advantages: costs, completeness
• Disadvantages: stability over time; data can be changed by internal decision of the producing administration
» Operational definitions
» Data
• Indicators from overlapping data sets
• Agreements with producers
Redesign sample surveys
From the collected variables to the target ones
For each enterprise, some of the X* variables may exist in one or more of the S sources, in different combinations, depending on the size of the enterprise, the social security rules, the fiscal status, etc.
Only for the sampled respondent units do we have a complete set of target variables; these variables are set equal to X*.
The variables Ai reported in source “i” may coincide with or approximate the corresponding X*; in the second case it may be possible to “correct” some of them, obtaining a more precise “estimate” Xi of X*.
Number of sources | Businesses | Share
1 | 788038 | 17.7%
2 | 2026129 | 45.6%
3 | 1160548 | 26.1%
4 | 358255 | 8.1%
5 | 8268 | 0.2%
No source | 102645 | 2.3%
Total | 4443883 | 100.0%
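As a quick check of the coverage figures above, the shares can be recomputed from the counts. This is only a small arithmetic sketch; the dictionary keys are labels chosen here for illustration.

```python
# Counts taken from the table above: businesses by number of sources
# in which they appear.
counts = {
    "1": 788038,
    "2": 2026129,
    "3": 1160548,
    "4": 358255,
    "5": 8268,
    "no source": 102645,
}

total = sum(counts.values())

# Share of the Business Register population in each class, in percent.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Businesses found in at least one source.
covered = total - counts["no source"]
covered_share = round(100 * covered / total, 1)

for label, share in shares.items():
    print(f"{label:>9}: {counts[label]:>8} {share:5.1f}%")
print(f"covered by at least one source: {covered_share}%")
```

Almost all businesses in the register are reached by at least one source, which is what makes the massive use of administrative data attractive.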
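Target variables: x1*, x2*, …, xj*, …, xK* (example of one unit not covered by the survey)

Source | Version | x1* | x2* | . | . | . | xj* | . | . | . | xK*
Source 1 - survey | original = corrected | na | na | na | na | na | na | na | na | na | na
Source 2 | original | a2,1 | na | na | na | na | na | na | a2,j+2 | na | a2,K
Source 2 | corrected | x2,1 | na | na | na | na | na | na | x2,j+2 | na | x2,K
Source 3 | original | na | na | na | na | a3,j-1 | na | na | na | na | a3,K
Source 3 | corrected | na | na | na | na | x3,j-1 | na | na | na | na | x3,K
Source 4 | original | na | na | na | na | na | na | na | na | na | na
Source 4 | corrected | na | na | na | na | na | na | na | na | na | na
Source 5 | original | na | na | na | na | na | na | na | na | na | na
Source 5 | corrected | na | na | na | na | na | na | na | na | na | na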
From Ai to Xi
xij = aij in case of “good” fitting
xij = f(aij, …) otherwise
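The rule from Ai to Xi can be sketched as a lookup of correction functions. This is a minimal illustration, not the actual Istat procedure: the rule table, the source and variable names, and the unit conversion used as a correction are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Correction = Callable[[float], float]

# Hypothetical rule table: (source, variable) -> correction function f.
# A missing entry means the administrative value is judged a "good" fit
# and is taken as it is (x_ij = a_ij).
corrections: Dict[Tuple[str, str], Correction] = {
    # Illustrative only: pretend source 3 reports turnover in thousands.
    ("source3", "turnover"): lambda a: a * 1000.0,
}


def to_x(source: str, variable: str, a: Optional[float]) -> Optional[float]:
    """Map a collected value a_ij to x_ij; missing values stay missing."""
    if a is None:
        return None
    f = corrections.get((source, variable))
    return a if f is None else f(a)


print(to_x("source2", "turnover", 120.0))  # good fit: value kept
print(to_x("source3", "turnover", 1.5))    # corrected by f
```

Keeping the corrections in a table, one entry per source and variable, makes it easy to document which A variables were judged to fit and which were transformed.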
BR group of businesses (Z, ID codes, number) | SBS survey (X1, part of X*) | Source 2 (X2, part of X*) | Source 3 (X3, part of X*) | Source 4 (X4, part of X*) | Source 5 (X5, part of X*)
N1
N2
N3
N4
N5
N6
N7
…
Nn
No source
From the matrix X to the matrix X* by establishing a hierarchy between sources
Macro-operators
• Establishing target population: list from Business Register and variables Z
• Establishing target variables X*: reconciliation between NA and SBS operative definitions
• Establishing Ai … AS (collected variables): analysis of data and definitions of the different sources Ai with respect to the definitions of X*; the purpose is to evaluate the similarity of definitions in order (i) to establish a hierarchy between the sources and (ii) to identify the corrections to variables A
• Establishing variables Xi: from variables A to variables X; where necessary and possible, correction of A; the variables aij are transformed into xij by a “function”: xij = aij in case of “good” fitting, or xij = f(aij, …) in case of correction; outlier detection, selective editing
• Establishing variables X*: hierarchy between sources/variables
Donor methods
• Randomly
• By models
– E.g. the projection estimator
• By calculating a new variable to be used as a distance between donor and recipient
• Latent variable models
In all the methods we can use ex ante domains or identify the appropriate variables to build the donor domains.
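The simplest of the donor methods, random selection within ex ante domains, could look like the sketch below. The record layout and the field names (`nace`, `turnover`, `imputed`) are assumptions made for the example; a production procedure would add editing rules and donor usage limits.

```python
import random
from collections import defaultdict


def random_donor_imputation(units, variable, domain_key, seed=0):
    """Fill a missing `variable` with the value of a random donor
    drawn from the same domain; units without donors stay missing."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    donors = defaultdict(list)
    for u in units:
        if u.get(variable) is not None:
            donors[u[domain_key]].append(u[variable])

    result = []
    for u in units:
        rec = dict(u)  # do not modify the input records
        pool = donors.get(rec[domain_key], [])
        if rec.get(variable) is None and pool:
            rec[variable] = rng.choice(pool)
            rec["imputed"] = True
        result.append(rec)
    return result


units = [
    {"id": 1, "nace": "C", "turnover": 120.0},
    {"id": 2, "nace": "C", "turnover": 95.0},
    {"id": 3, "nace": "C", "turnover": None},
    {"id": 4, "nace": "G", "turnover": 300.0},
    {"id": 5, "nace": "G", "turnover": None},
    {"id": 6, "nace": "H", "turnover": None},  # no donor in this domain
]
filled = random_donor_imputation(units, "turnover", "nace")
```

The other methods in the list differ mainly in how the donor is chosen: by a model prediction, or by a distance computed on a new variable, instead of a random draw.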
Establishing coherence: modify data of source i by data of source j
• Change some variables Xi
• Check the impact on the other variables Xi
• Modify other variables Xi
• Assess Xi
$$(x_{i,k};\, x_{i,m}) = f(x_{j,k};\, x_{j,m};\, x_{j,n})$$
$$(x_{i,l};\, x_{i,w};\, x_{i,s}) = f(x_{i,k};\, x_{j,m})$$
E&I rules
Outlier detection and removal
A simplified example
Source i
• Persons employed
• Turnover
• Value added, labour costs
• Intermediate costs
» Services
• Value added / persons employed ?
BR
• Persons employed and labour costs
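A check on the ratio value added / persons employed, as in the example above, can be sketched as follows. The slides do not say which outlier rule is used; the median/MAD rule and the threshold of 3 below are assumptions chosen only for illustration, as are the field names.

```python
from statistics import median


def flag_ratio_outliers(units, threshold=3.0):
    """Flag units whose value added per person employed lies far from
    the median of the group (median/MAD rule, an assumed choice).
    Assumes persons_employed > 0 for every unit."""
    ratios = [u["value_added"] / u["persons_employed"] for u in units]
    med = median(ratios)
    mad = median([abs(r - med) for r in ratios])
    if mad == 0:
        return [False] * len(units)  # no spread: nothing can be flagged
    return [abs(r - med) / mad > threshold for r in ratios]


# value_added from source i, persons_employed from BR (hypothetical data)
units = [
    {"value_added": 400.0, "persons_employed": 10},
    {"value_added": 420.0, "persons_employed": 10},
    {"value_added": 380.0, "persons_employed": 10},
    {"value_added": 410.0, "persons_employed": 10},
    {"value_added": 4000.0, "persons_employed": 10},
]
flags = flag_ratio_outliers(units)
```

Here only the last unit is flagged; in practice the rule would be applied within homogeneous domains (for example by NACE and size class).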
Sources hierarchy
• Ex ante, based on:
– How close the definitions of source i are to the SBS ones
» BR/social security data
» SBS sample survey
» Balance sheets
» Fiscal authority survey
» Tax files
– Previous and current data analysis
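Once a hierarchy is fixed, selecting X* for a unit amounts to taking each variable from the highest-ranked source that reports it. The sketch below assumes that the order in which the sources are listed above is the priority order, which the slides do not state explicitly; the source keys and variable names are hypothetical.

```python
# Assumed priority order, highest first (an assumption, see lead-in).
HIERARCHY = [
    "br_social_security",
    "sbs_survey",
    "balance_sheets",
    "fiscal_authority_survey",
    "tax_files",
]


def select_x_star(x_by_source, variables, hierarchy=HIERARCHY):
    """For each variable return (value, source) from the highest-ranked
    source that has a non-missing value, or None if no source has it."""
    x_star = {}
    for var in variables:
        x_star[var] = None
        for source in hierarchy:
            value = x_by_source.get(source, {}).get(var)
            if value is not None:
                x_star[var] = (value, source)
                break
    return x_star


# Corrected values X_i available for one enterprise (hypothetical data).
x_by_source = {
    "sbs_survey": {},
    "balance_sheets": {"turnover": 510.0, "labour_costs": 120.0},
    "tax_files": {"turnover": 498.0, "intermediate_costs": 260.0},
}
x_star = select_x_star(
    x_by_source,
    ["turnover", "labour_costs", "intermediate_costs", "value_added"],
)
```

Keeping the chosen source next to each value makes later coherence checks between variables coming from different sources easier to trace.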
Correct A data to obtain X data
xi,k = f(ai,k,ai,m…)
By data analysis on overlapping data sets
By definitions
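One simple way to obtain f from an overlapping data set is a ratio adjustment fitted on the units present both in source i and in the survey. This is only a sketch under that assumption; the slides do not specify the form of f, and a regression or a definition-based formula could be used instead.

```python
def fit_ratio_correction(pairs):
    """Fit x = b * a on overlapping units.

    pairs: iterable of (a, x_star) for units observed both in the
    administrative source (a) and in the survey (x_star).
    Returns the correction function f and the fitted ratio b.
    """
    pairs = list(pairs)
    sum_a = sum(a for a, _ in pairs)
    sum_x = sum(x for _, x in pairs)
    if sum_a == 0:
        raise ValueError("cannot fit a ratio when source values sum to zero")
    b = sum_x / sum_a
    return (lambda a: b * a), b


# Hypothetical overlapping units: (source value, survey value).
overlap = [(100.0, 110.0), (200.0, 220.0), (50.0, 55.0)]
f, b = fit_ratio_correction(overlap)
corrected = f(300.0)
```

The fitted f would then be applied to the units that appear in source i but not in the survey.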
Other considerations
How to fill in the matrix X* to obtain the matrix X**
Except for the group M1 (survey respondents), in all the other cases the number of available X* variables is smaller than K (the number of target variables needed).
So, to obtain the estimates, we can consider two options:
• a massive imputation of missing values at micro level
• an estimation of missing X* at macro level
Massive imputation: micro approach
[Flow diagram]
Inputs: BR; Survey; admin sources
• Micro data treatment in the single sources
• Micro integration: Z, A(1), A(2), …, A(S)
• Calculating X(1), …, X(S)
• Micro integration: Z, X(1), X(2), …, X(S)
• Selection of X*; E&I; coherence among different sources
• Micro data: Z, X*
• Massive imputation
• E&I; coherence among different sources on imputed units
• Micro data: Z, X**, Y*
• SBS estimation
• Estimation of variables Y*
• Micro NA treatment
• NA estimates
Macro approach
[Flow diagram]
Inputs: BR; Survey; admin sources
• Micro data treatment in the single sources
• Micro integration: Z, A(1), A(2), …, A(S)
• Calculating X(1), …, X(S)
• Micro integration: Z, X(1), X(2), …, X(S)
• Selection of X*; outlier detection
• Micro data: Z, X*
• Summing up by domains; cleaning up inconsistencies
• Domain D estimates: X**, Y*
• SBS estimates
• Estimation of variables Y*
• NA estimates
Totals by domain and source (each cell is the domain total of the xi,j available from source i; survey totals are weighted by w)

Domain | SBS survey | Source 2 | Source 3 | Source 4 | Source 5
D1 | x1,j(D1), weighted | | x3,j(D1) | x4,j(D1) | x5,j(D1)
D2 | | x2,j(D2) | | | x5,j(D2)
… | | | | |
Dr | | x2,j(Dr) | x3,j(Dr) | x4,j(Dr) |
… | | | | |
DR-1 | | | | |
DR | x1,j(DR), weighted | | | x4,j(DR) | x5,j(DR)
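The totals by domain and source used in the macro approach can be sketched as a simple aggregation of micro records. The record layout and field names are assumptions; survey records carry a sampling weight, while administrative records are summed with weight 1.

```python
from collections import defaultdict


def domain_totals(records):
    """Sum values by (domain, source); survey values are weighted."""
    totals = defaultdict(float)
    for r in records:
        weight = r.get("weight", 1.0)  # admin sources: weight 1
        totals[(r["domain"], r["source"])] += weight * r["value"]
    return dict(totals)


# Hypothetical micro records for one variable x_j.
records = [
    {"domain": "D1", "source": "sbs_survey", "value": 10.0, "weight": 5.0},
    {"domain": "D1", "source": "sbs_survey", "value": 20.0, "weight": 2.0},
    {"domain": "D1", "source": "source3", "value": 70.0},
    {"domain": "D2", "source": "source2", "value": 15.0},
    {"domain": "D2", "source": "source2", "value": 25.0},
]
totals = domain_totals(records)
```

Cells that have no records simply do not appear in the result, which mirrors the empty cells of the domain-by-source table.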
Cross section and longitudinal approach
At the moment: the cross-sectional approach.
However, the longitudinal approach has significant features:
• using “variations” is the logic adopted by NA estimating procedures
• we have “more information” to deal with
• implication: all the functions regarding data control and imputation procedures could be developed considering both cross-sectional and longitudinal “rules”
Metadata
Generally speaking, we can roughly divide them into three broad sets:
• metadata needed to manage the data,
• the information related to processes and procedures,
• the wider documentation related to the different topics in developing the DWH.
Sustainability different tools for managing