# statistical data editing

DESCRIPTION

Statistical Data Editing. Anders Norberg, Statistics Sweden (SCB) 2011-11-11. Editing. Editing is an activity of detecting, resolving and understanding errors in data and produced statistics. Selective data editing. - PowerPoint PPT PresentationTRANSCRIPT

Statistical Data EditingStatistical Data Editing

Anders Norberg, Statistics Sweden (SCB)

2011-11-11

Editing

Editing is an activity of detecting, resolving and understanding errors in data and produced statistics

2

Selective data editing

The purpose of selective data editing is to reduce cost for the statistical agency as well as for the respondents, without significant decrease of the quality of the output statistics.

3

Papers by my colleague Leopold Granquist

Granquist (1984). On the role of editing. Statistical Review 2

Granquist (1997). The New View on Editing. International Statistical Review

Granquist and Kovar (1997). Editing of Survey Data: How Much is Enough? In Survey Measurement and Process Quality. Wiley.

If …

• we only want information from businesses that we know they have,

• and we ask for that information so they understand,

• and we motivate them to deliver as good quality in data as possible,

• and we help them to avoid accidental errors in answering questionnaires,

• then editing would be a minor process!5

Where errors are Where errors are introducedintroduced

• Errors in raw data delivered by respondents to the statistical agency are typically non-response and measurement errors

• Errors in data transmissions

• The statistics production process is a mixture of many activities with risks of introducing errors

6

The role of editingThe role of editing

• Quality Control of the measurement process– Find errors (efficient controls)

– Consider every identified error as a problem for the respondent to deliver correct data by our collection instrument

– Identify sources of error (process data)

– Analyse process data – communicate with cognitive specialists

• Contribute to quality declaration

• Adjust (change/correct) significant errors

3

Types of errors Obvious errors / Fatal errors

Non-valid values Item non-response Data structure- or model errors, total≠sum of components Contradictions

Suspected data values Deviation errors (Outliers)

• Suspiciously high/low values, data outside of predetermined limits

Definition errors (Inliers)• Many respondent miss-understand a question in the same way• Many respondents fetch data from info-systems with other definitions

Suspected data valuesDeviation errors

• Manual follow-up takes time and is expensive

• Few deviation errors have impact on output statistics (low hit-rate, many changes in data have very little impact)

Editing must have impact on the output!Remember response burdon !

Suspected data valuesDefinition errors (Inliers)• Difficult to find

• Ways to find them: Combined editing for several surveys

Deep interviews in focus groups

Use statistics from FEQ and from re-contacts with respondents

High proportions of item non-response

Graphical editing

Good examples

The Process Perspective

• Audit and improve data collection (measurement instrument and collection process)

• and the editing process itself

Un-edited data must be saved in order to produced important process indicators, as hit-rate and impact on output!

Process data

• Sources of errors = problem for the respondents

• Suspicions

• Error codes

• Manuel actions (accept / amended values)

• Automatic actions

Process indicators

• Sources of errors (problem for the respondents)

• Prop. of flagged units and variables

• Prop. of manually and automatically reviewed units and variables

• Prop. of amended values and impact of the changes, per variable

• Hit-rate for edits

14

Establish

needs

1

Plan and design

2

Create and test

3

Collect 4

Prepare and process

5

Analyse

6

Report and

communicate

7

Statistical Production Process

survey statistical needs

1.1

Affirmcustomer needs

1.2

Developtable plan

1.3

Identifydata sources

1.4

Examine disclosure

1.5

Carry out market process

1.6

Prepare data and statistical values

for dissemination7.1

produce final output

7.2

Report and communicate

final output7.3

handleinquiries

7.4

Marketfinal output

7.5

Classify and code micro data

5.1

Check micro data5.2

Impute for Non-response

5.3

Complement Data set

5.4

Calculate weights5.5

Establish final observation register

5.6

Create frame and draw sample

4.1

Handle respondent

issues4.2

Prepare data collection

4.3

Carry out data collection

4.4

Transfer and storedata electronically

4.5

Plan and design table plan

2.1

Plan and designdata collection

2.3

Plan and designdata processing

2.4

Plan and designanalysis and

reporting2.5

Planproduction flow

2.6

Designproduction system

2.7

Create and testMeasurem. instrument

3.1

Develop existing and building newproduction tools

3.2

Assure communication between

production tools3.3

Test production system

3.4

Carry outpilot test

3.5

Implementproduction tools

3.6

Implement production system

3.7

Producestatistical values

6.1

Quality assuranceof produced

statistics6.2

Interpret andexplain

6.3

Prepare contents for reporting andcommunication

6.4

Establish contentsfor reporting andcommunication

6.5

Plan and designframe/population

and sample

2.2

15

Establish

needs

1

Plan and design

2

Create and test

3

Collect 4

Prepare and process

5

Analyse

6

Report and

communicate

7

Statistical Production Process

survey statistical needs

1.1

Affirmcustomer needs

1.2

Developtable plan

1.3

Identifydata sources

1.4

Examine disclosure

1.5

Carry out market process

1.6

Prepare data and statistical values

for dissemination7.1

produce final output

7.2

Report and communicate

final output7.3

handleinquiries

7.4

Marketfinal output

7.5

Classify and code micro data

5.1

Check micro data5.2

Impute for Non-response

5.3

Complement Data set

5.4

Calculate weights5.5

Establish final observation register

5.6

Create frame and draw sample

4.1

Handle respondent

issues4.2

Prepare data collection

4.3

Carry out data collection

4.4

Transfer and storedata electronically

4.5

Plan and design table plan

2.1

Plan and designdata collection

2.3

Plan and designdata processing

2.4

Plan and designanalysis and

reporting2.5

Planproduction flow

2.6

Designproduction system

2.7

Create and testMeasurem. instrument

3.1

Develop existing and building newproduction tools

3.2

Assure communication between

production tools3.3

Test production system

3.4

Carry outpilot test

3.5

Implementproduction tools

3.6

Implement production system

3.7

Producestatistical values

6.1

Quality assuranceof produced

statistics6.2

Interpret andexplain

6.3

Prepare contents for reporting andcommunication

6.4

Establish contentsfor reporting andcommunication

6.5

Plan and designframe/population

and sample

2.2

16

Establish

needs

1

Plan and design

2

Create and test

3

Collect 4

Prepare and process

5

Analyse

6

Report and

communicate

7

Statistical Production Process

survey statistical needs

1.1

Affirmcustomer needs

1.2

Developtable plan

1.3

Identifydata sources

1.4

Examine disclosure

1.5

Carry out market process

1.6

Prepare data and statistical values

for dissemination7.1

produce final output

7.2

Report and communicate

final output7.3

handleinquiries

7.4

Marketfinal output

7.5

Classify and code micro data

5.1

Check micro data5.2

Impute for Non-response

5.3

Complement Data set

5.4

Calculate weights5.5

Establish final observation register

5.6

Create frame and draw sample

4.1

Handle respondent

issues4.2

Prepare data collection

4.3

Carry out data collection

4.4

Transfer and storedata electronically

4.5

Plan and design table plan

2.1

Plan and designdata collection

2.3

Plan and designdata processing

2.4

Plan and designanalysis and

reporting2.5

Planproduction flow

2.6

Designproduction system

2.7

Create and testMeasurem. instrument

3.1

Develop existing and building newproduction tools

3.2

Assure communication between

production tools3.3

Test production system

3.4

Carry outpilot test

3.5

Implementproduction tools

3.6

Implement production system

3.7

Producestatistical values

6.1

Quality assuranceof produced

statistics6.2

Interpret andexplain

6.3

Prepare contents for reporting andcommunication

6.4

Establish contentsfor reporting andcommunication

6.5

Plan and designframe/population

and sample

2.2

Average proportions of costs of sub-processes 2004

17

Process Proportion of total cost (%)

All products Short-period Annual surveys and periodic

Respondent service 3.3 3.3 3.4

Manual pre-editing 4.4 3.9 5.1

Data-registration editing 5.6 5.1 6.5

Production editing 15.3 12.7 18.9

Output editing 3.9 3.4 4.8

Total editing cost 32.6 28.3 38.6

Web data collection

Demands:

• High hit-rate in electronic questionnaires

• System that can measure hit-rate?

Question:

• Can it be a goal for us to move all editing to electronic data collection?

Expectation on the production editing process at Stat. Sweden

• Generic IT-tools– Less IT-maintenance– Easier planning of work and personnel at Data collection units– Better working environments– Methodology studies

• Efficient editing methods– Selective/significance editing– Better working environments– Less response burden

• Collection and analysis of process data– Continuous improvement of data collection and editing processes– Information for quality declaration of statistics

Input, throughput, output

-Coding Sum of wages by Industry -Decision makingRespondent (u) has one or several sampled units -Editing Industry -Information

-Imputation ASampled unit (k) -Estimation B

Observed Background variable Measurement var. (j) C

unit (l) Industry Gender Occup. 1 2=Wage D

1 E

2 B M 2 F - Z

3

4 Sum of wages by Occupation and Gender

GenderOccupa-tion Men Women Sum

1234

Sum

Input Throughput Output Use

jkly

Impact

• Actual impact = w ( y_une – y_edi) for observation k is

the impact on domain-total T if y_une is kept instead of

making a review to find y_edi.• Potential impact = w (y_edi – y_pred) is a proxy for

actual impact to be used in practice, as y_edi will not be

known until review. y_pred is a prediction (expected

value) for y_edi.• Anticipated (expected) impact (per domain, variable,

observation) is the product of suspicion and potential impact.

Suspicion: Traditional edits

Utkast/Version STATISTISKA CENTRALBYRÅN DOKUMENTTYP 1(1) Avd/Enhet/Projekt/Arbetsgrupp, etc 20xx-xx-xx Handläggare/Författare

Finding acceptance limits: Data from previous survey rounds

Hourly wage distributed by SNI code at one-digit level.

Selective data editing

Construct a score function for prioritizing variables and records. Alternatives:

1) Potential impact on statistics for records flagged by at least one traditional edit

2) Sum of expected impact on statistics for variable values flagged as suspected by edits

Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden

23

Selective data editing

Potential

impact

Suspicion0 1

Flagged

A procedure which targets only some of the micro data variables or records for review by prioritizing the manual work.

24

1

5

Predicted (expected) valuesPredicted (expected) values

Data / predictor•Time series

•Previous value

•Forecast

•Cross section•Mean/standard error

•Median/quartile

Edit groups All data

Blue collar workers

White collars

Monthly pay

Weekly pay

Profession=3111

Profession=3112

Payment by the hour

Monthly pay

Weekly pay

Payment by the hour

Profession =1 Profession= 2 Profession= 3 Profession=9

Profession=3480

MenWomen

Profession=3113

21

Suspicion

R=

Suspicion=R/(TAU+R)

Susp

KAPPA = 0. The ratio R is the distance between t and the centre t~ divided by the dispersion range r = )()( ~~ LU tt ,

R = a/r:

KAPPA = 1. The ratio R is the distance from the nearest range limit divided by the range. Hence R = a/r. For data between the lower and upper limits of the dispersion range the suspicion is zero.

* * * r

a

* * *

a

r

lkjU

lkjlkjlkjL

lkjU

lkjlkjU

lkjlkjlkj

lkjU

lkjlkjlkjL

lkjlkjlkj

Llkjlkjlkjlkj

Llkj

Ulkjlkj

Llkjlkjlkj

zzKAPPAzzifzzzzKAPPAzz

zzKAPPAzzzzKAPPAz if

zzKAPPAzzifzzzzzKAPPAz

,,,,,,,,,,,,,,,,,,,,

,,,,,,,,,,,,,,

,,,,,,,,,,,,,,,,,,,,

~~~)~~/(~~~

~~~~~~ 0

~~~)~~/(~~~

MedianQ1 Q3

0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

1

KAPPA=0 TAU=2

MedianQ1 Q3

0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

1

KAPPA=1 TAU=2

MedianQ1 Q3

0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

1

KAPPA=1 TAU=5

MedianQ1 Q3

0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

1

KAPPA=2 TAU=0,1

Score function• Local score, by domain d, variable j & observed unit k,l

is the anticipated impact related to an appropriate measure of size for the domain/variable, say standard error of estimate.

• VIOLINj = weights for variables (j)

• CLARINETc = weights for domains by classifications d(c)

• OBOEj = adjustment for size of estimated total or its standard error (j)

• LScored,j,k,l = Suspicionj,k,l x(Potential impact) x CELLOd(c),j

( ){ }( ) jOBOE

1t,j,d1t,j,dj

)c(dj

j),c(dT̂SE,T̂ALFAmaximum

CLARINETVIOLINCELLO

•

•=

Score function

• Global scores are aggregated local scores by (5) domains, (4) variables, second stage units (3) to one score for each primary unit (2) and finally to (1) respondent.

• Methods: sum, sum of squares, sum of local scores truncated by local thresholds, maximum etc.

LAMBDA1

l

LAMBDAl,kk BETALScore,0maxGScore

28

Cut-off or probability sampling?Cut-off or probability sampling?

Say that 821 of the total sample (n=4 000) have a score >0.

There are two options for manual review:

– Cut-off sampling: Score2 >Threshold2, assuming the remaining bias is small

– Two-phase sampling: πps-sampling and design-based estimation of measurement errors to subtract from initial estimates

Ilves, K. (2010): Probability Approach to Editing. Workshop on Survey Sampling Theory and Methodology, Vilnius, Lithuania, August 23-27, 2010

38

EvaluationEvaluation

Relative pseudo-bias is a measure of error in output due to incomplete data review

100

100q

T̂SE

T̂T̂qRPB

EvaluationEvaluation

Psedobias for PPI relative to the overall price index. Observation units ordered in descending order of impact.

0

0,02

0,04

0,06

0,08

0,1

0,12

0,14

0,16

0,18

0,2

1 42 83 124 165 206 247 288 329 370 411 452 493 534 575 616 657 698 739 780 821

Antal ändringar

Editing – remaining methodology issuesEditing – remaining methodology issues

Fatal errors– Classifying variables– Survey variables

Confidence (respondents and clients) New and old respondents Edited in earlier processes

– Web-questionnaires– Scanned paper questionnaires

Data and methods for computing predicted values etc. Homogenous edit groups Priorities; variables, domains (from the clients perspective) Score functions How to decide threshold values Sampling below threshold

– Inference– Data for evaluation

SELEKT 1.1

Survey specific cold adapter (SAS code)Data preparation

SAS data set

PRE-SELEKTParameter specifications,Analysis of cold data

AUTOSELEKTScore calculation & record flagging

Records to FOLLOW-UP

Process data and reports

Input (hot) survey data

Records to IMPUTATION

Raw+edited past (cold) survey data

Survey specific hot adapter (SAS code)Data preparation

SAS data set

Table of Parameters

Table of Estimates

Accepted records

CLAN estimation software

SNOWDON-X analysis of edits

Edits

42

SELEKT Parameters (in)SELEKT Parameters (in)Parameter_group Parameter Value

A. Titles TITLE1 Simulated quarterly salary statistics. Enterprise data.A. Titles TITLE2 SELEKT ver 1.0 2010-02-04

B. Data to be edited EDIT_DATA Adap_Ent_Inflowdata

B. Data to be edited EDIT_T1_VALUE 2009

B. Data to be edited EDIT_T2_VALUE 1

C. Data structure ID2_TYPE

C. Data structure ID2_VAR EntNr

C. Data structure ID2_WGT PopEnt/SampleEnt

C. Data structure ID3_VAR

C. Data structure ID3_WGT 1

C. Data structure T1_VAR Year

C. Data structure T2_VAR Quarter

D. Y-, X- and Test-Variables EDI_X1 1

D. Y-, X- and Test-Variables EDI_X2 1

D. Y-, X- and Test-Variables EDI_Y1 Last_PopEmp

D. Y-, X- and Test-Variables EDI_Y1_T1 Last_PopEmp/Frame_PopEmpD. Y-, X- and Test-Variables EDI_Y1_T2 Last_PopEmp/First_Turnover

D. Y-, X- and Test-Variables EDI_Y2 Last_Turnover

D. Y-, X- and Test-Variables EDI_Y2_T1 Last_Turnover/Pre_Turnover

D. Y-, X- and Test-Variables UNE_X1 1

D. Y-, X- and Test-Variables UNE_X2 1

D. Y-, X- and Test-Variables UNE_Y1 First_PopEmp

D. Y-, X- and Test-Variables UNE_Y1_T1 First_PopEmp/Frame_PopEmpD. Y-, X- and Test-Variables UNE_Y1_T2 First_PopEmp/First_Turnover

D. Y-, X- and Test-Variables UNE_Y2 First_Turnover

D. Y-, X- and Test-Variables UNE_Y2_T1 First_Turnover/Pre_Turnover

E. Cold deck. Cross section CS_EXPECT MEDIAN

E. Cold deck. Cross section CS_HIST_DATA Adap_Ent_Surveydata(where=(Quarter^=3))E. Cold deck. Cross section CS_MIN_OBS 7

E. Cold deck. Cross section CS_SPREAD QUANTILES

E. Cold deck. Cross section CS_T1_END_VALUE 2008

E. Cold deck. Cross section CS_T1_START_VALUE 2005

E. Cold deck. Cross section CS_T2_END_VALUE 4

E. Cold deck. Cross section CS_T2_START_VALUE 1

E. Cold deck. Cross section HG_DIG1 1

E. Cold deck. Cross section HG_DIG2 1

E. Cold deck. Cross section HG_DIG3 1

E. Cold deck. Cross section HG_DIG4 8

E. Cold deck. Cross section HG_LENGTH 8

E. Cold deck. Cross section HG_MAX 4

Parameter_group Parameter Value

E. Cold deck. Cross section HG_VAR1 EnterpriseSize

Give AUTO-SELEKT the parameters:

%let PATH_sys=C:\SELEKT\1.0;%let PATH_app=C:\SELEKT\Prod\Demo1_Enterprise;%let EDIT_parms=Ent_Parms1;%let EDIT_data=Demo1_Adap_Ent_Inflowdata; %let EDIT_T1_Value=2009;%let EDIT_T2_Value=1;

By the parameter table &EDIT_parms AUTO-SELEKT knows what to do.

SELEKT Error list (out)• Identification:

• Column name = Variable

• Id1 = Identity for respondent[optional]

• Id2 = Identity for primary sampling unit (PeOrgnr, CfarNr etc.)

• Id3 = Identity for observational unit (Social security number, CN8 for products)

• EditNumber = Edit identification, if the edit flags for suspicions or obvious error.

• Timestamp = Time when the questionnaire passes SELEKT

• Process data:

• EditFlag = 0 = accepted, 1-5 = error flagged

• EditSuspicion = Suspicion generated by continuous edits.

• Score1 = (Local) Score for respondent[optional]

• Score2 = (Local/Global) Score for primary sampling unit [optional]

• Score3 = (Global) Score for observational unit [optional]

• N_Obs = Number of observations, which have gone through the edit round.

• N_Obs_Flagg = Number of error flagged observations in the PSU, on this list

• N_PSU = Number of PSU for the respondent, which have passed the edit round

• N_PSU_Flagg = Number of error flagged primary sampling units, on this list

EditsEditsEDIT GROUP AND ACCEPTANCE REGION

Edit identification Edit groupAcceptance range

EDIT

Edit identification

Type of edit

Active

Section

Internal error message

External error message

Instruction for data review

Un-edited test variable

Error flag

KEY

Edit identification

Survey variable

IMPACT ON STATISTICS

Survey variablePotential impact on statistics

54

3

FLAGGINGOBJECTS

EDIT PRACTICAL SUPPORT

Edit identificationStandard edit ruleEdited test variableSuspicion probability value produced by the SELEKT system

2

1