Automatic Editing with Soft Edits
Sander Scholtus (Statistics Netherlands)
Automatic editing
• Goal: Detect and correct errors and missing values without human intervention
• Data is made consistent with respect to a set of edits
• Two steps:
  • detecting erroneous and missing values (error localisation)
  • imputation of new values
Automatic editing (2)
• Fellegi-Holt paradigm for error localisation: Find the smallest subset of the variables that can be imputed to satisfy all edits
• Generalised version uses confidence weights
• At Statistics Netherlands: SLICE software
SLICE
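• Branch-and-bound algorithm:
[Figure: binary search tree. The root node x1 has two branches, "x1 erroneous" and "x1 correct". Each x2 node below it has two branches, "x2 erroneous" and "x2 correct", leading to four x3 nodes.]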
SLICE
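• Branch-and-bound algorithm:
[Figure: binary search tree. The root node x1 has two branches, "eliminate x1" and "fix x1". Each x2 node below it has two branches, "eliminate x2" and "fix x2", leading to four x3 nodes.]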
SLICE (2)
• Leaf nodes of the tree:
  • all variables have been either fixed or eliminated
  • interpretation: eliminated variables are incorrect
• Associated sets of edits:
  • contain no variables
  • either empty or contain only trivial statements
• Theorem (De Waal and Quere, 2003): A leaf node corresponds to a feasible solution of the error localisation problem if and only if the associated set of edits contains no contradictions
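The tree search described on these slides can be sketched roughly as follows. This is a minimal illustration, not the SLICE implementation: `branch_and_bound`, `is_feasible`, and the toy weights are hypothetical names and values, and `is_feasible` merely stands in for the check that the edits associated with a leaf contain no contradictions.

```python
def branch_and_bound(variables, weights, is_feasible):
    """Depth-first search over eliminate/fix decisions.

    is_feasible(eliminated) is a hypothetical callback standing in for the
    contradiction check on the edits that remain at a leaf node.
    """
    best = {"cost": float("inf"), "eliminated": None}

    def visit(i, eliminated, cost):
        if cost >= best["cost"]:
            return  # bound: this branch cannot beat the best leaf found so far
        if i == len(variables):
            if is_feasible(eliminated):
                best["cost"] = cost
                best["eliminated"] = set(eliminated)
            return
        v = variables[i]
        visit(i + 1, eliminated | {v}, cost + weights[v])  # eliminate v (treat as erroneous)
        visit(i + 1, eliminated, cost)                     # fix v (treat as correct)

    visit(0, frozenset(), 0)
    return best["eliminated"], best["cost"]


# Toy feasibility rule (an assumption for illustration): the record can only be
# repaired if x2 is among the eliminated variables.
result = branch_and_bound(
    ["x1", "x2", "x3"],
    {"x1": 1, "x2": 2, "x3": 1},
    lambda eliminated: "x2" in eliminated,
)
print(result)  # ({'x2'}, 2)
```

Each leaf corresponds to one subset of eliminated variables, so the search returns the feasible subset with the smallest total weight.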
SLICE (3)
• Application of SLICE in the production process:
  • automatic editing of micro data for the Dutch structural business statistics
  • approximately 100 variables and 100 edits
  • evaluation studies: sometimes large differences between automatic and manual editing
Hard edits and soft edits
• Examples of edits:
  1. Profit = Turnover – Costs
  2. Profit < 0.6 × Turnover
• First example:
  • hard edit
  • has to hold by definition
• Second example:
  • soft edit
  • can also be failed by correct values
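The difference between the two example edits can be made concrete with a small check. This is only an illustrative sketch; the function name `check_edits` and the sample records are assumptions, not part of the presentation.

```python
def check_edits(turnover, profit, costs):
    """Return (hard_ok, soft_ok) for the two example edits."""
    hard_ok = profit == turnover - costs   # edit 1: has to hold by definition
    soft_ok = profit < 0.6 * turnover      # edit 2: usually holds, but not always
    return hard_ok, soft_ok


print(check_edits(100, 40, 60))  # (True, True)
print(check_edits(100, 70, 30))  # (True, False)
```

The second record is internally consistent and may well be correct, yet it fails the soft edit, so it is suspicious rather than demonstrably wrong.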
Hard edits and soft edits (2)
• Manual editing uses both hard and soft edits
• Current methods for automatic editing can only handle hard edits
• Practical solutions:
  • ignore all soft edits
  • treat soft edits as hard edits
• Can this be improved?
Error localisation with soft edits
• Current error localisation problem: Minimise, among subsets of variables that can be imputed to satisfy all edits, the sum of the confidence weights
• Suggested new error localisation problem: Minimise, among subsets of variables that can be imputed to satisfy all hard edits, the sum of the confidence weights plus a cost term for failed soft edits
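One way to write the two problems down formally is sketched here. The notation is an assumption introduced for illustration and does not come from the slides: y_j in {0, 1} marks variable j as imputed, w_j is its confidence weight, z_k in {0, 1} marks soft edit k as failed by the imputed record, and s_k is the cost attached to that failure. An additive cost per failed soft edit is only one possible choice for the cost term.

```latex
% Current problem: every edit is treated as a constraint.
\min_{y} \; \sum_{j} w_j \, y_j
\quad \text{s.t. the variables with } y_j = 1
\text{ can be imputed so that all edits hold}

% Suggested problem: only hard edits are constraints,
% failed soft edits are penalised in the objective.
\min_{y,\,z} \; \sum_{j} w_j \, y_j + \sum_{k} s_k \, z_k
\quad \text{s.t. the variables with } y_j = 1
\text{ can be imputed so that all hard edits hold}
```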
Error localisation with soft edits (2)
• The new error localisation problem can be solved by an extended version of the SLICE algorithm
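[Figure: binary search tree. The root node x1 has two branches, "eliminate x1" and "fix x1". Each x2 node below it has two branches, "eliminate x2" and "fix x2", leading to four x3 nodes.]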
Example
• Variables: Turnover (T), Profit (P), Costs (C), Number of Employees (N)
• Edits:
  • Hard edits: T - C - P = 0; T ≥ 0; C ≥ 0; N ≥ 0; 550N - T ≥ 0
  • Soft edits: 0.5T - P ≥ 0; P + 0.1T ≥ 0
• Confidence weights: Turnover: 2; Profit: 1; Costs: 1; Number of Employees: 3
• Contribution of each failed soft edit: 2
Example (2)
• Original data and edits: T = 100; P = 40000; C = 60000; N = 5
  • Hard edits: T - C - P = 0; T ≥ 0; C ≥ 0; N ≥ 0; 550N - T ≥ 0
  • Soft edits: 0.5T - P ≥ 0; P + 0.1T ≥ 0
Example (3)
• Original data and edits: T = 100; P = 40000; C = 60000; N = 5
  • Hard edits: T - C - P = 0; T ≥ 0; C ≥ 0; N ≥ 0; 550N - T ≥ 0
  • Soft edits: 0.5T - P ≥ 0; P + 0.1T ≥ 0
• Eliminate P from the original edits:
  • Implied hard edits: T ≥ 0; C ≥ 0; N ≥ 0; 550N - T ≥ 0
  • Implied soft edits: 0.5T - C ≤ 0; 1.1T - C ≥ 0
Example (4)
• According to the theory, P can be imputed to satisfy all hard edits, but the second soft edit is failed
• Imputing only P is a feasible solution to the error localisation problem
• The value of the target function equals 1 + 2 = 3
Example (5)
• Data and edits after eliminating P: T = 100; C = 60000; N = 5
  • Implied hard edits: T ≥ 0; C ≥ 0; N ≥ 0; 550N - T ≥ 0
  • Implied soft edits: 0.5T - C ≤ 0; 1.1T - C ≥ 0
• Eliminate C from these edits:
  • Implied hard edits: T ≥ 0; N ≥ 0; 550N - T ≥ 0
  • Implied soft edits: 1.1T ≥ 0; 0.6T ≥ 0
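The two elimination steps of the example can be reproduced with a short script. The slides do not name the elimination technique, so this sketch assumes substitution through an equality where one is available and Fourier-Motzkin elimination otherwise. It also assumes the hard edits T - C - P = 0, T ≥ 0, C ≥ 0, N ≥ 0, 550N - T ≥ 0 and the soft edits 0.5T - P ≥ 0, P + 0.1T ≥ 0. All function names are hypothetical, and the rule that an implied edit is soft as soon as one of its parents is soft is an assumption made for this illustration.

```python
from fractions import Fraction as F

# An edit is (coeffs, op, kind) and reads: sum(coeffs[v] * v) op 0,
# with op in {"==", ">="} and kind in {"hard", "soft"}. No constant terms.

def combine(a, ma, b, mb):
    """Coefficients of ma*a + mb*b, with zero terms dropped."""
    out = {}
    for v in sorted(set(a) | set(b)):
        c = ma * a.get(v, 0) + mb * b.get(v, 0)
        if c != 0:
            out[v] = c
    return out

def merged_kind(k1, k2):
    # Assumption: an implied edit is soft as soon as one parent is soft.
    return "soft" if "soft" in (k1, k2) else "hard"

def eliminate(edits, var):
    """Return the edits implied by `edits` once `var` is eliminated."""
    equalities = [e for e in edits if e[1] == "==" and e[0].get(var, 0) != 0]
    if equalities:
        # Substitute var using the first equality that contains it.
        eq = equalities[0]
        out = []
        for e in edits:
            if e is eq:
                continue
            c = e[0].get(var, 0)
            if c == 0:
                out.append(e)
            else:
                factor = -F(c) / F(eq[0][var])
                out.append((combine(e[0], 1, eq[0], factor), e[1],
                            merged_kind(e[2], eq[2])))
        return out
    # Fourier-Motzkin: pair every lower bound on var with every upper bound.
    out = [e for e in edits if e[0].get(var, 0) == 0]
    lower = [e for e in edits if e[0].get(var, 0) > 0]
    upper = [e for e in edits if e[0].get(var, 0) < 0]
    for lo in lower:
        for up in upper:
            coeffs = combine(lo[0], -up[0][var], up[0], lo[0][var])
            out.append((coeffs, ">=", merged_kind(lo[2], up[2])))
    return out

def failed(edits, values):
    """Kinds of the edits that the given values violate."""
    bad = []
    for coeffs, op, kind in edits:
        total = sum(c * values[v] for v, c in coeffs.items())
        if (op == "==" and total != 0) or (op == ">=" and total < 0):
            bad.append(kind)
    return bad

edits = [
    ({"T": 1, "C": -1, "P": -1}, "==", "hard"),
    ({"T": 1}, ">=", "hard"),
    ({"C": 1}, ">=", "hard"),
    ({"N": 1}, ">=", "hard"),
    ({"N": 550, "T": -1}, ">=", "hard"),
    ({"T": F(1, 2), "P": -1}, ">=", "soft"),
    ({"P": 1, "T": F(1, 10)}, ">=", "soft"),
]
values = {"T": 100, "C": 60000, "N": 5}

without_p = eliminate(edits, "P")
without_pc = eliminate(without_p, "C")
print(failed(without_p, values))   # ['soft']
print(failed(without_pc, values))  # []
```

After eliminating P, the remaining values fail exactly one implied soft edit (1.1T - C ≥ 0). After also eliminating C, no implied edit is failed. The script writes the first implied soft edit as C - 0.5T ≥ 0, which is the same condition as 0.5T - C ≤ 0.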
Example (6)
• According to the theory, P and C can be imputed to satisfy all hard and soft edits
• Imputing P and C is a feasible solution to the error localisation problem
• The value of the target function equals 1 + 1 = 2
• This turns out to be the optimal solution
• Possible corrected version of the record: T = 100; P = 40; C = 60; N = 5
Example (7)
• Imputing only P is the optimal solution if the soft edits are ignored
• Corrected version of the record: T = 100; P = -59900; C = 60000; N = 5
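The target values quoted in the example can be checked by evaluating the candidate records directly. This sketch assumes the hard edits T - C - P = 0, T ≥ 0, C ≥ 0, N ≥ 0, 550N - T ≥ 0 and the soft edits 0.5T - P ≥ 0, P + 0.1T ≥ 0. The helper names are hypothetical. Note that `target` scores one given corrected record; it does not search for the best imputation.

```python
WEIGHTS = {"T": 2, "P": 1, "C": 1, "N": 3}  # confidence weights from the example
SOFT_COST = 2                               # contribution of each failed soft edit

def hard_edits_hold(r):
    return (r["T"] - r["C"] - r["P"] == 0 and r["T"] >= 0 and r["C"] >= 0
            and r["N"] >= 0 and 550 * r["N"] - r["T"] >= 0)

def failed_soft_edits(r):
    checks = [0.5 * r["T"] - r["P"] >= 0, r["P"] + 0.1 * r["T"] >= 0]
    return sum(1 for ok in checks if not ok)

def target(original, corrected, use_soft=True):
    """Weights of the changed variables plus soft-edit costs, or None if infeasible."""
    if not hard_edits_hold(corrected):
        return None
    changed = [v for v in original if original[v] != corrected[v]]
    cost = sum(WEIGHTS[v] for v in changed)
    if use_soft:
        cost += SOFT_COST * failed_soft_edits(corrected)
    return cost

original = {"T": 100, "P": 40000, "C": 60000, "N": 5}
only_p = {"T": 100, "P": -59900, "C": 60000, "N": 5}
p_and_c = {"T": 100, "P": 40, "C": 60, "N": 5}

print(target(original, original))                 # None
print(target(original, only_p))                   # 3
print(target(original, p_and_c))                  # 2
print(target(original, only_p, use_soft=False))   # 1
print(target(original, p_and_c, use_soft=False))  # 2
```

With the soft edits included, imputing P and C (value 2) beats imputing only P (value 3). With the soft edits ignored, imputing only P is cheaper (1 against 2), which matches the slide.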
Discussion
• Future work:
  • Implementation of the algorithm in R (in progress)
  • Test on realistic data (Dutch structural business statistics)
  • How to model the costs of failed soft edits
Thank you for your attention!