Investigation of Macro Editing Techniques for Outlier Detection in Survey Data
Katherine Jenny Thompson
Office of Statistical Methods and Research for Economic Programs
Simplified Survey Processing Cycle
• Data Collection / Analyst Review
• Micro-editing and Imputation (Individual Returns)
• Macro-editing (Tabulated Initial Estimates)
• Analyst Investigation and Correction
• Publication Estimates
Identifying Outlying Estimates
• Set of Estimates
– Unknown parametric distribution (robust)
– Contains outliers (resistant)
• Outlier-identification problems (Multiple Outliers)
– Masking: difficult to detect an individual outlier
– Swamping: too many false outliers flagged
Outlier Detection Approaches
• Sets of “bivariate” (Ratio) comparisons
– Same estimate from two consecutive collection periods (historic cell ratios)
– Different estimates in same collection period (current cell ratios)
• Multivariate comparisons
– Current period data
Methods for Bivariate Comparisons
• Resistant Fences Methods
– Symmetrized Resistant Fences
– Asymmetric Fences
• Robust Regression
• Hidiroglou-Berthelot Edit
Bivariate Comparisons (Current Cell Ratios)
• Linear relationship between payroll and employment
• No intercept
Paired Estimates
[Figure: scatter plot of paired estimates; x-axis: Total Employment; y-axis: Annual Payroll]
“Traditional” Ratio Edit (Current Cell Ratio)
[Figure: paired estimates with lower and upper tolerance lines; x-axis: Total Employment; y-axis: Annual Payroll. The acceptance region lies between the two tolerance lines, with an outlier region on either side.]
• “Cone-shaped” tolerances
• Goes through origin
• Strong statistical association
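The cone-shaped acceptance region can be sketched as two lines through the origin. This is a minimal illustration only: the function name, the data, and the tolerance values 30 and 70 are hypothetical and do not come from the presentation.

```python
def ratio_edit(pairs, lower, upper):
    """Flag (employment, payroll) pairs outside the cone lower*x <= y <= upper*x.

    Assumes non-negative employment values; tolerances are supplied by the caller.
    """
    flags = []
    for employment, payroll in pairs:
        inside = lower * employment <= payroll <= upper * employment
        flags.append(not inside)
    return flags


# Hypothetical paired estimates (total employment, annual payroll).
pairs = [(10, 500), (20, 1100), (50, 2400), (100, 9000), (40, 400)]
flags = ratio_edit(pairs, lower=30.0, upper=70.0)
print(flags)  # [False, False, False, True, True]
```

Because both tolerance lines pass through the origin, the acceptance region is narrow for small cells, which is the issue the later "Issue at Origin" slide points to.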
Resistant Fences Methods
[Diagram: quartiles q25 and q75, with fences at q25 − 1.5H and q75 + 1.5H, where H is the interquartile range]
• Different numbers of interquartile ranges (1.5 = Inner, 3 = Outer)
• Implicitly assumes symmetry
• May want to “symmetrize”, apply rule, use inverse transformation
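The fence rule and the "symmetrize, apply rule, invert" idea can be sketched as follows. This is an illustration under assumptions: the log transform is one possible symmetrizing transformation (the slide does not say which one was used), the quartile definition is Python's "inclusive" method, and the data are hypothetical.

```python
import math
import statistics


def resistant_fences(values, k=1.5):
    """Return (lower, upper) fences q25 - k*H and q75 + k*H, with H = q75 - q25."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    h = q75 - q25
    return q25 - k * h, q75 + k * h


def symmetrized_fences(values, k=1.5):
    """Apply the fence rule on the log scale, then map the fences back.

    Assumes strictly positive values; the log transform is an assumed choice.
    """
    lo, hi = resistant_fences([math.log(v) for v in values], k)
    return math.exp(lo), math.exp(hi)


# Hypothetical cell ratios with one large value.
ratios = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 5.0]
lo, hi = resistant_fences(ratios, k=1.5)        # inner fences
slo, shi = symmetrized_fences(ratios, k=1.5)
flagged = [r for r in ratios if r < lo or r > hi]
```

Setting k=3 instead of k=1.5 gives the outer fences.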
Asymmetric Fences Methods
Lower fence: q25 − 3 (m − q25); upper fence: q75 + 3 (q75 − m), where m is the median
• Different numbers of interquartile ranges (3 = Inner, 6 = Outer)
• Incorporates skewness of distribution in outlier rule (“Fences”)
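A small sketch of the asymmetric rule, in which each fence is scaled by the distance from the median to the quartile on its own side. The function name, the quartile definition ("inclusive"), and the data are assumptions for illustration.

```python
import statistics


def asymmetric_fences(values, k=3.0):
    """Return (lower, upper) = (q25 - k*(m - q25), q75 + k*(q75 - m)).

    k=3 corresponds to the inner fences and k=6 to the outer fences.
    """
    q25, m, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25 - k * (m - q25), q75 + k * (q75 - m)


# Hypothetical right-skewed estimates.
values = [1, 2, 3, 4, 6, 9, 14, 40]
lower, upper = asymmetric_fences(values, k=3.0)
flagged = [v for v in values if v < lower or v > upper]
print(lower, upper, flagged)  # -4.0 26.0 [40]
```

With right-skewed data the upper fence sits further from the median than the lower one, so the long tail is not flagged wholesale.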
Robust Regression
• Least Trimmed Squares Robust Regression
• Resistant (fits the bulk of the data by minimizing the sum of the smallest squared residuals)
• Outlier = |residual| ≥ 3 × robust M.S.E.
[Figure: paired estimates with the robust regression line; x-axis: Total Employment; y-axis: Annual Payroll]
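A rough sketch of a least trimmed squares fit for a no-intercept line, followed by a residual-based outlier flag. It is not the presenter's implementation: the coverage fraction of 0.75, the search strategy (starting from each observed ratio and refining with concentration steps), and the uncorrected scale estimate are all assumptions, and the data are hypothetical.

```python
import math


def lts_through_origin(x, y, coverage=0.75, max_steps=50):
    """Fit y = b*x by minimizing the sum of the h smallest squared residuals."""
    n = len(x)
    h = max(2, math.ceil(coverage * n))  # number of points kept in the fit

    def trimmed_objective(b):
        squared = sorted((yi - b * xi) ** 2 for xi, yi in zip(x, y))
        return sum(squared[:h])

    best_b, best_obj = None, math.inf
    for xi, yi in zip(x, y):
        if xi == 0:
            continue
        b = yi / xi  # starting slope from one observed ratio
        for _ in range(max_steps):
            # Concentration step: refit least squares on the h best-fitting points.
            order = sorted(range(n), key=lambda i: (y[i] - b * x[i]) ** 2)
            keep = order[:h]
            new_b = sum(x[i] * y[i] for i in keep) / sum(x[i] ** 2 for i in keep)
            if math.isclose(new_b, b):
                break
            b = new_b
        obj = trimmed_objective(b)
        if obj < best_obj:
            best_b, best_obj = b, obj
    # Crude robust scale from the kept residuals (no consistency correction).
    scale = math.sqrt(best_obj / h)
    return best_b, scale


def flag_outliers(x, y, b, scale, k=3.0):
    """Flag points whose absolute residual exceeds k times the robust scale."""
    return [abs(yi - b * xi) > k * scale for xi, yi in zip(x, y)]


# Hypothetical paired estimates: slope near 50 with one gross outlier.
x = [10, 20, 30, 40, 50, 60, 70, 80]
y = [500, 1010, 1490, 2000, 2510, 2990, 3500, 8000]
b, scale = lts_through_origin(x, y)
flags = flag_outliers(x, y, b, scale)
```

Because the scale is estimated only from the best-fitting points, it tends to be small, which is one way such a rule can flag many points.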
Issue at Origin (Historic Cell Ratio)
[Figure: scatter plot; x-axis: Prior Month's Number of Employees; y-axis: Current Month's Number of Employees]
Hidiroglou-Berthelot (HB) Edit
[Figure: HB effects plotted with upper and lower bounds; x-axis: Employment; y-axis: HB "Effects"]
• Accounts for magnitude of unit (variability at origin)
Hidiroglou-Berthelot (HB) Edit
• Two-step transformation (Ei)
– Centering transformation on ratios
– Magnitude transformation that accounts for the relative importance of large cases
• Asymmetric Fences “Type” Outlier Rule
• Key Parameters
– U = magnitude transformation parameter (0 ≤ U ≤ 1)
– C = controls width of outlier region
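The two-step transformation and the outlier rule can be sketched as below, following the commonly published form of the Hidiroglou-Berthelot edit. Details not on the slide are assumptions: the constant a = 0.05 that guards against very small quartile distances, the "inclusive" quartile definition, the function names, and the data.

```python
import statistics


def hb_effects(prev, curr, u=0.3):
    """Compute HB effects E_i from prior and current values (all positive)."""
    ratios = [c / p for p, c in zip(prev, curr)]
    r_med = statistics.median(ratios)
    effects = []
    for p, c, r in zip(prev, curr, ratios):
        # Step 1: centering transformation, symmetric around the median ratio.
        s = 1 - r_med / r if r < r_med else r / r_med - 1
        # Step 2: magnitude transformation, giving larger units more weight.
        effects.append(s * max(p, c) ** u)
    return effects


def hb_bounds(effects, width=10.0, a=0.05):
    """Return (lower, upper) bounds; width plays the role of the parameter C."""
    q1, med, q3 = statistics.quantiles(effects, n=4, method="inclusive")
    d_lower = max(med - q1, abs(a * med))
    d_upper = max(q3 - med, abs(a * med))
    return med - width * d_lower, med + width * d_upper


# Hypothetical prior and current period employment for eight cells.
prev = [10, 20, 30, 40, 50, 60, 70, 5]
curr = [11, 19, 33, 41, 48, 63, 69, 25]
effects = hb_effects(prev, curr, u=0.3)
lower, upper = hb_bounds(effects, width=10.0)
flags = [e < lower or e > upper for e in effects]
```

With U = 0 the magnitude step has no effect and the rule reduces to fences on the centered ratios; larger U gives more weight to large units.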
Multivariate Methods: Mahalanobis Distance
• Multivariate normal (μ, Σ)
– T(X) estimates μ
– C(X) estimates Σ
– p is the number of distinct variables (items)
$MD_i = (x_i - T(X))^{\top} \, C(X)^{-1} \, (x_i - T(X)) \sim \chi^2_p$
• Prone to masking (difficult to detect individual outliers)
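For intuition, here is a two-variable sketch of the classical (non-robust) distance, with T(X) taken as the sample mean and C(X) as the sample covariance matrix. Restricting to p = 2 is an assumption made so the example needs only the standard library: the 2×2 inverse is written out, and the chi-square quantile with 2 degrees of freedom has the closed form −2 ln(1 − level). The data and the 0.975 level are hypothetical.

```python
import math


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distances for 2-D points, using mean and covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) C^{-1} (dx, dy)^T with the 2x2 inverse written out.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def chi2_cutoff_2df(level=0.975):
    """Chi-square quantile for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - level)


# Hypothetical item estimates (two items per record).
points = [(10, 12), (11, 13), (9, 10), (12, 15), (10, 11), (30, 5)]
md = mahalanobis_sq_2d(points)
cutoff = chi2_cutoff_2df(0.975)
flags = [d > cutoff for d in md]
```

Because the mean and covariance are themselves pulled toward extreme records, the distances of those records shrink, which is the masking problem noted above.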
Robust Alternatives
• M-estimation (not considered)
• “Production Method”
• Minimum Volume Ellipsoid (MVE)
– Resistant (50% breakdown) and robust
• Minimum Covariance Determinant (MCD)
– Resistant (50% breakdown) and robust
• Assumption of Normality
– Log-transformation
Evaluation: Classify Item Estimates
• Input Value: Reported
• Final Value: Tabulated
• Ratio: Input/Final
• Classes: Outlier, Potential Outlier, Not an Outlier
[Figure: histogram; x-axis: Ratio Values; y-axis: Frequency Counts]
Evaluation: Classify Ratios (Bivariate)
• Conservative
– Ratio is “outlier” if numerator or denominator is an outlier
• Anti-Conservative
– Ratio is “outlier” if numerator or denominator is an outlier or a potential outlier
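The two classification rules differ only in which item classes count as "bad". A minimal sketch, in which the string labels and the function name are hypothetical:

```python
def ratio_is_outlier(numerator_class, denominator_class, conservative=True):
    """Classify a ratio from the classes of its two item estimates.

    Classes are the hypothetical labels "outlier", "potential", and "not".
    """
    bad = {"outlier"} if conservative else {"outlier", "potential"}
    return numerator_class in bad or denominator_class in bad


# A ratio whose numerator is only a potential outlier.
print(ratio_is_outlier("potential", "not", conservative=True))   # False
print(ratio_is_outlier("potential", "not", conservative=False))  # True
```

The same pattern extends to records in the multivariate case by checking whether any item in the record falls in the "bad" set.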
Evaluation: Classify Records (Multivariate)
• Conservative
– Record is “outlier” if at least one estimate is an outlier
• Anti-Conservative
– Record is “outlier” if at least one estimate is an outlier or a potential outlier
Evaluation Statistics: Bivariate Comparisons
• Individual Test Level
– Type I Error Rate: proportion of false rejects
– Type II Error Rate: proportion of false accepts
– Hit Rate: proportion of flagged estimates that are outliers
• All-Test Level
– All-item Type II error rate
Evaluation Statistics: Multivariate Comparisons
• Type I error rate: the proportion of non-outlier records that are flagged as outliers
• Type II error rate: the proportion of outlier records that are not flagged as outliers (missed “bad” values)
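These rates, together with the hit rate defined for the bivariate tests, can be computed from two parallel lists: what a method flagged and what the evaluation classified as a true outlier. The function name and the example data are hypothetical.

```python
def evaluation_rates(flagged, is_outlier):
    """Return Type I rate, Type II rate, and hit rate from parallel boolean lists."""
    false_rejects = sum(f and not o for f, o in zip(flagged, is_outlier))
    false_accepts = sum(o and not f for f, o in zip(flagged, is_outlier))
    hits = sum(f and o for f, o in zip(flagged, is_outlier))
    n_outliers = sum(is_outlier)
    n_non_outliers = len(is_outlier) - n_outliers
    n_flagged = sum(flagged)

    def rate(count, total):
        # Undefined when the denominator is empty.
        return count / total if total else float("nan")

    return {
        "type_i": rate(false_rejects, n_non_outliers),   # non-outliers that were flagged
        "type_ii": rate(false_accepts, n_outliers),      # outliers that were missed
        "hit_rate": rate(hits, n_flagged),               # flagged cases that are outliers
    }


# Hypothetical results for six records.
flagged = [True, True, False, False, True, False]
is_outlier = [True, False, True, False, True, False]
rates = evaluation_rates(flagged, is_outlier)
```

In this example one of three non-outliers is flagged, one of three outliers is missed, and two of the three flagged records are outliers.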
Annual Capital Expenditures Survey (ACES)
• Sample Survey (Stratified SRS-WOR)
– ACE-1: Employer companies
– ACE-2: Non-employer companies (not discussed)
• New sample selection each year
• Total and year-to-year change estimates
– Total Capital Expenditures
– Structures (New and Used)
– Equipment (New and Used)
Capital Expenditures Data
• Characterized by
– Low year-to-year correlation (same company)
– Weak association with available auxiliary data
• Editing procedures focus on additivity
• Outlier correction at micro-level
Bivariate Comparisons
• Methods: Robust Regression, Resistant Fences, HB Edit
• Ratios tested: Structures/Total; New Structures/Structures; New Structures/Used Structures; Equipment/Total; New Equipment/Equipment
• Resistant Fences: (Symmetric or Asymmetric) (Inner or Outer)
• HB Edit: (U = 0.3 or 0.5) (C = 10 or 20)
Results – Individual Tests
• Robust Regression prone to swamping
– High Type I error rate (false rejects)
• Comparable performance with Asymmetric Inner Fences and HB Edit (U = 0.3, C = 10)
– Low Type I error rates
– High Hit Rates
– High Type II error rates
• Other variations of Resistant Fences and HB edit not as good
Results – All-Tests
• Very large Type II error rates (approx. 50%)
– Robust regression
– Symmetric resistant outer fences
– HB edit with C = 20
• Improved Type II error rates (30% - 40%)
– Asymmetric inner fences
– HB edit (U = 0.3, C = 10)
Multivariate Results
• Original Data: considered methods ineffective
• Log-transformed data: improved performance (MCD and MVE)
– Reduced Type I error rates
– Comparable Type II error rates (to original-data MCD and MVE)
Conservative Results: 2002
[Figure: bar chart of Type I and Type II error rates (scale 0 to 1) for Production-MD, MCD (original), MVE (original), MCD (log-transformed), and MVE (log-transformed)]
Multivariate Versus Bivariate: Different Outcomes (Conservative)
Combined HB edits flag more “outliers”:
– Higher Type I error rate
– Lower Type II error rates for the complete set of HB edits
Counts of Flagged Non-Outliers: Type I Errors (False Rejects)
[Figure: bar chart comparing HB and MVE for 2002 and 2003; data labels as extracted: 8, 0, 11, 4]
Counts of Missed Outliers: Type II Errors (False Accepts)
[Figure: bar chart comparing HB and MVE for 2002 and 2003; data labels as extracted: 13, 14, 0, 4]
Comments
• Economic data with inconsistent statistical association between items in each collection period
• Critical values must be determined by the data set at hand (no “hard-coding”)
• Dynamically
– Standardize the comparisons (HB edit, log transformation)
– Compute outlier limits
• Could try hybrid approach:
– Multivariate method plus a few current cell ratio tests with the HB edit
– Perform all bivariate tests, but unduplicate cells before analyst review
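The "unduplicate cells" step of the hybrid idea might look like the sketch below: each cell flagged by any bivariate test appears once in the review list, along with the tests that flagged it. The cell identifiers, test names, and function name are hypothetical.

```python
def cells_for_review(flags_by_test):
    """Merge per-test flag lists into one entry per cell, listing the tests that flagged it."""
    review = {}
    for test_name, flagged_cells in flags_by_test.items():
        for cell in flagged_cells:
            review.setdefault(cell, []).append(test_name)
    return review


# Hypothetical output from three ratio tests.
flags_by_test = {
    "Structures/Total": ["cell_12", "cell_40"],
    "Equipment/Total": ["cell_12"],
    "New Equipment/Equipment": ["cell_07", "cell_12"],
}
review = cells_for_review(flags_by_test)
print(len(review))  # 3
```

An analyst would then see three cells rather than five separate flags, with cell_12 marked as failing all three tests.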
Final Thoughts/Next Steps
• Examined one set of economic data and considered only two separate collections from this program
• Extrapolation would be foolish
• My results need to be validated on other economic data sets
– a more typical periodic business survey and/or
– a well-constructed simulation study
Any Questions?
Katherine Jenny Thompson
[email protected]