Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Katherine Jenny Thompson

Office of Statistical Methods and Research for Economic Programs

Simplified Survey Processing Cycle

• Data Collection/Analyst Review
• Micro-editing and Imputation (Individual Returns)
• Macro-editing (Tabulated Initial Estimates)
• Analyst Investigation and Correction
• Publication Estimates

Identifying Outlying Estimates

• Set of Estimates
– Unknown parametric distribution (robust)
– Contains outliers (resistant)
• Outlier-identification problems (Multiple Outliers)
– Masking: difficult to detect an individual outlier
– Swamping: too many false outliers flagged

Outlier Detection Approaches

• Sets of “bivariate” (Ratio) comparisons
– Same estimate from two consecutive collection periods (historic cell ratios)
– Different estimates in same collection period (current cell ratios)
• Multivariate comparisons
– Current period data

Methods for Bivariate Comparisons

• Resistant Fences Methods
– Symmetrized Resistant Fences
– Asymmetric Fences
• Robust Regression
• Hidiroglou-Berthelot Edit

Bivariate Comparisons (Current Cell Ratios)

• Linear relationship between payroll and employment
• No intercept

[Figure: Paired Estimates. Scatter plot of Annual Payroll (y-axis) against Total Employment (x-axis).]

“Traditional” Ratio Edit (Current Cell Ratio)

[Figure: scatter plot of Annual Payroll (y-axis) against Total Employment (x-axis) showing the paired estimates with lower and upper tolerance lines. The area between the tolerance lines is labeled Acceptance Region; the areas outside them are labeled Outlier Region.]

• “Cone-shaped” tolerances
• Goes through origin
• Strong statistical association
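A traditional ratio edit of this kind can be sketched in a few lines of Python. The tolerance values and the paired estimates below are hypothetical, chosen only to illustrate the cone-shaped acceptance region.

```python
def ratio_edit(pairs, lower, upper):
    """Flag (employment, payroll) pairs whose ratio falls outside [lower, upper].

    The acceptance region lower*x <= y <= upper*x is a cone through the origin.
    """
    flags = []
    for employment, payroll in pairs:
        if employment <= 0:
            flags.append(True)  # ratio undefined, so refer the cell to an analyst
            continue
        ratio = payroll / employment
        flags.append(not (lower <= ratio <= upper))
    return flags


# Hypothetical paired estimates: (total employment, annual payroll)
cells = [(10, 450), (40, 2100), (80, 4000), (100, 900), (20, 3000)]
flags = ratio_edit(cells, lower=30.0, upper=70.0)
```

Because the limits are fixed multiples of the denominator, the acceptance band is narrow near the origin and wide for large cells.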

Resistant Fences Methods

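[Figure: number line marking the quartiles q25 and q75 and the fences q25 − 1.5H and q75 + 1.5H, where H = q75 − q25 is the interquartile range.]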

• Different numbers of interquartile ranges (1.5 = Inner, 3 = Outer)

• Implicitly assumes symmetry

• May want to “symmetrize”, apply rule, use inverse transformation
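A minimal sketch of the resistant fences rule follows. The log transformation used to symmetrize is an assumption made for illustration (the slides do not name the transformation), and the ratios are hypothetical.

```python
import math
import statistics


def resistant_fences(values, k=1.5):
    """Return (lower, upper) fences q25 - k*H and q75 + k*H, with H = q75 - q25."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    h = q75 - q25
    return q25 - k * h, q75 + k * h


def flag_outliers(values, k=1.5, symmetrize=False):
    """Flag values outside the fences, optionally on the log scale.

    Comparing log(values) with fences computed on the log scale is equivalent
    to back-transforming those fences with exp() (the inverse transformation).
    """
    work = [math.log(v) for v in values] if symmetrize else list(values)
    lower, upper = resistant_fences(work, k)
    return [not (lower <= w <= upper) for w in work]


# Hypothetical ratios with one large value
ratios = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 6.0]
inner = flag_outliers(ratios, k=1.5)                    # inner fences
outer = flag_outliers(ratios, k=3.0, symmetrize=True)   # outer fences, log scale
```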

Asymmetric Fences Methods

Lower fence: q25 − 3 (m − q25); upper fence: q75 + 3 (q75 − m), where m is the median

• Different numbers of interquartile ranges (3 = Inner, 6 = Outer)

• Incorporates skewness of distribution in outlier rule (“Fences”)
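The asymmetric rule can be sketched the same way. The function name and the sample ratios are hypothetical; k = 3 corresponds to the inner fences and k = 6 to the outer fences.

```python
import statistics


def asymmetric_fences(values, k=3.0):
    """Return (lower, upper) = (q25 - k*(m - q25), q75 + k*(q75 - m))."""
    q25, m, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25 - k * (m - q25), q75 + k * (q75 - m)


# Hypothetical right-skewed ratios
ratios = [1.0, 1.1, 1.2, 1.3, 1.5, 1.8, 2.2, 3.0, 9.0]
lower, upper = asymmetric_fences(ratios, k=3.0)
flagged = [r for r in ratios if r < lower or r > upper]
```

With right-skewed data the upper quartile sits farther from the median than the lower quartile does, so the upper fence is placed farther out than the lower one.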

Robust Regression

• Least Trimmed Squares Robust Regression
• Resistant (minimizes median residual)
• Outlier: |residual| greater than 3 robust M.S.E.

[Figure: scatter plot of Annual Payroll (y-axis) against Total Employment (x-axis) showing the paired estimates and the robust regression line.]
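A rough sketch of a trimmed, no-intercept fit is given below. It is an illustration under stated assumptions, not the production procedure: the fit uses simple concentration steps on the half of the observations with the smallest squared residuals, the scale is the trimmed mean squared error without any consistency correction, and the cutoff of 3 times its square root is one reading of the outlier rule. The data are hypothetical.

```python
def lts_through_origin(x, y, max_steps=50):
    """Least trimmed squares fit of y = b*x (no intercept) via concentration steps."""
    n = len(x)
    h = (n + 2) // 2  # size of the subset that is actually fitted
    best_slope, best_obj = None, float("inf")
    for xi, yi in zip(x, y):
        if xi == 0:
            continue
        slope = yi / xi  # start from the line through one observation
        obj = float("inf")
        for _ in range(max_steps):
            # keep the h observations with the smallest squared residuals
            order = sorted(range(n), key=lambda i: (y[i] - slope * x[i]) ** 2)
            keep = order[:h]
            new_slope = sum(x[i] * y[i] for i in keep) / sum(x[i] ** 2 for i in keep)
            new_obj = sum((y[i] - new_slope * x[i]) ** 2 for i in keep)
            if new_obj >= obj:
                break  # no further improvement from this start
            slope, obj = new_slope, new_obj
        if obj < best_obj:
            best_slope, best_obj = slope, obj
    return best_slope, best_obj / h  # slope and trimmed mean squared error


# Hypothetical cells: the last payroll value is far below the general trend
employment = [10, 20, 40, 60, 80, 100, 120]
payroll = [500, 1010, 1980, 3050, 3990, 5020, 2000]
slope, robust_mse = lts_through_origin(employment, payroll)
cutoff = 3 * robust_mse ** 0.5  # assumed form of the outlier rule
flags = [abs(yi - slope * xi) > cutoff for xi, yi in zip(employment, payroll)]
```

Because the scale comes only from the best-fitting half of the data, the cutoff tends to be tight, so cells that are only mildly off the line can be flagged as well.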

Issue at Origin (Historic Cell Ratio)

[Figure: scatter plot of Current Month's Number of Employees (y-axis) against Prior Month's Number of Employees (x-axis).]

Hidiroglou-Berthelot (HB) Edit

[Figure: HB "Effects" (y-axis) plotted against Employment (x-axis), with upper and lower bounds.]

• Accounts for magnitude of unit (variability at origin)

Hidiroglou-Berthelot (HB) Edit

• Two-step transformation (Ei)
– Centering transformation on ratios
– Magnitude transformation that accounts for the relative importance of large cases
• Asymmetric Fences “Type” Outlier Rule
• Key Parameters
– U = magnitude transformation parameter (0 ≤ U ≤ 1)
– C = controls width of outlier region
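The two-step transformation and the outlier rule can be sketched as follows. This is a minimal illustration assuming strictly positive values; the function name is hypothetical, the floor parameter a = 0.05 is an assumed default that does not appear on the slides, and the data are made up.

```python
import statistics


def hb_edit(prior, current, u=0.3, c=10.0, a=0.05):
    """Return (effects, lower, upper, flags) for the Hidiroglou-Berthelot edit."""
    ratios = [cur / pri for pri, cur in zip(prior, current)]
    r_med = statistics.median(ratios)

    # Step 1: centering transformation, symmetric around zero
    centered = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]

    # Step 2: magnitude transformation, giving large units more weight
    effects = [s * max(pri, cur) ** u
               for s, pri, cur in zip(centered, prior, current)]

    e_q1, e_med, e_q3 = statistics.quantiles(effects, n=4, method="inclusive")
    # The a*|median| floor keeps the limits from collapsing when effects cluster
    d_q1 = max(e_med - e_q1, abs(a * e_med))
    d_q3 = max(e_q3 - e_med, abs(a * e_med))
    lower, upper = e_med - c * d_q1, e_med + c * d_q3
    flags = [e < lower or e > upper for e in effects]
    return effects, lower, upper, flags


# Hypothetical prior- and current-period values for eight cells
prior = [10, 12, 50, 80, 100, 200, 5, 150]
current = [11, 12, 52, 79, 103, 198, 6, 900]
effects, lower, upper, flags = hb_edit(prior, current, u=0.3, c=10.0)
```

Setting U = 0 reduces the effects to the centered ratios alone, while larger U lets the size of the unit carry more weight; C scales the distance from the median effect to each limit.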

Multivariate Methods: Mahalanobis Distance

• Multivariate normal (μ, Σ)
– T(X) estimates μ
– C(X) estimates Σ
– p is the number of distinct variables (items)
• Prone to masking (difficult to detect individual outliers)

$$MD_i = (x_i - T(X))^{\top} \, C(X)^{-1} \, (x_i - T(X)) \sim \chi^2_p$$

Robust Alternatives

• M-estimation (not considered)
• “Production Method”
• Minimum Volume Ellipsoid (MVE)
– Resistant (50% breakdown) and robust
• Minimum Covariance Determinant (MCD)
– Resistant (50% breakdown) and robust
• Assumption of Normality
– Log-transformation
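The distance-based rule can be sketched for records with two items. This is only an illustration: T(X) and C(X) are the classical mean and covariance here, so it shows the non-robust baseline rather than MVE or MCD, and the function name, the 0.025 tail probability, and the records are assumptions.

```python
import math


def mahalanobis_flags(records, alpha=0.025, log_transform=True):
    """Mahalanobis quadratic forms for two-item records, with chi-square flags.

    T(X) and C(X) are the classical mean vector and covariance matrix here.
    """
    data = [(math.log(a), math.log(b)) if log_transform else (a, b)
            for a, b in records]
    n = len(data)
    mean_a = sum(a for a, _ in data) / n
    mean_b = sum(b for _, b in data) / n
    s_aa = sum((a - mean_a) ** 2 for a, _ in data) / (n - 1)
    s_bb = sum((b - mean_b) ** 2 for _, b in data) / (n - 1)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in data) / (n - 1)
    det = s_aa * s_bb - s_ab ** 2

    distances = []
    for a, b in data:
        da, db = a - mean_a, b - mean_b
        # (x - T)' C^{-1} (x - T) with the 2x2 inverse written out
        distances.append((s_bb * da * da - 2 * s_ab * da * db + s_aa * db * db) / det)

    # For p = 2 the chi-square upper quantile has the closed form -2*ln(alpha)
    cutoff = -2.0 * math.log(alpha)
    return distances, cutoff, [d > cutoff for d in distances]


# Hypothetical records: (total expenditures, structures)
records = [(100, 40), (200, 90), (150, 60), (400, 150), (250, 110), (300, 5000)]
distances, cutoff, flags = mahalanobis_flags(records)
```

With classical estimates a single extreme record inflates C(X), and in a sample of n records no distance can exceed (n − 1)²/n, which is one way masking shows up in small data sets.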

Evaluation: Classify Item Estimates

• Input Value: Reported
• Final Value: Tabulated
• Ratio: Input/Final
• Classification: Outlier, Potential Outlier, Not an Outlier

[Figure: histogram of Ratio Values (x-axis) against Frequency Counts (y-axis).]

Evaluation: Classify Ratios (Bivariate)

• Conservative
– Ratio is “outlier” if numerator or denominator is an outlier
• Anti-Conservative
– Ratio is “outlier” if numerator or denominator is an outlier or a potential outlier

Evaluation: Classify Records (Multivariate)

• Conservative
– Record is “outlier” if at least one estimate is an outlier
• Anti-Conservative
– Record is “outlier” if at least one estimate is an outlier or a potential outlier
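Both classification rules reduce to the same check on the item-level classes, which can be sketched as follows. The labels and the helper name are hypothetical.

```python
OUTLIER, POTENTIAL, NOT_OUTLIER = "outlier", "potential outlier", "not an outlier"


def is_outlier_group(item_classes, conservative=True):
    """Classify a ratio (two items) or a record (many items) from its item classes."""
    bad = {OUTLIER} if conservative else {OUTLIER, POTENTIAL}
    return any(c in bad for c in item_classes)


# A ratio is judged from its numerator and denominator classes
ratio_items = (POTENTIAL, NOT_OUTLIER)
# A record is judged from all of its item estimates
record_items = (NOT_OUTLIER, NOT_OUTLIER, OUTLIER, POTENTIAL)

ratio_conservative = is_outlier_group(ratio_items, conservative=True)
ratio_anti = is_outlier_group(ratio_items, conservative=False)
record_conservative = is_outlier_group(record_items, conservative=True)
```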

Evaluation Statistics: Bivariate Comparisons

• Individual Test Level
– Type I Error Rate: proportion of false rejects
– Type II Error Rate: proportion of false accepts
– Hit Rate: proportion of flagged estimates that are outliers
• All-Test Level
– All-item Type II error rate

Evaluation Statistics: Multivariate Comparisons

• Type I error rate: the proportion of non-outlier records that are flagged as outliers

• Type II error rate: the proportion of outlier records that are not flagged as outliers (missed “bad” values)
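Given the evaluation classification and a method's flags, the error rates and the hit rate can be computed as in this sketch. The function name and the example vectors are hypothetical.

```python
def evaluation_rates(is_outlier, is_flagged):
    """Return (type_i, type_ii, hit_rate) from parallel lists of booleans."""
    pairs = list(zip(is_outlier, is_flagged))
    false_rejects = sum(1 for truth, flag in pairs if flag and not truth)
    false_accepts = sum(1 for truth, flag in pairs if truth and not flag)
    hits = sum(1 for truth, flag in pairs if truth and flag)
    n_outliers = sum(1 for truth, _ in pairs if truth)
    n_clean = len(pairs) - n_outliers
    n_flagged = hits + false_rejects

    def rate(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return (rate(false_rejects, n_clean),    # non-outliers that were flagged
            rate(false_accepts, n_outliers),  # outliers that were missed
            rate(hits, n_flagged))            # flagged cases that are outliers


truth = [True, True, False, False, False, False, True, False]
flags = [True, False, False, True, False, False, True, False]
type_i, type_ii, hit_rate = evaluation_rates(truth, flags)
```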

Annual Capital Expenditures Survey (ACES)

• Sample Survey (Stratified SRS-WOR)
– ACE-1: Employer companies
– ACE-2: Non-employer companies (not discussed)
• New sample selection each year
• Total and year-to-year change estimates
– Total Capital Expenditures
– Structures (New and Used)
– Equipment (New and Used)

Capital Expenditures Data

• Characterized by
– Low year-to-year correlation (same company)
– Weak association with available auxiliary data
• Editing procedures focus on additivity
• Outlier correction at micro-level

Bivariate Comparisons

• Methods
– Robust Regression
– Resistant Fences: (Symmetric or Asymmetric) (Inner or Outer)
– HB Edit: (U = 0.3 or 0.5) (C = 10 or 20)
• Ratio tests
– Structures/Total
– New Structures/Structures
– New Structures/Used Structures
– Equipment/Total
– New Equipment/Equipment

Results – Individual Tests

• Robust Regression prone to swamping
– High Type I error rate (false rejects)
• Comparable performance with Asymmetric Inner Fences and HB Edit (U = 0.3, C = 10)
– Low Type I error rates
– High Hit Rates
– High Type II error rates
• Other variations of Resistant Fences and HB Edit not as good

Results – All-Tests

• Very large Type II error rates (approx. 50%)
– Robust regression
– Symmetric resistant outer fences
– HB edit with C = 20
• Improved Type II error rates (30% - 40%)
– Asymmetric inner fences
– HB edit (U = 0.3, C = 10)

Multivariate Results

• Original Data: considered methods ineffective
• Log-transformed data: improved performance (MCD and MVE)
– Reduced Type I error rates
– Comparable Type II error rates (to original-data MCD and MVE)

Conservative Results: 2002

[Figure: bar chart of Type I and Type II error rates (scale 0 to 1) for Production-MD, MCD (original), MVE (original), MCD (log-transformed), and MVE (log-transformed).]

Multivariate Versus Bivariate: Different Outcomes (Conservative)

Combined HB edits flag more “outliers”:
– Higher Type I error rate
– Lower Type II error rates for the complete set of HB edits

[Figure: bar chart, "Counts of Flagged Non-Outliers: Type I Errors (False Rejects)", for HB and MVE in 2002 and 2003. Data labels: 8, 0, 11, 4.]

[Figure: bar chart, "Counts of Missed Outliers: Type II Errors (False Accepts)", for HB and MVE in 2002 and 2003. Data labels: 13, 14, 0, 4.]

Comments

• Economic data with inconsistent statistical association between items in each collection period
• Critical values must be determined by the data set at hand (no “hard-coding”)
• Dynamically
– Standardize the comparisons (HB edit, log transformation)
– Compute outlier limits
• Could try hybrid approach:
– Multivariate method plus a few current cell ratio tests with the HB edit
– Perform all bivariate tests, but unduplicate cells before analyst review

Final Thoughts/Next Steps

• Examined one set of economic data and considered only two separate collections from this program
• Extrapolation would be foolish
• My results need to be validated on other economic data sets
– a more typical periodic business survey and/or
– a well-constructed simulation study

Any Questions?

Katherine Jenny Thompson
