
Page 1: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Katherine Jenny Thompson
Office of Statistical Methods and Research for Economic Programs

Page 2: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Simplified Survey Processing Cycle

• Data Collection/Analyst Review
• Micro-editing and Imputation (Individual Returns)
• Macro-editing (Tabulated Initial Estimates)
• Analyst Investigation and Correction
• Publication Estimates

Page 3: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Identifying Outlying Estimates

• Set of Estimates
– Unknown parametric distribution (robust)
– Contains outliers (resistant)
• Outlier-identification problems (Multiple Outliers)
– Masking: difficult to detect an individual outlier
– Swamping: too many false outliers flagged

Page 4: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Outlier Detection Approaches

• Sets of “bivariate” (Ratio) comparisons
– Same estimate from two consecutive collection periods (historic cell ratios)
– Different estimates in same collection period (current cell ratios)
• Multivariate comparisons
– Current period data

Page 5: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Method for Bivariate Comparisons

• Resistant Fences Methods
– Symmetrized Resistant Fences
– Asymmetric Fences
• Robust Regression
• Hidiroglou-Berthelot Edit

Page 6: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Bivariate Comparisons (Current Cell Ratios)

• Linear relationship between payroll and employment
• No intercept

[Figure: scatter plot of Paired Estimates; x-axis: Total Employment, y-axis: Annual Payroll]

Page 7: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

“Traditional” Ratio Edit (Current Cell Ratio)

[Figure: scatter plot of Paired Estimates with Lower Tolerance and Upper Tolerance lines; x-axis: Total Employment, y-axis: Annual Payroll. The Acceptance Region lies between the tolerance lines, with an Outlier Region on either side.]

• “Cone-shaped” tolerances
• Goes through origin
• Strong statistical association
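A traditional ratio edit of this kind can be sketched in a few lines. The tolerance values and the cell estimates below are made up for illustration; the slides do not give the actual limits.

```python
def ratio_edit(employment, payroll, lower, upper):
    """Flag pairs whose payroll falls outside the cone lower*emp .. upper*emp."""
    flags = []
    for emp, pay in zip(employment, payroll):
        inside = lower * emp <= pay <= upper * emp
        flags.append(not inside)
    return flags


# Hypothetical cell estimates and tolerances (payroll per employee).
employment = [10, 50, 100, 100]
payroll = [500, 2500, 5000, 1000]
flags = ratio_edit(employment, payroll, lower=30, upper=70)
print(flags)
```

Because both tolerance lines pass through the origin, the acceptance region narrows for small cells, which is the issue the later "Issue at Origin" slide raises.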

Page 8: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Resistant Fences Methods

Fences: q25 − 1.5H and q75 + 1.5H, where H = q75 − q25 is the interquartile range

• Different numbers of interquartile ranges (1.5 = Inner, 3 = Outer)
• Implicitly assumes symmetry
• May want to “symmetrize”, apply rule, use inverse transformation
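A minimal sketch of the resistant fences rule, using only the Python standard library. The quartile convention (`method="inclusive"`), the choice of the natural log as the symmetrizing transformation, and the sample ratios are all assumptions made for illustration.

```python
import math
import statistics


def resistant_fences(values, k=1.5):
    """Return (lower, upper) = (q25 - k*H, q75 + k*H), with H = q75 - q25."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    h = q75 - q25
    return q25 - k * h, q75 + k * h


def symmetrized_fences(values, k=1.5):
    """Apply the rule on the log scale, then transform the fences back.

    Assumes strictly positive values and that log is a suitable
    symmetrizing transformation for the data at hand.
    """
    lower, upper = resistant_fences([math.log(v) for v in values], k)
    return math.exp(lower), math.exp(upper)


def outside(values, fences):
    lower, upper = fences
    return [v for v in values if v < lower or v > upper]


# Hypothetical current cell ratios (payroll per employee).
ratios = [48.0, 50.0, 51.0, 52.0, 53.0, 55.0, 56.0, 95.0]
inner = outside(ratios, resistant_fences(ratios, k=1.5))
outer = outside(ratios, resistant_fences(ratios, k=3.0))
print(resistant_fences(ratios), inner, outer)
```

The multiplier `k` is the only tuning choice: 1.5 gives the inner fences and 3 the outer fences named on the slide.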

Page 9: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Asymmetric Fences Methods

Fences: q25 − 3 (m − q25) and q75 + 3 (q75 − m), where m is the median

• Different numbers of interquartile ranges (3 = Inner, 6 = Outer)
• Incorporates skewness of distribution in outlier rule (“Fences”)
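The asymmetric rule differs from the symmetric one only in that each fence uses its own half of the interquartile range. The sketch below assumes the lower fence is q25 − k(m − q25) and the upper fence is q75 + k(q75 − m), with m the median; the data and the quartile convention are illustrative assumptions.

```python
import statistics


def asymmetric_fences(values, k=3.0):
    """Return (q25 - k*(m - q25), q75 + k*(q75 - m)), m being the median."""
    q25, m, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25 - k * (m - q25), q75 + k * (q75 - m)


# Hypothetical right-skewed set of estimates.
values = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 6.0, 9.0, 40.0]
inner_fences = asymmetric_fences(values, k=3.0)
outer_fences = asymmetric_fences(values, k=6.0)
flagged = [v for v in values if v < inner_fences[0] or v > inner_fences[1]]
print(inner_fences, outer_fences, flagged)
```

For right-skewed data the upper fence sits further from the median than the lower one, so long right tails are not flagged as aggressively as under a symmetric rule.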

Page 10: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Robust Regression

• Least Trimmed Squares Robust Regression
• Resistant (minimizes median residual)
• Outlier: |residual| exceeds 3 × robust M.S.E.

[Figure: scatter plot of Paired Estimates with Robust Regression Line; x-axis: Total Employment, y-axis: Annual Payroll]
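To make the idea concrete, here is a rough sketch of a least trimmed squares fit for a no-intercept line. It is a simple heuristic (concentration steps started from each observed ratio), not the routine used in the study. The coverage fraction, the scale estimate (square root of the trimmed mean squared residual, with no consistency correction), and the data are all assumptions.

```python
def lts_through_origin(x, y, coverage=0.75, max_steps=50):
    """Heuristic LTS fit of y = beta * x; returns (beta, robust scale)."""
    n = len(x)
    h = max(int(coverage * n), n // 2 + 1)  # number of residuals kept
    best = None
    for x0, y0 in zip(x, y):
        beta = y0 / x0  # start from one observed ratio
        for _ in range(max_steps):
            # Keep the h points with the smallest squared residuals, refit.
            order = sorted(range(n), key=lambda i: (y[i] - beta * x[i]) ** 2)
            keep = order[:h]
            new_beta = (sum(x[i] * y[i] for i in keep)
                        / sum(x[i] ** 2 for i in keep))
            if new_beta == beta:
                break
            beta = new_beta
        squared = sorted((y[i] - beta * x[i]) ** 2 for i in range(n))
        objective = sum(squared[:h])
        if best is None or objective < best[1]:
            best = (beta, objective)
    beta, objective = best
    return beta, (objective / h) ** 0.5


# Hypothetical paired estimates; the last cell is far off the line.
x = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
y = [500, 1010, 1490, 2020, 2480, 3010, 3490, 4020, 4480, 1000]
beta, scale = lts_through_origin(x, y)
flags = [abs(yi - beta * xi) > 3 * scale for xi, yi in zip(x, y)]
print(round(beta, 2), flags)
```

Because the scale is estimated from the best-fitting subset only, it tends to be small, which is one way a rule like this can flag too many cases.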

Page 11: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Issue at Origin (Historic Cell Ratio)

[Figure: scatter plot; x-axis: Prior Month's Number of Employees, y-axis: Current Month's Number of Employees]

Page 12: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Hidiroglou-Berthelot (HB) Edit

[Figure: HB "Effects" plotted against Employment, with Upper Bound and Lower Bound]

• Accounts for magnitude of unit (variability at origin)

Page 13: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Hidiroglou-Berthelot (HB) Edit

• Two-step transformation (Ei)
– Centering transformation on ratios
– Magnitude transformation that accounts for the relative importance of large cases
• Asymmetric Fences “Type” Outlier Rule
• Key Parameters
– U = magnitude transformation parameter (0 ≤ U ≤ 1)
– C = controls width of outlier region
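The two-step transformation and the fence-type rule can be sketched as follows. This follows the HB edit as it is usually described in the literature rather than anything spelled out on the slide: the small constant `a` (commonly set to 0.05) that keeps the fences from collapsing, the quartile convention, and the sample data are assumptions.

```python
import statistics


def hb_edit(prior, current, u=0.3, c=10.0, a=0.05):
    """Return (effects, (lower, upper), flags) for positive paired values."""
    ratios = [cur / pri for pri, cur in zip(prior, current)]
    r_med = statistics.median(ratios)
    effects = []
    for pri, cur, r in zip(prior, current, ratios):
        # Step 1: centering transformation, symmetric around the median ratio.
        s = 1.0 - r_med / r if r < r_med else r / r_med - 1.0
        # Step 2: magnitude transformation, weighting larger units more.
        effects.append(s * max(pri, cur) ** u)
    e_q1, e_med, e_q3 = statistics.quantiles(effects, n=4, method="inclusive")
    d_q1 = max(e_med - e_q1, abs(a * e_med))
    d_q3 = max(e_q3 - e_med, abs(a * e_med))
    lower, upper = e_med - c * d_q1, e_med + c * d_q3
    flags = [e < lower or e > upper for e in effects]
    return effects, (lower, upper), flags


# Hypothetical prior and current period estimates for eight cells.
prior = [10, 20, 30, 40, 50, 60, 70, 80]
current = [11, 21, 33, 41, 54, 20, 73, 84]
effects, bounds, flags = hb_edit(prior, current, u=0.3, c=10.0)
print(bounds, flags)
```

Setting `u=0` reduces the effects to the centered ratios, while values closer to 1 give large cells more influence; `c` widens or narrows the acceptance region.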

Page 14: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Multivariate Methods: Mahalanobis Distance

• Multivariate normal (μ, Σ)
– T(X) estimates μ
– C(X) estimates Σ
– p is the number of distinct variables (items)
• Prone to masking (difficult to detect individual outliers)

$$MD_i = (x_i - T(X))^{\top} \, C(X)^{-1} \, (x_i - T(X)) \sim \chi^2_p$$
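As an illustration of the distance above, the sketch below takes T(X) to be the sample mean and C(X) the sample covariance matrix (the classical, non-robust choice) and compares each distance with a chi-square cutoff. The cutoff uses the closed form for p = 2; other values of p would need a chi-square quantile from a statistics library. The data and the 0.975 probability level are assumptions.

```python
import math


def mean_and_cov(rows):
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(p)] for a in range(p)]
    return mean, cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(rows, center, scatter):
    """Quadratic form (x_i - T)' C^{-1} (x_i - T) for every row."""
    out = []
    for r in rows:
        d = [v - c for v, c in zip(r, center)]
        out.append(sum(di * wi for di, wi in zip(d, solve(scatter, d))))
    return out


def chi2_cutoff_two_items(prob=0.975):
    # Chi-square quantile in closed form, valid only for p = 2.
    return -2.0 * math.log(1.0 - prob)


# Hypothetical (employment, payroll) records; the last one is discordant.
rows = [(10, 500), (20, 1010), (30, 1490), (40, 2020), (50, 2480),
        (60, 3010), (70, 3490), (80, 4020), (90, 4480), (100, 1000)]
center, scatter = mean_and_cov(rows)
md = mahalanobis(rows, center, scatter)
flagged = [i for i, d in enumerate(md) if d > chi2_cutoff_two_items()]
print(flagged)
```

With the classical estimates, the outlier itself inflates the covariance matrix, so its distance only narrowly clears the cutoff here; that is the masking problem the slide mentions and the motivation for robust choices of T(X) and C(X).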

Page 15: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Robust Alternatives

• M-estimation (not considered)
• “Production Method”
• Minimum Volume Ellipsoid (MVE)
– Resistant (50% breakdown) and robust
• Minimum Covariance Determinant (MCD)
– Resistant (50% breakdown) and robust
• Assumption of Normality
– Log-transformation

Page 16: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Evaluation: Classify Item Estimates

• Input Value: Reported
• Final Value: Tabulated
• Ratio: Input/Final
• Classification: Outlier, Potential Outlier, Not an Outlier

[Figure: histogram of Ratio Values; y-axis: Frequency Counts]

Page 17: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Evaluation: Classify Ratios (Bivariate)

• Conservative
– Ratio is “outlier” if numerator or denominator is an outlier
• Anti-Conservative
– Ratio is “outlier” if numerator or denominator is an outlier or a potential outlier

Page 18: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Evaluation: Classify Records (Multivariate)

• Conservative
– Record is “outlier” if at least one estimate is an outlier
• Anti-Conservative
– Record is “outlier” if at least one estimate is an outlier or a potential outlier
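Both classification rules, for ratios and for records, reduce to the same check on the item-level classes. The sketch below uses hypothetical string labels for the three item classes.

```python
OUTLIER = "outlier"
POTENTIAL = "potential outlier"
CLEAN = "not an outlier"


def is_outlier_group(item_classes, conservative=True):
    """True if any item in a ratio or record counts as an outlier.

    Conservative: only OUTLIER items count.
    Anti-conservative: OUTLIER and POTENTIAL items both count.
    """
    bad = {OUTLIER} if conservative else {OUTLIER, POTENTIAL}
    return any(c in bad for c in item_classes)


# A ratio is a (numerator, denominator) pair; a record is a longer list.
ratio = (CLEAN, POTENTIAL)
record = [CLEAN, CLEAN, OUTLIER, POTENTIAL]
print(is_outlier_group(ratio), is_outlier_group(ratio, conservative=False))
print(is_outlier_group(record))
```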

Page 19: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Evaluation Statistics: Bivariate Comparisons

• Individual Test Level
– Type I Error Rate: proportion of false rejects
– Type II Error Rate: proportion of false accepts
– Hit Rate: proportion of flagged estimates that are outliers
• All-Test Level
– All-item Type II error rate

Page 20: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Evaluation Statistics: Multivariate Comparisons

• Type I error rate: the proportion of non-outlier records that are flagged as outliers

• Type II error rate: the proportion of outlier records that are not flagged as outliers (missed “bad” values)
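The evaluation statistics can be computed from two parallel lists: the "true" classification of each ratio or record and the flag produced by an edit. The sketch below uses the denominators given on this slide for the two error rates and the Hit Rate definition from the bivariate slide; the example data are made up.

```python
def evaluation_rates(is_outlier, is_flagged):
    """Return Type I rate, Type II rate and hit rate as a dict."""
    pairs = list(zip(is_outlier, is_flagged))
    false_rejects = sum(1 for truth, flag in pairs if flag and not truth)
    false_accepts = sum(1 for truth, flag in pairs if truth and not flag)
    hits = sum(1 for truth, flag in pairs if truth and flag)
    n_outliers = hits + false_accepts
    n_clean = len(pairs) - n_outliers
    n_flagged = hits + false_rejects

    def rate(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "type_i": rate(false_rejects, n_clean),      # non-outliers flagged
        "type_ii": rate(false_accepts, n_outliers),  # outliers missed
        "hit_rate": rate(hits, n_flagged),           # flagged that are outliers
    }


# Hypothetical truth and edit flags for ten records.
truth = [True, True, True, True, False, False, False, False, False, False]
flags = [True, True, False, False, True, False, False, False, False, False]
rates = evaluation_rates(truth, flags)
print(rates)
```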

Page 21: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Annual Capital Expenditures Survey (ACES)

• Sample Survey (Stratified SRS-WOR)
– ACE-1: Employer companies
– ACE-2: Non-employer companies (not discussed)
• New sample selection each year
• Total and year-to-year change estimates
– Total Capital Expenditures
– Structures (New and Used)
– Equipment (New and Used)

Page 22: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Capital Expenditures Data

• Characterized by
– Low year-to-year correlation (same company)
– Weak association with available auxiliary data
• Editing procedures focus on additivity
• Outlier correction at micro-level

Page 23: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Bivariate Comparisons

• Methods: Robust Regression, Resistant Fences, HB Edit
• Ratios tested
– Structures/Total
– New Structures/Structures
– New Structures/Used Structures
– Equipment/Total
– New Equipment/Equipment
• Resistant Fences: (Symmetric or Asymmetric) (Inner or Outer)
• HB Edit: (U = 0.3 or 0.5) (C = 10 or 20)

Page 24: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Results – Individual Tests

• Robust Regression prone to swamping
– High Type I error rate (false rejects)
• Comparable performance with Asymmetric Inner Fences and HB Edit (U = 0.3, C = 10)
– Low Type I error rates
– High Hit Rates
– High Type II error rates
• Other variations of Resistant Fences and HB edit not as good

Page 25: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Results – All-Tests

• Very large Type II error rates (approx. 50%)
– Robust regression
– Symmetric resistant outer fences
– HB edit with C = 20
• Improved Type II error rates (30% - 40%)
– Asymmetric inner fences
– HB edit (U = 0.3, C = 10)

Page 26: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Multivariate Results

• Original Data: considered methods ineffective
• Log-transformed data: improved performance (MCD and MVE)
– Reduced Type I error rates
– Comparable Type II error rates (to original-data MCD and MVE)

[Figure: bar chart, Conservative Results: 2002; Type I Error Rates and Type II Error Rates (scale 0 to 1) for Production-MD, MCD (original), MVE (original), MCD (log-transformed), MVE (log-transformed)]

Page 27: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Multivariate Versus Bivariate: Different Outcomes (Conservative)

Combined HB edits flag more “outliers”:
– Higher Type I error rate
– Lower Type II error rates for the complete set of HB edits

[Figure: bar chart of counts of Type I Errors (False Rejects), HB vs. MVE, 2002 and 2003; data labels as extracted: 8, 0, 11, 4]

[Figure: bar chart of Counts of Missed Outliers, Type II Errors (False Accepts), HB vs. MVE, 2002 and 2003; data labels as extracted: 13, 14, 0, 4]

Page 28: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Comments

• Economic data with inconsistent statistical association between items in each collection period
• Critical values must be determined by the data set at hand (no “hard-coding”)
• Dynamically
– Standardize the comparisons (HB edit, log transformation)
– Compute outlier limits
• Could try hybrid approach:
– Multivariate method plus a few current cell ratio tests with the HB edit
– Perform all bivariate tests, but unduplicate cells before analyst review

Page 29: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Final Thoughts/Next Steps

• Examined one set of economic data and considered only two separate collections from this program
• Extrapolation would be foolish
• My results need to be validated on other economic data sets
– a more typical periodic business survey and/or
– a well-constructed simulation study

Page 30: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Any Questions?

Katherine Jenny Thompson