
Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck

Jennifer Huckett

Iowa State University

June 20, 2007

Outline

• Motivation

• Disclosure Limitation Methods

• Risk Assessment

• Simulation Study

• Results & Conclusions

Motivation

• Iowa Department of Revenue (IDR)

– Collects and maintains individual tax return data

• Legislative Services Agency (LSA)

– Examines impact of tax law changes on liability

• Current system

– LSA submits requests to IDR

– IDR computes liability, reports to LSA

– Occurs several times each year

– Inefficient for both IDR and LSA

• Solutions

– Secure/remote access server

    • Data are not released

    • Some analyses suppressed

– Statistical disclosure limitation (SDL)

    • Tabular

    • Microdata

        – enable IDR to provide LSA with data set

        – allow LSA to compute liability with ease and accuracy

        – MUST ENSURE CONFIDENTIALITY of RECORDS!

Establishment Connection

• Highly skewed distributions, unusual associations among distributions

• Groups of variables are related to one another in unusual ways

• Similar to business tax data or business expenditure/revenue data

• Confidentiality is critical

Traditional Approaches

• Recoding (e.g. aggregation)

• Noise addition

• Data swapping

• Data suppression

• Imputation

• Combinations of these

Our Approach

• Synthetic microdata simulation

– Retain key demographic variables

– Simulate values for some variables

    • Quantile regression conditional on key variables

    • Compute fitted values at selected quantiles

– Impute values for remaining variables

    • Hot deck + rank swap

    • Hot deck based on simulated income variables

Quantile Regression

$$\min_{\beta} \sum_{i} \rho_\tau\big(y_i - \xi(x_i, \beta)\big)$$

– $\rho_\tau(y_i - \hat{y})$ = “tilted absolute value function” for the $\tau$th quantile

– $\xi(x_i, \beta)$ = linear function of predictors ($x_i$)

• performed in R

– quantreg package

– rq function

Quantile Regression, Koenker 2005
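
As a rough illustration of the "tilted absolute value function", the sketch below writes it in Python and checks the simplest case, an intercept-only model, where minimizing the summed loss over the observed values picks out a sample quantile. The function names and the toy data are assumptions made for this example; the slides themselves use the rq function from the R quantreg package.

```python
def check_loss(u, tau):
    """Tilted absolute value: tau * u for u >= 0, (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def sample_quantile_by_check_loss(ys, tau):
    """Return the observed value that minimizes the summed check loss.

    This is the intercept-only case: the fitted value is one constant c.
    """
    return min(ys, key=lambda c: sum(check_loss(y - c, tau) for y in ys))


# Hypothetical, heavily skewed toy data.
ys = [1, 2, 3, 4, 100]
median = sample_quantile_by_check_loss(ys, 0.5)
upper = sample_quantile_by_check_loss(ys, 0.9)
```

With predictors, the constant c is replaced by a linear function of x and the minimization runs over its coefficients, which is what a quantile regression routine solves.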

Simulate via Quantile Regression

• Estimate $\hat{\beta}(\tau)$ for quantiles $\tau \in \{0.01, 0.02, \ldots, 0.99\}$

• For each record on variable y

– Randomly select $\tau^*$ ~ Uniform(0,1)

– Compute fitted $\hat{y}^* = \xi(x_i, \hat{\beta}(\tau))$ given x at $\tau$ above and below $\tau^*$

– Interpolate to obtain $y^{**}$ = simulated value
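
The simulation step for one record can be sketched as follows, assuming the fitted values at the 99 quantiles are already available as a list. Linear interpolation and the clamping of draws below 0.01 or above 0.99 are assumptions of this sketch, as are the function name and the toy fitted values; the slides do not say how the tails are handled.

```python
import bisect
import random

TAUS = [k / 100 for k in range(1, 100)]  # 0.01, 0.02, ..., 0.99


def simulate_value(fitted, u, taus=TAUS):
    """Interpolate linearly between the fitted quantiles that bracket u.

    fitted[k] is the fitted value for one record at quantile taus[k].
    Draws outside [taus[0], taus[-1]] are clamped to the end values
    (an assumption of this sketch).
    """
    if u <= taus[0]:
        return fitted[0]
    if u >= taus[-1]:
        return fitted[-1]
    hi = bisect.bisect_right(taus, u)  # first grid point strictly above u
    lo = hi - 1
    weight = (u - taus[lo]) / (taus[hi] - taus[lo])
    return fitted[lo] + weight * (fitted[hi] - fitted[lo])


# Hypothetical fitted quantiles for one record: 1000 * tau, so the
# interpolated value should be close to 1000 * u.
fitted = [1000 * t for t in TAUS]
rng = random.Random(0)
u = rng.random()
y_sim = simulate_value(fitted, u)
```

Repeating the draw for every record, with that record's own fitted quantiles, gives one simulated value per record.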

IDR Application: Key Demographic Variables

• Number of dependents

– 0, 1, 2,…

– Categorized into

    • 0

    • 1

    • ≥2

• County

– 1,…,99

– Categorized into 4 population size groups

• State filing status

    1. single

    2. married filing joint

    3. married filing separate on combined return

    4. married filing separate returns

    5. head of household

    6. widow(er) with dependent child

– Categorized into

    • 1

    • 2 and 3

    • 4, 5, and 6

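IDR Application: Quantile Regression for wages

$$\begin{aligned}
\text{wages} = {} & \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \beta_3\,\text{age}^3 + \beta_4\,\text{age}^4 \\
& + \beta_5\, I_{\#\text{dep}}[1] + \beta_6\, I_{\#\text{dep}}[2] + \beta_7\, I_{\text{sfs}}[2,3] + \beta_8\, I_{\text{sfs}}[4,5,6] \\
& + \beta_9\, I_{\text{county}}[2] + \beta_{10}\, I_{\text{county}}[3] + \beta_{11}\, I_{\text{county}}[4]
\end{aligned}$$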

IDR Application: Hot Deck and Rank Swap for Federal Tax

• Hot Deck

– Mahalanobis distance: $d(i,j) = (x_i - x_j)'\, S_{xx}^{-1}\, (x_i - x_j)$

– closest 20 records

• Rank Swap

– compute sample rank, r

– draw random rank, r*, from discrete Uniform[r-10, r+10]

– impute value from record with rank r*
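
A minimal sketch of the two steps follows, under several assumptions that the slides do not confirm: only two matching variables are used (so the covariance matrix can be inverted by hand), the donor is drawn at random from the closest k records, the rank r is the donor's rank on the variable being imputed, and the rank window is truncated at the ends of the file. All names and the toy data are hypothetical.

```python
import random


def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy) for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_sq(a, b, cov):
    """(a - b)' S^{-1} (a - b) for two variables, using the 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def hot_deck_donor(recipient, donors, cov, k, rng):
    """Pick one donor index at random from the k closest donors."""
    order = sorted(range(len(donors)),
                   key=lambda j: mahalanobis_sq(recipient, donors[j], cov))
    return rng.choice(order[:k])


def rank_swap(values, donor_index, window, rng):
    """Return the value whose rank is drawn uniformly within +/- window
    of the donor's rank (window truncated at the ends of the file)."""
    by_rank = sorted(range(len(values)), key=lambda j: values[j])
    r = by_rank.index(donor_index)
    r_star = rng.randint(max(0, r - window), min(len(values) - 1, r + window))
    return values[by_rank[r_star]]


# Hypothetical donors: two simulated income variables and a tax amount.
rng = random.Random(1)
donors = [(rng.gauss(50, 10), rng.gauss(5, 2)) for _ in range(50)]
fed_tax = [0.1 * d[0] + d[1] for d in donors]
cov = covariance_2d(donors)
recipient = (50.0, 5.0)
j = hot_deck_donor(recipient, donors, cov, 20, rng)
imputed = rank_swap(fed_tax, j, 10, rng)
```

The rank swap means the imputed value is not necessarily the donor's own value, only one that sits close to it in the ordering.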

Disclosure Risk Measurement

• Using methods detailed in Reiter (2005) and Duncan and Lambert (1986, 1989)

• Examine specific records

– Original records

– Released records

– Model intruder behavior to assess disclosure risk

• Simulation Study

Original and Released Records

Intruder Behavior

• Target record, t

– Intruder has information on target

– Attempts to match t in released records

• Released records j=1,…,r in Z

• Probability that record j belongs to target t is $\Pr(J = j \mid t, Z)$

• As

– probability decreases

– disclosure risk decreases
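
To make the match probability concrete, here is a simplified sketch in which the intruder treats every released record that agrees with the target on all key variables as equally likely to be the target. That equal-probability rule, the assumption that the target is in the released file, and all names and records below are illustrative; the cited papers describe more general intruder models.

```python
def match_probabilities(target, released, keys):
    """Pr(J = j | t, Z) under equal-probability matching on the keys.

    Every released record that agrees with the target on all keys gets
    probability 1 / (number of matches); all others get 0.
    """
    matches = [all(rec[k] == target[k] for k in keys) for rec in released]
    n_match = sum(matches)
    if n_match == 0:
        return [0.0] * len(released)
    return [1.0 / n_match if m else 0.0 for m in matches]


# Hypothetical released file: three identical records and one unique record.
released = [
    {"age": 34, "marital": "single", "county": 1},
    {"age": 34, "marital": "single", "county": 1},
    {"age": 34, "marital": "single", "county": 1},
    {"age": 71, "marital": "widowed", "county": 4},
]
keys = ("age", "marital", "county")
rare = match_probabilities(released[0], released, keys)
unique = match_probabilities(released[3], released, keys)
```

In this toy file the unique target is matched with probability 1, while each of the three identical records is matched with probability 1/3.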

Simulation Study

Schemes for SDL influence divisions of A into Ap (available, perturbed) and Ad (available, unperturbed).

SDL Schemes in Simulation Study

• No SDL

• Swap 30% marital status

• Swap 30% marital status and minority

• Recode age into 5 year intervals

• Recode age into 5 year intervals and swap 30% marital status and minority

• Simulation via quantile regression and hot deck
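
Two of the traditional schemes can be sketched in a few lines. The pairing rule used for swapping (random pairs drawn from a random 30% of records) and the interval labels are assumptions of this sketch, since the slides do not describe how the swaps or recodes were implemented; the function names and toy records are hypothetical.

```python
import random


def recode_age(age, width=5):
    """Map an age to its interval label, e.g. 37 -> '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def swap_fraction(records, field, fraction, rng):
    """Swap `field` between randomly paired records.

    About `fraction` of the records are selected (rounded down to an even
    count) and paired at random; the input list is left unchanged.
    """
    out = [dict(r) for r in records]
    n_swap = int(len(out) * fraction)
    n_swap -= n_swap % 2
    chosen = rng.sample(range(len(out)), n_swap)
    for a, b in zip(chosen[0::2], chosen[1::2]):
        out[a][field], out[b][field] = out[b][field], out[a][field]
    return out


# Hypothetical file of 20 records.
records = [{"age": 20 + i, "marital": "M" if i % 2 else "S"} for i in range(20)]
rng = random.Random(0)
swapped = swap_fraction(records, "marital", 0.30, rng)
recoded = [dict(r, age=recode_age(r["age"])) for r in swapped]
```

Swapping leaves the marginal distribution of the swapped field unchanged, while recoding coarsens a variable without moving any values between records.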

Targets

• Intruder has information on target, t, and wants to match with released records

• Consider a few targets

– Unique record

– Rare record

– Common record

Results from Simulation Study

$\Pr(J = j \mid t, Z)$

target   No SDL   Marital swap   Marital and minority swap   Age recode   Swaps and recode   Quantile regression and hot deck
unique   1        1              0.1046                      1            0.0178             0.0895
rare     0.3333   0.1044         0.1304                      0.0526       0.0225             0.0016
common   0.0385   0.0320         0.0320                      0.0068       0.0055             0.0008

Conclusions & Future Work

• Risk behaves as we expect

– increased SDL

– decreased disclosure risk (except for unique!)

• Apply SDL techniques to American Community Survey data at the US Census Bureau

• Compare traditional techniques to quantile regression and hot deck by computing risk

• Measure utility of released data

Acknowledgements

• Iowa Department of Revenue

• Iowa’s Legislative Services Agency

• National Institute of Statistical Sciences

• US Census Bureau Dissertation Fellowship Award

References

• Duncan, G.T. and Lambert, D. 1986. “Disclosure-Limited Data Dissemination,” Journal of the American Statistical Association, 81, 10-28.

• Duncan, G.T. and Lambert, D. 1989. “The Risk of Disclosure for Microdata,” Journal of Business and Economic Statistics, 7, 207-217.

• Koenker, R. 2005. “Introduction,” Quantile Regression, Econometric Society Monograph Series, Cambridge University Press.

• Reiter, J.P. 2005. “Estimating Risks of Identification Disclosure in Microdata,” Journal of the American Statistical Association, 100, 472, 1103-1113.
