

Myoung Ho Lee

STATISTICAL METHODS FOR REDUCING BIAS IN WEB SURVEYS

13th September 2012

Introduction

Web surveys

Methodology

- Propensity Score Adjustment

- Calibration (Rim weighting)

Case Study

Discussion and Conclusion

Contents 2

• Trends in Data Collection

Paper and Pencil => Telephone => Computer

=> Internet (Web)

• Internet penetration

Introduction 3

Pros and Cons of Web surveys

• Pros

- Low cost and speed

- No interviewer effect

- Visual, flexible and interactive

- Respondent convenience

• Cons

- Quality of sample estimates

Web surveys may offer solutions, but they also bring problems!

Introduction 4

Previous Studies

• Harris Interactive (2000 ~ )

• Lee (2004), Lee and Valliant (2009)

• Huh and Cho (2009)

• Bethlehem (2010), etc.

Lee and Valliant (2009) : good performance in simulation

But most of the other results do not look as good.

- Malhotra and Krosnick (2007), Huh and Cho (2009)

Introduction 5

Volunteer Panel Web Survey Protocol (Lee, 2004)

Web surveys 6

Under-coverage Self-selection Non-response

Challenge: Fix anticipated biases in web survey estimates that

result from under-coverage, self-selection and non-response


Methodology

Proposed Adjustment Procedure for Volunteer Panel Web surveys (Lee, 2004)

7

Propensity Score Adjustment (PSA)

• Original idea : Comparison of two groups, treatment and control,

in observational studies (Rosenbaum and Rubin, 1983)

- by weighting using all auxiliary variables that are thought to

account for the differences

• In context of web surveys, this technique aims to correct for

differences between offline people and online people

- by accounting for the particular inclinations of people who participate

in the volunteer panel web survey

Methodology 8

• “Webographic” : overlapping variables between

web and reference survey

- To capture the difference between online and

offline populations (Schonlau et al., 2007)

- For example, “Do you feel alone?”, “In the last month

have you read a book?”…… (Harris Interactive)

Methodology 9

• Propensity score : e(x_i) = P(z_i = 1 | x_i), where z_i = 1 indicates web-survey participation

It is assumed that the z_i are independent given a set of covariates (x_i)

• ‘Strong ignorability assumption’ : Response variable is conditionally

independent of treatment assignment given the propensity score.

Methodology 10

Logistic regression model : logit{e(x_i)} = log[ e(x_i) / (1 - e(x_i)) ] = x_i'β
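The deck fits this model in SAS/R; as an illustration only, the logistic propensity model can be sketched in pure Python with batch gradient ascent on a toy dataset. All data and settings here are invented for the example:

```python
import math
import random

def fit_logistic(X, z, lr=0.1, epochs=2000):
    """Fit P(z=1 | x) = 1 / (1 + exp(-(b0 + b.x))) by batch gradient
    ascent on the log-likelihood (no regularization, for illustration)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)                      # beta[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, zi in zip(X, z):
            eta = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            resid = zi - 1.0 / (1.0 + math.exp(-eta))
            grad[0] += resid
            for j, x in enumerate(xi):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity(beta, xi):
    """Estimated propensity score e_hat(x) for one unit."""
    eta = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
    return 1.0 / (1.0 + math.exp(-eta))

# Toy data: z=1 marks web-panel membership; x is one hypothetical
# "webographic" covariate that differs between the two groups.
random.seed(1)
X = [[random.gauss(1.0, 1.0)] for _ in range(50)] + \
    [[random.gauss(0.0, 1.0)] for _ in range(50)]
z = [1] * 50 + [0] * 50
beta = fit_logistic(X, z)
scores = [propensity(beta, xi) for xi in X]
```

Units with covariate values typical of the web panel receive higher estimated propensity scores, which the following slides turn into weights.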

Variable Selection

• Include variables related to not only treatment assignment

but also response in order to satisfy the ‘strong ignorability

assumption’

(Rosenbaum and Rubin, 1984; Brookhart et al., 2006)

Methodology 11

Variable Selection

• In practice, stepwise selection methods have often been used to

develop good predictive models for treatment assignment

• Most previous web studies : Use of all available covariates (5-30)

• Huh and Cho (2009) : 9 or 7 out of 123 covariates were chosen

according to the authors’ “subjective” views

Methodology 12

Variable Selection

• Stepwise logistic regression using SIC (Schwarz Information Criterion)

- large number of covariates, little theoretical guidance

• LASSO (PROC GLMSELECT in SAS)

- a good alternative to stepwise variable selection

• Boosted tree (“gbm” in R)

- determine a set of split conditions

Methodology 13

Applying methods for PSA

• Inverse propensity scores as weights

- weights : w_i = 1 / ê(x_i)

- then, multiply them with sampling weights
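The inverse-weighting step can be sketched as follows. This is a generic illustration, not the deck's exact formula; the optional cap anticipates the weight-trimming idea raised later in the Discussion:

```python
def psa_weights(sampling_weights, prop_scores, cap=None):
    """Inverse-propensity adjustment: w_i = d_i / e_hat_i, where d_i is
    unit i's sampling weight and e_hat_i its estimated propensity score.
    An optional cap bounds extreme weights caused by tiny scores."""
    weights = [d / e for d, e in zip(sampling_weights, prop_scores)]
    if cap is not None:
        weights = [min(w, cap) for w in weights]
    return weights

w = psa_weights([1.0, 2.0, 1.5], [0.5, 0.25, 0.125])  # [2.0, 8.0, 12.0]
```

Note how a small propensity score (0.125) inflates that unit's weight, which is exactly what makes extreme weights a concern.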

• Subclassification (Stratification)

- grouping homogeneous people into strata

Methodology 14

• Subclassification (Stratification)

1. Combine both reference and web data into one

2. Estimate each propensity score from the combined sample

3. Partition those units into C subclasses according to their ordered

propensity scores, where each subclass has about the same number of units

4. Compute the adjustment factor and apply it to all units in the cth

subclass.

5. Multiply the factor by the sampling weights to get the PSA weights
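Steps 1–5 can be sketched as below. The adjustment factor used here is one common choice (the ratio of reference-sample share to web-sample share within each subclass); the slide does not spell out its exact factor, so treat that part as an assumption:

```python
def subclassify(scores, C):
    """Steps 2-3: order the combined units by estimated propensity score
    and split them into C subclasses of roughly equal size."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) / C
    sub = [0] * len(scores)
    for rank, i in enumerate(order):
        sub[i] = min(int(rank / size), C - 1)
    return sub

def adjustment_factors(sub, source, C):
    """Step 4, under one common (assumed) choice of factor: within
    subclass c, f_c = (share of reference units) / (share of web units).
    Assumes every subclass contains units from both samples."""
    n_web = sum(1 for s in source if s == 'web')
    n_ref = len(source) - n_web
    factors = []
    for c in range(C):
        web_c = sum(1 for k, s in zip(sub, source) if k == c and s == 'web')
        ref_c = sum(1 for k, s in zip(sub, source) if k == c and s == 'ref')
        factors.append((ref_c / n_ref) / (web_c / n_web))
    return factors

# Step 1: combined toy sample of 5 web and 5 reference units.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
source = ['web', 'ref'] * 5
sub = subclassify(scores, C=2)
f = adjustment_factors(sub, source, C=2)
# Step 5: a web unit in subclass c then gets weight d_i * f[c].
```

Subclasses where web units are over-represented get a factor below 1, pulling their weights down toward the reference distribution.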

Methodology 15

Calibration (Rim weighting)

• Matching sample and population characteristics only with

respect to the marginal distributions of selected covariates

• Little and Wu (1991)

- Iterative algorithm that alternately adjusts the weights to match

each covariate’s marginal distribution until convergence
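The alternating adjustment can be sketched as iterative proportional fitting (raking); the data and population targets below are invented for illustration:

```python
def rake(weights, covariates, targets, tol=1e-10, max_iter=500):
    """Rim weighting: alternately rescale the weights so that each
    covariate's weighted marginal matches its population target shares,
    repeating the passes until the weights stop changing.

    covariates[v][i] : category of unit i on covariate v
    targets[v][cat]  : population share of category cat on covariate v
    """
    w = list(weights)
    for _ in range(max_iter):
        delta = 0.0
        for cov, target in zip(covariates, targets):
            total = sum(w)
            margin = {}
            for wi, cat in zip(w, cov):
                margin[cat] = margin.get(cat, 0.0) + wi
            for i, cat in enumerate(cov):
                new = w[i] * target[cat] * total / margin[cat]
                delta = max(delta, abs(new - w[i]))
                w[i] = new
        if delta < tol:
            break
    return w

# Toy example: rake four equal weights to a 60/40 gender margin
# and a 50/50 age margin.
gender = ['m', 'm', 'f', 'f']
age = ['y', 'o', 'y', 'o']
w = rake([1.0] * 4, [gender, age],
         [{'m': 0.6, 'f': 0.4}, {'y': 0.5, 'o': 0.5}])
```

After convergence the weighted shares reproduce both marginal targets, even though the joint distribution was never specified.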

Methodology 16

Case Study

• Reference survey : “2009 Social Survey” by Statistics Korea

- Culture & Leisure, Income & Consumption, etc.

- All persons aged 15+ in 17,000 households

- Sample size : 37,049

- Face-to-face mode

- Post-stratification estimation

- Assumed to be “True”

Case Study 17

• Web survey

- Recruiting volunteers from web sites (6,854 households)

- Systematic sampling with unequal selection probabilities

(inverse of rim weights using region, age, gender)

- Sample size : 1,500 households and 2,903 respondents

- Overlapping covariates : 123

Case Study 18

Case Study – Model Selection 19

M1 = Stepwise(22), M2 = Stepwise(17), M3 = LASSO(12), M4 = Boosted tree(18)

Assessment methods

• 16 combinations : (Model 1, 2, 3 and 4) × (Inverse weighting

and Subclassification) × (No Calibration and Rim weighting)

• 12 response variables

• Percentage of bias reduction
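The slide does not define the metric; one common (assumed) definition compares the absolute bias before and after adjustment, taking the reference-survey estimate as truth:

```python
def pct_bias_reduction(true_value, unadjusted, adjusted):
    """Percentage of bias reduction under one common (assumed) definition:
    100 * (1 - |adjusted - true| / |unadjusted - true|).
    100 means the bias is fully removed, 0 means no change, and negative
    values mean the adjustment made the estimate worse."""
    return 100.0 * (1.0 - abs(adjusted - true_value)
                    / abs(unadjusted - true_value))

# Invented numbers: the raw web estimate 40 misses the "true" value 50
# by 10; the adjusted estimate 48 misses by only 2, an 80% reduction.
reduction = pct_bias_reduction(50.0, 40.0, 48.0)
```

Negative values of this metric are exactly the "PSA made things worse" cases discussed on the following slides.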

Case Study 20

[Charts: Percentage of bias reduction for models M1-M4, comparing inverse weighting vs. subclassification, for PSA alone and for PSA with calibration]

• Why does PSA alone not work well?

[Figure: Propensity scores for each survey in 5 strata in Model 1]

Discussion 22

What are the possible solutions to fix poor PSA?

• Setting a maximum value for the weights (weight capping)

• Different subclassification algorithm

- Formula for the variance of weights that depends on both the

number of cases from each group within a stratum and the

variability of propensity scores within the stratum

• Matching PSA

- matching a limited number of treated group members to a larger

number of control group members

Discussion 23

• Violation of some assumptions

- ‘Strong ignorability assumption’

- Missing at random (MAR)

- Mode effects

• Variable selection (What are webographic variables?)

- Models affect the performance of PSA significantly

- Perhaps expert knowledge rather than a purely statistical approach

- Further studies are needed

Discussion 24

• Web surveys have attractive advantages

• However, they suffer bias from self-selection, under-coverage and non-response

• According to my case study results,

=> it seems difficult to apply PSA in the “real world” just yet

• Further research on webographic variables and different PSA

methods is needed

Conclusion 25
