Using R to analyse complex survey samples Thomas Lumley Associate Professor of Biostatistics, University of Washington. R Core Development Team


Page 1: Analysis Complex Samples 131108

8/12/2019 Analysis Complex Samples 131108

http://slidepdf.com/reader/full/analysis-complex-samples-131108 1/31

Using R to analyse complex survey samples

Thomas Lumley
Associate Professor of Biostatistics, University of Washington
R Core Development Team

Outline
• The R survey package
• Why has R become successful?
• Why open-source software matters to statistics

R (needs no introduction)
• An open-source reimplementation of the S language from Bell Labs
• Initially a Kiwi creation, now used around the world
– 2008 Pickering Medal to Ross Ihaka for R
• Probably the most popular medium for distributing new statistical methodology
– CRAN: 1500 packages
– Bioconductor: 500 packages

http://faculty.washington.edu/tlumley/survey/ 

Brief history
• 2002: I visit Auckland, start writing survey package
• January 2003: first version released
• July 2003: replicate weights
• April 2004: published in J. Stat. Software
• (US) Spring 2005: multistage sampling, calibration
• (US) Winter 2006: two-phase designs
• (NZ) Winter 2008: database-backed designs
• August 2009: book (I hope)

Design philosophy
Mostly comes from limited resources
– write in high-level language
– code reuse to expose bugs
– keep data in memory (mostly)
– don’t optimize until someone complains (Moore’s Law)
– emphasize features that look like biostatistics
Package is about 8000 lines of code
– cf 250,000 for VPLX from US Census Bureau
– about 300,000 for all of R; 25,000,000 for SAS (!)

Interesting features
• Secondary analysis/modelling of large surveys
– graphics, smoothing
– regression models
– analysis of multiply-imputed data
• Simulations
– R programming language
• Calibration (raking, GREG) estimators
– including calibration for regression models
• Database-backed objects
– data loaded as needed from relational database
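
The calibration bullet can be illustrated with a minimal sketch. It assumes the `api` example data that ships with the survey package rather than any data set from the talk, and the population counts by school type are the ones used in the package's own examples. `calibrate()` is called once with its default linear (GREG) calibration and once with `calfun = "raking"`.

```r
library(survey)
data(api)  # California school samples shipped with the survey package

# One-stage cluster sample of school districts
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Known population totals for the model-matrix columns of ~stype
# (elementary schools are the reference level, so 6194 is the total count)
pop.totals <- c(`(Intercept)` = 6194, stypeH = 755, stypeM = 1018)

# Linear (GREG) calibration and raking to the same margins;
# each call returns a new design object
dclus1g <- calibrate(dclus1, ~stype, pop.totals)
dclus1r <- calibrate(dclus1, ~stype, pop.totals, calfun = "raking")

svytotal(~stype, dclus1g)  # calibrated totals match the known counts
svymean(~api00, dclus1)    # estimate before calibration
svymean(~api00, dclus1g)   # estimate after calibration
```

Both calls leave `dclus1` unchanged, so calibrated and uncalibrated estimates can be compared side by side.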

Why me?
[ie: Lumley? What does he know about surveys?]
Semiparametric model-based methods are converging on design-based inference
– ‘sandwich’ variance estimators
– model-robustness
– concept of parameters as functionals on distributions
– IPW in causal inference, missing data
– two-phase sampling in cohort studies
Emphases are different: that’s what users are for.

User interface
• Data and design meta-data are stored in a survey design object
– ensures meta-data and data are kept together
– subset operator sets up data for domain estimation
– post-stratification/calibration creates new object
• Data variables in the object are specified by model formulas
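
As a rough sketch of these interface points, the following uses the `api` example data shipped with the survey package (an assumption for illustration, not data from the talk). It builds a design object, names variables with formulas, takes a domain with `subset()`, and creates a post-stratified copy with `postStratify()`.

```r
library(survey)
data(api)

# Data and design meta-data are bundled in one object
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Variables are referred to by model formulas
svymean(~api00 + api99, dstrat)

# subset() returns a design prepared for domain estimation
elementary <- subset(dstrat, stype == "E")
svymean(~api00, elementary)

# Post-stratification returns a new object; dclus1 itself is not modified
pop.types <- data.frame(stype = c("E", "H", "M"),
                        Freq = c(4421, 755, 1018))
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
dclus1p <- postStratify(dclus1, ~stype, pop.types)
svytotal(~enroll, dclus1p)
```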

Example: NHANES III
library(survey)
# nhanes3 is assumed to be a data frame already loaded in the session
dhanes <- svydesign(id=~SDPPSU6, strata=~SDPSTRA6,
                    weights=~WTPFQX6, data=nhanes3, nest=TRUE)
# Means, median and totals for the whole sample
svymean(~BMPWTMI+BMPHTMI, design=dhanes)
svyquantile(~BMPWTMI, design=dhanes, quantiles=0.5)
svytotal(~factor(HAB1MI), design=dhanes)
# Domain of people older than 18, with derived BMI variables
adults <- subset(dhanes, HSAGEIR>18)
adults <- update(adults, bmi=BMPWTMI/(BMPHTMI/100)^2)
adults <- update(adults, bmigp=cut(bmi, c(0,18.5,25,30,Inf)))
svymean(~bmigp, adults)
svyby(~bmigp, ~HAB1MI, svymean, design=adults)

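Example: Californian schools
library(survey)
data(api)   # provides apiclus2, a two-stage cluster sample of schools
# Two-stage design: districts (dnum), then schools (snum),
# with a finite population correction at each stage
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
model1 <- svyglm(api00~api99+emer, design=dclus2)
model2 <- svyglm(api00~api99+meals+mobility+ell+emer, design=dclus2)
model3 <- svyglm(api00~api99+stype+emer, design=dclus2)
summary(model1)
summary(model2)
summary(model3)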

Large data
• With all data kept in memory
– on a laptop, NHANES-scale analyses feasible if relevant variables selected first
– inexpensive 64-bit Linux systems can handle millions of records
• Database-backed
– variables loaded on-demand for each command
– hundreds of thousands of records possible on laptop
• 2007 BRFSS: 430,000 records (is just feasible)

Database-backed
dhanes <- svydesign(id=~SDPPSU6, strata=~SDPSTRA6, weights=~WTPFQX6,
                    data="set1", dbtype="ODBC",
                    dbname="nhanes3", nest=TRUE)
• Specify a SQL database table or view as the data source. Only read access is needed
• Design metadata is kept in memory, other variables loaded only as needed
• Works with ODBC, JDBC, and directly with Oracle, PostgreSQL and other popular databases
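
For readers without an ODBC source, a self-contained variant can be sketched with SQLite. This is an assumption made for illustration: it swaps the ODBC connection on the slide for the small `api.db` SQLite file that ships with the survey package and needs the RSQLite package installed.

```r
library(survey)
library(RSQLite)  # assumption: SQLite stands in for the ODBC source above

# data= names a table in the database instead of a data frame
dbclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc,
                     data = "apiclus1", dbtype = "SQLite",
                     dbname = system.file("api.db", package = "survey"))

# Each command reads only the variables it needs from the table
svymean(~api00, dbclus1)
svytotal(~enroll, dbclus1)

close(dbclus1)  # release the database connection when finished
```

The design variables (`dnum`, `pw`, `fpc`) are read when the object is created; analysis variables such as `api00` are fetched per command.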

Data from NHIS: about 25k observations

Health insurance coverage (by age)

Why is R successful?

Charlton Heston brings SAS down from Mt Sinai

R spreads through a terrified nation

1998

Reasons for the R pandemic?
• Extensibility
• Cost
• Rapid development
• Network effects

Extensibility
• Can users write extensions that look like built-in functionality?
• Can users find these extensions?
• Is it easy to tell what extensions are installed and to get rid of them?
• Can old versions of the software co-exist with new ones?

Free (as in beer)
• Price sensitivity should be lower for specialist statisticians, and for large companies where statistics is mission-critical
– but these are more likely to use R
• Students are price-sensitive
– low cost is useful in teaching
– academics learn computing from their PhD students

Rapid development
• User’s syntax is the same as developer’s language
– deliberate design for ‘slippery slope’ to programming
• Functional language, dynamic types
– slow, inefficient memory use
– lack of side effects makes it very easy to use
– most of R is written in R
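
A small base-R sketch of two of these points: a user-written function is defined and used like any built-in one, and arguments behave as copies, so a function does not alter the caller's data. The names `cv`, `rescale` and `heights` are made up for the illustration.

```r
# A user-written summary, written in the same language a developer would use
cv <- function(x) sd(x) / mean(x)

# Functions are ordinary values and can be passed to other functions
heights <- list(a = c(150, 160, 170), b = c(100, 200, 300))
sapply(heights, cv)

# Arguments behave as copies: changing v inside the function
# leaves the caller's x untouched
rescale <- function(v) {
  v[1] <- 0   # modifies the local copy only
  v
}
x <- c(1, 2, 3)
y <- rescale(x)
x   # still 1 2 3
y   # 0 2 3
```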

Why is open-source statistical software important?

Open-source
• Three related benefits
– publication of novel methods
– dissemination of good statistical practice
– reproducibility
• Open-source platform is not required, but it helps
– need widely available platform
– need packaging system for distributable code
– need archive of old platform versions

Code as language
• Code describes exactly what analyses you did
– equations miss many practical aspects
– complete and precise descriptions in English are hard, and more ugly than the code
• Code can be reused
– a problem should not need to be solved more than once
• Tools for communicating with others also help when communicating with yourself