The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R
Revolution Confidential
Susan I. Ranney, Ph.D.
JSM 2012
The Tools
  Open Source R
    Flexible, powerful, great graphics
    Great for prototyping, not very scalable
  RevoScaleR (with Revolution R Enterprise)
    Efficient file format (.xdf)
    Functions for accessing/importing external data sets (fixed-format and delimited text, SPSS, SAS, ODBC)
    Very fast, parallelized, distributed analysis functions (summary stats, crosstabs/cubes, linear models, k-means clustering, logistic regression, glm)
The Data
  Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available for download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
  “These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” – Joseph Adler, R in a Nutshell
The U.S. Birth Data (continued)
  Data for each year are contained in a compressed, fixed-format text file
  Typically 3 to 4 million records per file
  Variables and structure of the files sometimes change from year to year, with birth certificates revised in 1989 and 2003
  Warnings:
    NOTE: THE RECORD LAYOUT OF THIS FILE HAS CHANGED SUBSTANTIALLY. USERS SHOULD READ THIS DOCUMENT CAREFULLY.
  Reporting can differ state-to-state for any given year
The Process
  Basic “life cycle of analysis”
    Import and combine data
      25 years
      Over 100 million obs.
    Check and clean data
    Basic variable summaries
    Big data logistic/glm regressions
  All on my laptop
    Option to distribute computations to cluster
The Question
  CDC Report in Jan. 2012
  What accounts for the increase in multiple births in the United States?
  Can we separate out effects of mother’s and father’s ages, race/ethnicity?
  Examine time trends by sub-group, assumed to be associated with fertility treatment
  CDC finding: The older age of women at childbirth accounts for only about 1/3 of the rise in twinning over 30 years (but this mixes in the increased rate of “twinning” for older women)
Importing the U.S. Birth Data for Use in R
Example of Differences for Different Years
  To create common variables across years, use common names and new factor levels for ‘colInfo’ in the rxImport function. For example:
  For 1985:
    SEX = list(type="factor", start=35, width=1, levels=c("1", "2"), newLevels = c("Male", "Female"), description = "Sex of Infant")
  For 2003:
    SEX = list(type="factor", start=436, width=1, levels=c("M", "F"), newLevels = c("Male", "Female"), description = "Sex of Infant")
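A minimal base-R sketch of the idea behind the two colInfo entries above: the start position and raw codes differ by layout year, while the variable name and final factor levels stay the same. The helper `sexColInfo` and the small raw vectors are hypothetical and exist only for illustration; the sketch does not call rxImport, it only builds the lists and mimics the level mapping with `factor()`.

```r
# Hypothetical helper: builds the colInfo entry for SEX for a given layout year.
# Only the two layouts shown on the slide (1985 and 2003) are filled in.
sexColInfo <- function(layoutYear) {
  layouts <- list(
    "1985" = list(start = 35,  levels = c("1", "2")),
    "2003" = list(start = 436, levels = c("M", "F"))
  )
  lay <- layouts[[as.character(layoutYear)]]
  if (is.null(lay)) stop("no layout defined for year ", layoutYear)
  list(type = "factor", start = lay$start, width = 1,
       levels = lay$levels, newLevels = c("Male", "Female"),
       description = "Sex of Infant")
}

colInfo1985 <- list(SEX = sexColInfo(1985))
colInfo2003 <- list(SEX = sexColInfo(2003))

# Base-R stand-in for the recoding: made-up raw codes from each layout
# end up as one factor with identical levels.
raw1985 <- c("1", "2", "2")
raw2003 <- c("F", "M", "F")
sex1985 <- factor(raw1985, levels = colInfo1985$SEX$levels,
                  labels = colInfo1985$SEX$newLevels)
sex2003 <- factor(raw2003, levels = colInfo2003$SEX$levels,
                  labels = colInfo2003$SEX$newLevels)
combined <- factor(c(as.character(sex1985), as.character(sex2003)),
                   levels = c("Male", "Female"))
```

Keeping the per-year differences in one lookup table makes it easier to see which fields change between layouts and which are shared.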
Creating Transformed Variables on Import
  Use standard R syntax to create transformed variables.
  For example, create a factor for Mom’s Age using the imported MAGER integer variable:
    MomAgeR7 = cut(MAGER, breaks = c(0, 19, 24, 29, 34, 39, 98, 99), labels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "Over 39", "Missing"))
  Create a binary variable for “IsMultiple”:
    IsMultiple = DPLURAL_REC != 'Single'
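The two transforms can be tried on a small data frame in plain R. The sketch below uses made-up ages and assumes, for illustration only, that DPLURAL_REC takes values such as "Single", "Twin", and "Triplet"; only "Single" appears in the original. Note that `cut()` uses right-closed intervals by default, so 19 falls in "Under 20" and the code 99 falls in "Missing".

```r
# Made-up records for illustration; the non-"Single" labels are assumptions.
births <- data.frame(
  MAGER = c(17, 19, 20, 24, 33, 41, 99),
  DPLURAL_REC = c("Single", "Twin", "Single", "Single",
                  "Triplet", "Single", "Twin"),
  stringsAsFactors = FALSE
)

# Same breaks and labels as on the slide; intervals are (0,19], (19,24], ...
births$MomAgeR7 <- cut(
  births$MAGER,
  breaks = c(0, 19, 24, 29, 34, 39, 98, 99),
  labels = c("Under 20", "20-24", "25-29", "30-34",
             "35-39", "Over 39", "Missing")
)

# TRUE for any record that is not part of a single delivery
births$IsMultiple <- births$DPLURAL_REC != "Single"
```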
Steps for Complete Import
  Lists for column information and transforms are created for 3 base years (1985, 1989, 2003), when there were very large changes in the structure of the input files
  Changes to these lists are made where appropriate for in-between years
  A test script is run, importing only 1000 observations per year for a subset of years
  Full script is run, importing each year, sorting according to key demographic characteristics, and appending to a master .xdf file
Examining and Cleaning the Big Data File
Examining Basic Information
  Basic file information
    > rxGetInfo(birthAll)
    File name: C:\Revolution\Data\CDC\BirthUS85to09S.xdf
    Number of observations: 100672041
    Number of variables: 50
    Number of blocks: 215
  Use rxSummary to compute summary statistics for continuous data and category counts for each of the factors (about 4 minutes on my laptop)
    rxSummary(~., data=birthAll, blocksPerRead = 10)
Example of Summary Statistics
  DadAgeR8   Counts
  Under 20    3226602
  20-24      15304803
  25-29      23805056
  30-34      23179418
  35-39      13289015
  40-44       4984146
  Over 44     2140207
  Missing    14742794

  MomAgeR7   Counts
  Under 20   11918891
  20-24      25975642
  25-29      28701398
  30-34      22341530
  35-39       9788753
  Over 39     1945827
  Missing           0
Histograms by Year
  Easily check for basic errors in data import (e.g. wrong position in file) by creating histograms by year – very fast (just seconds on my laptop)
  Example: Distribution of mother’s age by year. Use F() to have the integer year treated as a factor.
    rxHistogram(~MAGER | F(DOB_YY), data=birthAll, blocksPerRead = 10, layout=c(5,5))
Age of Mother Over Time
  [Figure: histograms of mother’s age by year]
Drill Down and Extract Subsamples
  Take a quick look at “older” fathers:
    rxSummary(~F(UFAGECOMB),
      data=birthAll,
      blocksPerRead = 10)
  Dad’s Age   Counts
  80             141
  81             108
  82              81
  83              74
  84              56
  85              43
  86              43
  87              26
  88              27
  89            3327
  What’s going on with 89-year-old Dads? Extract a data frame:
    dad89 <- rxDataStep(
      inData = birthAll,
      rowSelection = UFAGECOMB == 89,
      varsToKeep = c("DOB_YY", "MAGER",
        "MAR", "STATENAT", "FRACEREC"),
      blocksPerRead = 10)
Year and State for 89-Year-Old Fathers
    rxCube(~F(DOB_YY):STATENAT, data=dad89, removeZeroCounts=TRUE)
  F_DOB_YY   STATENAT     Counts
  1990       California        1
  1999       California        1
  2000       California        1
  1996       Hawaii            1
  1997       Louisiana         1
  1986       New Jersey        1
  1995       New Jersey        1
  1996       Ohio              1
  1989       Texas          3316
  1990       Texas             1
  2001       Texas             1
  1985       Washington        1
Demographics of Multiple Births 1985-2009
Unit of Observation: Delivery
  Appropriate unit of observation is the delivery (resulting in 1 or more live births) rather than the individual birth.
  Use 1/Plurality as probability weight
  Alternative: Look at nearby records to compute the “Reported Delivery Birth Order” (RDPO), then select only the 1st born in the delivery for the analysis
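A small worked example of why the 1/Plurality weight turns birth records into deliveries. The seven records below are made up: two singletons, one pair of twins, and one set of triplets. The `deliveryId` column is hypothetical and only documents which records belong together; the weight itself needs only the plurality.

```r
# Seven made-up birth records from four deliveries.
records <- data.frame(
  deliveryId = c(1, 2, 3, 3, 4, 4, 4),  # hypothetical id, for illustration only
  Plurality  = c(1, 1, 2, 2, 3, 3, 3)
)
records$PluralWeight <- 1 / records$Plurality
records$IsMultiple   <- records$Plurality > 1

# Each delivery's records carry weights that sum to 1,
# so the total weight is the number of deliveries.
nDeliveries <- sum(records$PluralWeight)

# Unweighted: share of births that are multiples (5 of 7).
shareOfBirths <- mean(records$IsMultiple)

# Weighted: share of deliveries that are multiple deliveries (2 of 4).
shareOfDeliveries <- weighted.mean(records$IsMultiple, records$PluralWeight)
```

Without the weight, multiple deliveries are over-counted in proportion to their plurality, which is why the two shares differ.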
Multiple Births Logistic Regression & Predictions
Logistic Regression
  Logistic Regression Results for:
    IsMultiple ~ DadAgeR8 + MomAgeR7 +
      FRaceEthn + MRaceEthn +
      DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 + MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85
  File name: C:\Revolution\Data\CDC\BirthUS85to09R.xdf
  Probability weights: PluralWeight
  Dependent variable(s): IsMultiple
  Total independent variables: 118 (Including number dropped: 12)
  Number of valid observations: 100672041
  Number of missing observations: 0
  -2*LogLikelihood: 14555447.8514 (Residual deviance on 100671935 degrees of freedom)
  About 6 minutes on my laptop
Counts of Deliveries by all Demographic Combinations by Year
    rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY),
      data = birthAllR, pweights = "PluralWeight",
      blocksPerRead = 10)
  Under 10 seconds to compute on my laptop
  Resulting data frame has 50,400 rows representing all the demographic combinations for each of the 25 years
    44,661 have counts <= 1000
  Provides input data for predictions and weights for aggregating predictions
Predict and Aggregate by Year
    predOut <- rxPredict(logitObj, data = catAllDF)
  Create predictions for each detailed demographic group
  Aggregate for each year using population percentages for each detailed group for each year
  Compare with actuals
  Perform “What if?” scenarios
    Scenario 1: Only change in demographics; no “fertility-treatment” time trend
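The predict-and-aggregate steps can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the analysis itself: `glm()` and `predict()` stand in for the RevoScaleR fitting and rxPredict calls, the data are simulated, and the single demographic variable `momOlder`, the three-value `year`, and all coefficients are hypothetical.

```r
set.seed(1)
n <- 5000
# Simulated deliveries: one made-up demographic flag and a short time index.
toy <- data.frame(
  year     = sample(0:2, n, replace = TRUE),
  momOlder = sample(c(FALSE, TRUE), n, replace = TRUE)
)
p <- plogis(-3.5 + 0.5 * toy$momOlder + 0.2 * toy$year * toy$momOlder)
toy$IsMultiple <- rbinom(n, 1, p)

# Group effect plus a group-specific time trend, echoing the model form above.
fit <- glm(IsMultiple ~ momOlder + momOlder:year,
           family = binomial(), data = toy)

# One row per demographic cell per year, with its count as aggregation weight.
cells <- expand.grid(year = 0:2, momOlder = c(FALSE, TRUE))
cells$Counts <- mapply(function(y, m) sum(toy$year == y & toy$momOlder == m),
                       cells$year, cells$momOlder)

# Predict per cell, then aggregate to a yearly rate using cell counts.
cells$pred <- predict(fit, newdata = cells, type = "response")
predByYear <- sapply(split(cells, cells$year),
                     function(d) weighted.mean(d$pred, d$Counts))
actualByYear <- tapply(toy$IsMultiple, toy$year, mean)

# "What if?" scenario: keep each year's demographic mix, freeze the time trend.
whatIf <- cells
whatIf$year <- 0
cells$predNoTrend <- predict(fit, newdata = whatIf, type = "response")
noTrendByYear <- sapply(split(cells, cells$year),
                        function(d) weighted.mean(d$predNoTrend, d$Counts))
```

Comparing `predByYear` with `noTrendByYear` separates the change due to the shifting demographic mix from the change due to the fitted time trend.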
Drill Down: Look at Predictions for Specific Groups
  Select out predicted values from prediction data frame for detailed groups:
    Age of mother and father
    Race/ethnicity of mother and father
Conclusions from Logistic Regression
  Dads matter: Use of fertility treatment is associated with older dads as well as older moms
  Hispanics show a relatively small increase in multiples for both younger and older couples
  Asian pattern is similar to whites, but lower for all age groups
  Blacks show a similar increase in multiples for both younger and older couples
  Other factors related to multiples? (Diet, high BMI, other genetic factors)
But How Many Babies?
GLM Tweedie Model: How Many Babies?
  When the power parameter is between 1 and 2, Tweedie is a compound Poisson distribution
  Appropriate for positive data that also includes a ‘clump’ of exact zeros
  Dependent variable: Number of additional babies (Plurality - 1)
  Same independent variables
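To make the "clump of exact zeros" concrete, the sketch below simulates a Tweedie variable with power between 1 and 2 as a Poisson number of gamma-distributed pieces. The function name `rTweedieCP` and the parameter values are hypothetical; the mapping from mean `mu`, dispersion `phi`, and `power` to the Poisson rate and gamma shape and scale is the standard compound Poisson-gamma representation, which gives mean `mu` and variance `phi * mu^power`.

```r
set.seed(42)

# Hypothetical helper: draws n Tweedie variates for 1 < power < 2
# via the compound Poisson-gamma representation.
rTweedieCP <- function(n, mu, phi, power) {
  stopifnot(power > 1, power < 2)
  lambda <- mu^(2 - power) / (phi * (2 - power))   # Poisson rate
  shape  <- (2 - power) / (power - 1)              # gamma shape per piece
  scale  <- phi * (power - 1) * mu^(power - 1)     # gamma scale per piece
  N <- rpois(n, lambda)
  y <- numeric(n)                                  # N == 0 stays an exact zero
  pos <- N > 0
  # A sum of N iid gammas with a common scale is gamma with shape N * shape.
  y[pos] <- rgamma(sum(pos), shape = N[pos] * shape, scale = scale)
  y
}

# Illustrative values only, not estimates from the birth data.
y <- rTweedieCP(1e5, mu = 0.03, phi = 1, power = 1.5)
zeroShare    <- mean(y == 0)
expectedZero <- exp(-0.03^0.5 / 0.5)   # P(N = 0) = exp(-lambda)
```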
Further Research
  Data management
    Further cleaning of data
    Import more variables
    Import more years (additional use of weights)
  Multiple Births Analysis
    More variables (e.g., proxies for fertility treatment trends)
    Investigation of sub-groups (e.g., young blacks)
    Improved computation of number of births per delivery
  Other analyses with birth data
Summary of Approach
  “Unpredictability” of multiple births requires a large data set to have the power to capture effects
  Significant challenges in importing and cleaning the data – using R and .xdf files makes it possible
  Even with a huge data set, “cells” of tables looking at multiple factors can be small
  Using a combined .xdf file, we can use individual-level analysis to examine conditional effects of a variety of factors
References
  Martin JA, Hamilton BE, Osterman MJK. Three decades of twin births in the United States, 1980-2009. NCHS data brief, no. 80. Hyattsville, MD: National Center for Health Statistics. 2012.
  Blondel B, Kaminski M. Trends in the occurrence, determinants, and consequences of multiple births. Seminars in Perinatology. 2002; 26(4):239-49.
  Vahratian A. Utilization of fertility-related services in the United States. Fertil Steril. 2008 October; 90(4):1317-1319.
Thank you!
  R-Core Team
  R Package Developers
  R Community
  Revolution R Enterprise Customers and Beta Testers
  Colleagues at Revolution Analytics
  Contact: [email protected]