The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R
DESCRIPTION
Presentation by Sue Ranney of Revolution Analytics at JSM 2012, San Diego CA, Aug 1 2012. The Centers for Disease Control and Prevention recently issued a report, widely cited in the popular press, on the increased incidence of multiple births in the United States over the last 30 years. Twin birth rates were extracted from annual birth data by a variety of mother's characteristics in order to examine this trend. Our research extends this analysis by applying multivariate analysis to individual-level data obtained from public-use data sets on all births in the United States from 1985 to 2009. We combine the data into a single, multi-year data file (an .xdf file easily accessed by R) containing over 100 million birth records. To analyze the relationship between parental characteristics and multiple birth pregnancies, we first change the unit of observation from the baby to the pregnancy in order to remove replicated observations of parents of multiples. Then, estimating a logistic regression on all of the remaining observations, we show that the trends in increased multiple births are more strongly associated with the age of father than the age of mother, and that controlling for ages, the relative incidence of multiple births for black mothers has been declining.
TRANSCRIPT
![Page 1: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/1.jpg)
Revolution Confidential
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R
Susan I. Ranney, Ph.D.
JSM 2012
![Page 2: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/2.jpg)
The Tools
- Open Source R
  - Flexible, powerful, great graphics
  - Great for prototyping, not very scalable
- RevoScaleR (with Revolution R Enterprise)
  - Efficient file format (.xdf)
  - Functions for accessing/importing external data sets (fixed format & delimited text, SPSS, SAS, ODBC)
  - Very fast, parallelized, distributed analysis functions (summary stats, crosstabs/cubes, linear models, kmeans clustering, logistic regression, glm)
![Page 3: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/3.jpg)
The Data
- Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
- “These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” – Joseph Adler, R in a Nutshell
![Page 4: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/4.jpg)
The U.S. Birth Data (continued)
- Data for each year are contained in a compressed, fixed-format text file
- Typically 3 to 4 million records per file
- Variables and structure of the files sometimes change from year to year, with birth certificates revised in 1989 and 2003. Warnings:
  - NOTE: THE RECORD LAYOUT OF THIS FILE HAS CHANGED SUBSTANTIALLY. USERS SHOULD READ THIS DOCUMENT CAREFULLY.
- Reporting can differ state-to-state for any given year
![Page 5: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/5.jpg)
The Process
- Basic “life cycle of analysis”
  - Import and combine data
    - 25 years
    - Over 100 million obs.
  - Check and clean data
  - Basic variable summaries
  - Big data logistic/glm regressions
- All on my laptop
- Option to distribute computations to cluster
![Page 6: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/6.jpg)
The Question
CDC Report in Jan. 2012
![Page 7: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/7.jpg)
The Question
- What accounts for the increase in multiple births in the United States?
- Can we separate out effects of mother’s and father’s ages, race/ethnicity?
- Examine time trends by sub-group, assumed to be associated with fertility treatment
- CDC finding: The older age of women at childbirth accounts for only about 1/3 of the rise in twinning over 30 years (but this mixes in the increased rate of “twinning” for older women)
![Page 8: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/8.jpg)
Importing the U.S. Birth Data for Use in R
![Page 9: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/9.jpg)
Example of Differences for Different Years
To create common variables across years, use common names and new factor levels for ‘colInfo’ in the rxImport function. For example:
For 1985:
    SEX = list(type="factor", start=35, width=1, levels=c("1", "2"), newLevels = c("Male", "Female"), description = "Sex of Infant")
For 2003:
    SEX = list(type="factor", start=436, width=1, levels=c("M", "F"), newLevels = c("Male", "Female"), description = "Sex of Infant")
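The harmonization idea behind these two colInfo entries can be sketched in base R, without RevoScaleR. The helpers `read_sex` and `make_line` and the toy records below are hypothetical; only the start positions, raw codes, and shared labels come from the slide.

```r
# Hypothetical helper: pull a one-character field out of fixed-format lines
# and map year-specific raw codes onto shared factor labels.
read_sex <- function(lines, start, levels) {
  raw <- substr(lines, start, start)  # field width is 1
  factor(raw, levels = levels, labels = c("Male", "Female"))
}

# Hypothetical helper: build a toy record with the code at a given position.
make_line <- function(code, start) {
  paste0(strrep(" ", start - 1), code)
}
lines1985 <- c(make_line("1", 35), make_line("2", 35))
lines2003 <- c(make_line("M", 436), make_line("F", 436))

sex1985 <- read_sex(lines1985, start = 35, levels = c("1", "2"))
sex2003 <- read_sex(lines2003, start = 436, levels = c("M", "F"))

# Both years now share the same labels, so they can be combined safely.
sexAll <- factor(c(as.character(sex1985), as.character(sex2003)),
                 levels = c("Male", "Female"))
table(sexAll)
```

Recoding to common labels at import time means later steps never need to know which year's layout a record came from.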
![Page 10: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/10.jpg)
Creating Transformed Variables on Import
Use standard R syntax to create transformed variables.
For example, create a factor for Mom’s Age using the imported MAGER integer variable:
    MomAgeR7 = cut(MAGER, breaks = c(0, 19, 24, 29, 34, 39, 98, 99), labels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "Over 39", "Missing"))
Create a binary variable for “IsMultiple”:
    IsMultiple = DPLURAL_REC != 'Single'
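To see how these two transforms behave, here is a small base R sketch on toy values. The ages and the plurality level names other than 'Single' are made up for illustration; the breaks and labels are the ones from the slide.

```r
# Toy ages; 99 is the value the slide's breaks map to "Missing".
MAGER <- c(15L, 19L, 20L, 24L, 31L, 39L, 40L, 54L, 99L)
MomAgeR7 <- cut(MAGER,
                breaks = c(0, 19, 24, 29, 34, 39, 98, 99),
                labels = c("Under 20", "20-24", "25-29", "30-34",
                           "35-39", "Over 39", "Missing"))

# cut() uses right-closed intervals by default, so 19 falls in "Under 20"
# and 20 falls in "20-24".
print(data.frame(MAGER, MomAgeR7))

# Toy plurality recode; level names other than 'Single' are assumptions.
DPLURAL_REC <- factor(c("Single", "Twin", "Single", "Triplet or higher"))
IsMultiple <- DPLURAL_REC != "Single"
print(IsMultiple)
```

Because the intervals are right-closed, each break is the last age included in its group, which is why the breaks are 19, 24, 29 and so on rather than 20, 25, 30.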
![Page 11: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/11.jpg)
Steps for Complete Import
- Lists for column information and transforms are created for 3 base years: 1985, 1989, 2003, when there were very large changes in the structure of the input files
- Changes to these lists are made where appropriate for in-between years
- A test script is run, importing only 1000 observations per year for a subset of years
- Full script is run, importing each year, sorting according to key demographic characteristics, and appending to a master .xdf file
![Page 12: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/12.jpg)
Examining and Cleaning the Big Data File
![Page 13: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/13.jpg)
Examining Basic Information
Basic file information:
    > rxGetInfo(birthAll)
    File name: C:\Revolution\Data\CDC\BirthUS85to09S.xdf
    Number of observations: 100672041
    Number of variables: 50
    Number of blocks: 215
Use rxSummary to compute summary statistics for continuous data and category counts for each of the factors (about 4 minutes on my laptop):
    rxSummary(~., data=birthAll, blocksPerRead = 10)
![Page 14: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/14.jpg)
Example of Summary Statistics

| DadAgeR8 | Counts |
|----------|-------:|
| Under 20 | 3226602 |
| 20-24 | 15304803 |
| 25-29 | 23805056 |
| 30-34 | 23179418 |
| 35-39 | 13289015 |
| 40-44 | 4984146 |
| Over 44 | 2140207 |
| Missing | 14742794 |

| MomAgeR7 | Counts |
|----------|-------:|
| Under 20 | 11918891 |
| 20-24 | 25975642 |
| 25-29 | 28701398 |
| 30-34 | 22341530 |
| 35-39 | 9788753 |
| Over 39 | 1945827 |
| Missing | 0 |
![Page 15: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/15.jpg)
Histograms by Year
- Easily check for basic errors in data import (e.g. wrong position in file) by creating histograms by year – very fast (just seconds on my laptop)
- Example: Distribution of mother’s age by year. Use F() to have the integer year treated as a factor.

    rxHistogram(~MAGER | F(DOB_YY), data=birthAll, blocksPerRead = 10, layout=c(5,5))
![Page 16: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/16.jpg)
Age of Mother Over Time
![Page 17: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/17.jpg)
Drill Down and Extract Subsamples
Take a quick look at “older” fathers:
    rxSummary(~F(UFAGECOMB), data=birthAll, blocksPerRead = 10)

| Dad’s Age | Counts |
|-----------|-------:|
| 80 | 141 |
| 81 | 108 |
| 82 | 81 |
| 83 | 74 |
| 84 | 56 |
| 85 | 43 |
| 86 | 43 |
| 87 | 26 |
| 88 | 27 |
| 89 | 3327 |

What’s going on with 89-year-old Dads? Extract a data frame:
    dad89 <- rxDataStep(
        inData = birthAll,
        rowSelection = UFAGECOMB == 89,
        varsToKeep = c("DOB_YY", "MAGER", "MAR", "STATENAT", "FRACEREC"),
        blocksPerRead = 10)
![Page 18: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/18.jpg)
Year and State for 89-Year-Old Fathers
    rxCube(~F(DOB_YY):STATENAT, data=dad89, removeZeroCounts=TRUE)

| F_DOB_YY | STATENAT | Counts |
|----------|----------|-------:|
| 1990 | California | 1 |
| 1999 | California | 1 |
| 2000 | California | 1 |
| 1996 | Hawaii | 1 |
| 1997 | Louisiana | 1 |
| 1986 | New Jersey | 1 |
| 1995 | New Jersey | 1 |
| 1996 | Ohio | 1 |
| 1989 | Texas | 3316 |
| 1990 | Texas | 1 |
| 2001 | Texas | 1 |
| 1985 | Washington | 1 |
![Page 19: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/19.jpg)
Demographics of Multiple Births 1985-2009
![Page 20: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/20.jpg)
Unit of Observation: Delivery
- Appropriate unit of observation is the delivery (resulting in 1 or more live births) rather than the individual birth.
- Use 1/Plurality as probability weight
- Alternative: Look at nearby records to compute the “Reported Delivery Birth Order” (RDPO), then select only the 1st born in the delivery for the analysis
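A toy base R sketch of why 1/Plurality works as a weight: each delivery of k babies contributes k records, so weighting each record by 1/k makes every delivery count once. The data frame and the `delivery_id` column are hypothetical and exist only to check the arithmetic; the weight name `PluralWeight` follows the slides.

```r
# Toy birth records: one row per baby.
births <- data.frame(
  delivery_id = c(1, 2, 2, 3, 4, 4, 4),  # hypothetical key, for checking only
  plurality   = c(1, 2, 2, 1, 3, 3, 3)
)
births$PluralWeight <- 1 / births$plurality
births$IsMultiple   <- births$plurality > 1

# Weighted record counts recover delivery counts.
n_deliveries          <- sum(births$PluralWeight)
n_multiple_deliveries <- sum(births$PluralWeight[births$IsMultiple])

# Share of multiples per delivery versus per baby.
rate_per_delivery <- n_multiple_deliveries / n_deliveries
rate_per_birth    <- mean(births$IsMultiple)
print(c(per_delivery = rate_per_delivery, per_birth = rate_per_birth))
```

In this toy data the unweighted per-baby rate overstates the per-delivery rate, which is the replication problem the weight is meant to remove.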
![Page 21: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/21.jpg)
![Page 22: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/22.jpg)
![Page 23: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/23.jpg)
![Page 24: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/24.jpg)
Multiple Births Logistic Regression & Predictions
![Page 25: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/25.jpg)
Logistic Regression
Logistic Regression Results for:
    IsMultiple ~ DadAgeR8 + MomAgeR7 + FRaceEthn + MRaceEthn +
        DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 + MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85
    File name: C:\Revolution\Data\CDC\BirthUS85to09R.xdf
    Probability weights: PluralWeight
    Dependent variable(s): IsMultiple
    Total independent variables: 118 (Including number dropped: 12)
    Number of valid observations: 100672041
    Number of missing observations: 0
    -2*LogLikelihood: 14555447.8514 (Residual deviance on 100671935 degrees of freedom)
About 6 minutes on my laptop
![Page 26: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/26.jpg)
Counts of Deliveries by all Demographic Combinations by Year
    rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY),
        data = birthAllR, pweights = "PluralWeight",
        blocksPerRead = 10)
- Under 10 seconds to compute on my laptop
- Resulting data frame has 50,400 rows representing all the demographic combinations for each of the 25 years
  - 44,661 have counts <= 1000
- Provides input data for predictions and weights for aggregating predictions
![Page 27: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/27.jpg)
Predict and Aggregate by Year
    predOut <- rxPredict(logitObj, data = catAllDF)
- Create predictions for each detailed demographic group
- Aggregate for each year using population percentages for each detailed group for each year
- Compare with actuals
- Perform “What if?” scenarios
  - Scenario 1: Only change in demographics; no “fertility-treatment” time trend
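The aggregation step can be illustrated with a minimal base R sketch. Everything in the toy data frame (column names, groups, counts, predicted probabilities) is hypothetical. The "no time trend" scenario is approximated here by freezing each group's prediction at its 1985 value, which is an assumption made for illustration and not necessarily how the talk's scenario was computed.

```r
# Toy stand-in for the cube/prediction data frame: one row per
# demographic group and year.
pred <- data.frame(
  year   = c(1985, 1985, 2009, 2009),
  group  = c("younger", "older", "younger", "older"),
  Counts = c(900, 100, 700, 300),          # deliveries in each cell
  pHat   = c(0.010, 0.020, 0.012, 0.040)   # predicted P(multiple) per cell
)

# Yearly rate: mean of cell predictions weighted by cell counts.
byYear <- sapply(split(pred, pred$year),
                 function(d) weighted.mean(d$pHat, d$Counts))

# "What if?": hold each group's prediction at its 1985 value so that only
# the demographic mix changes across years.
base1985 <- pred[pred$year == 1985, ]
pred$pHatNoTrend <- base1985$pHat[match(pred$group, base1985$group)]
noTrend <- sapply(split(pred, pred$year),
                  function(d) weighted.mean(d$pHatNoTrend, d$Counts))

print(rbind(byYear, noTrend))
```

Comparing the two rows separates the part of the change that comes from a shifting demographic mix from the part that comes from within-group changes over time.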
![Page 28: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/28.jpg)
![Page 29: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/29.jpg)
Drill Down: Look at Predictions for Specific Groups
Select out predicted values from prediction data frame for detailed groups:
- Age of mother and father
- Race/ethnicity of mother and father
![Page 30: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/30.jpg)
![Page 31: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/31.jpg)
![Page 32: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/32.jpg)
Conclusions from Logistic Regression
- Dads matter: Use of fertility treatment associated with older dads as well as older moms
- Hispanics show relatively small increase in multiples for both younger and older couples
- Asian pattern similar to whites, but lower for all age groups
- Blacks show similar increase in multiples for both younger and older couples
- Other factors related to multiples? (Diet, high BMI, other genetic factors)
![Page 33: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/33.jpg)
But How Many Babies?
![Page 34: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/34.jpg)
GLM Tweedie Model: How Many Babies?
- When power parameter is between 1 and 2, Tweedie is a compound Poisson distribution
- Appropriate for positive data that also includes a ‘clump’ of exact zeros
- Dependent variable: Number of additional babies (Plurality - 1)
- Same independent variables
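For reference, the standard textbook form of the Tweedie family is sketched below. The symbols (mean \mu, dispersion \phi, power p, Poisson rate \lambda) are conventional notation and do not appear in the slides.

```latex
% Tweedie variance function with mean \mu, dispersion \phi, power p
\operatorname{Var}(Y) = \phi\,\mu^{p}

% For 1 < p < 2, Y is a compound Poisson-gamma sum:
Y = \sum_{i=1}^{N} X_i, \qquad
N \sim \operatorname{Poisson}(\lambda), \qquad
X_i \overset{\text{iid}}{\sim} \operatorname{Gamma}

% N = 0 gives an empty sum, hence a point mass at exactly zero:
\Pr(Y = 0) = \Pr(N = 0) = e^{-\lambda} > 0
```

The point mass at zero is what produces the ‘clump’ of exact zeros, which here corresponds to single-baby deliveries where Plurality - 1 equals 0.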
![Page 35: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/35.jpg)
![Page 36: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/36.jpg)
Further Research
- Data management
  - Further cleaning of data
  - Import more variables
  - Import more years (additional use of weights)
- Multiple Births Analysis
  - More variables (e.g., proxies for fertility treatment trends)
  - Investigation of sub-groups (e.g., young blacks)
  - Improved computation of number of births per delivery
- Other analyses with birth data
![Page 37: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/37.jpg)
Summary of Approach
- “Unpredictability” of multiple births requires a large data set to have the power to capture effects
- Significant challenges in importing and cleaning the data – using R and .xdf files makes it possible
- Even with a huge data set, “cells” of tables looking at multiple factors can be small
- Using the combined .xdf file, we can use individual-level analysis to examine conditional effects of a variety of factors
![Page 38: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/38.jpg)
References
- Martin JA, Hamilton BE, Osterman MJK. Three decades of twin births in the United States, 1980-2009. NCHS data brief, no. 80. Hyattsville, MD: National Center for Health Statistics. 2012.
- Blondel B, Kaminski M. Trends in the Occurrence, Determinants, and Consequences of Multiple Births. Seminars in Perinatology. 2002; 26(4):239-49.
- Vahratian A. Utilization of Fertility-Related Services in the United States. Fertil Steril. 2008 October; 90(4):1317-1319.
![Page 39: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R](https://reader030.vdocuments.us/reader030/viewer/2022020122/55630601d8b42a6f598b5476/html5/thumbnails/39.jpg)
Thank you!
- R-Core Team
- R Package Developers
- R Community
- Revolution R Enterprise Customers and Beta Testers
- Colleagues at Revolution Analytics

Contact: [email protected]