Documenting Data Transformations
![Page 1: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/1.jpg)
“Provenance and Social Science Data”, 15 March 2017
Documenting Data Transformations
George Alter, University of Michigan
![Page 2: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/2.jpg)
Why Metadata?
• Data are useless without metadata – “data about data”
• Metadata should:
– Include all information about data creation
– Describe transformations to variables
– Be easy to create
• Our goal: Automated capture of metadata
![Page 3: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/3.jpg)
A few words about ICPSR
• World’s largest archive of social science data
• Consortium established 1962
• 760+ member institutions around the world
• Founding member and home office for the DDI Alliance
![Page 4: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/4.jpg)
Powered by DDI Metadata
ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML.
Codebooks (PDF and online) are rendered from the DDI.
![Page 5: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/5.jpg)
Searchable database of 4.5M variables
Click here for online codebook
![Page 6: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/6.jpg)
Online codebook shows variable in context of dataset
Link to online crosstab tool
What question was asked?
How was the question coded?
Link to online graph tool
![Page 7: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/7.jpg)
Searchable database of 4.5M variables
Click here for variable comparison
![Page 8: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/8.jpg)
Variable comparison display
Click here for online codebook
![Page 9: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/9.jpg)
Search for datasets with 3 desired variables
Check boxes for variable comparison
![Page 10: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/10.jpg)
Crosswalk for American National Election Study (ANES) and General Social Survey (GSS)
Columns link to 70 datasets
134 tags in 8 lists
Variable comparison display
Variables linked to online codebooks
![Page 11: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/11.jpg)
Metadata for the American National Election Study
What question was asked?
Who answered this question?
How was the question coded?
Who answered this question?
![Page 12: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/12.jpg)
Metadata for the American National Election Study
Who answered this question?
Who answered this question?
How do we know who answered the question?
It’s in the PDF.
![Page 13: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/13.jpg)
When data arrive at the archive…
• No question text
• No interview flow (question order, skip pattern)
• No variable provenance
• Data transformations are not documented.
![Page 14: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/14.jpg)
How is research data created?
• Most surveys are conducted with computer-assisted interview (CAI) software
– CATI – Computer-assisted Telephone Interview
– CAPI – Computer-assisted Personal Interview
– CAWI – Computer-assisted Web Interview
• There is no paper questionnaire
• The CAI program is the questionnaire
– i.e., the program is the metadata
![Page 15: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/15.jpg)
We already have tools to convert CAI to machine-readable metadata.
[Diagram: Computer Assisted Interviewing (CAI) produces the original data. A CAI-to-DDI converter (Colectica, MQDS, others) produces the original metadata as DDI XML.]
![Page 16: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/16.jpg)
What happens when a project modifies the data?
[Diagram: Computer Assisted Interviewing (CAI) produces the original data, and a CAI-to-DDI converter (Colectica, MQDS, others) produces the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data.]
The modified data no longer match the metadata.
![Page 17: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/17.jpg)
Metadata are re-created after the data are transformed.
[Diagram: Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. A statistical-package-to-DDI tool extracts metadata from the SPSS/SAS/Stata/R data file into DDI XML (extracted metadata), separate from the original metadata (DDI XML) produced by the CAI-to-DDI converter (Colectica, MQDS, others).]
Transformations are documented by hand.
![Page 18: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/18.jpg)
Statistics packages have limited metadata
• Variable names
• Variable labels
• Value labels
• No provenance
![Page 19: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/19.jpg)
Automating the capture of transformation metadata.
[Diagram: A Script Parser reads the command scripts (SPSS, SAS, Stata, R) and expresses them in the Standard Data Transformation Language (SDTL). An XML Updater uses the SDTL and the original metadata (DDI XML, from the CAI-to-DDI converter: Colectica, MQDS, others) to produce revised metadata (DDI XML) alongside the revised data.]
Missing links that we will build.
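The pipeline on this slide can be illustrated with a toy example. The sketch below is only an assumption about how such a pipeline might look: the dictionary layout is a hypothetical stand-in, not the actual SDTL or DDI schema, and `parse_stata` and `update_metadata` are illustrative names that handle a single Stata command form.

```python
import re

# Hypothetical pattern for one Stata command form:
#   generate <variable>=<expression> if <condition>
GENERATE = re.compile(r"^generate\s+(\w+)\s*=\s*(\S+)\s+if\s+(.+)$")


def parse_stata(line):
    """Translate one Stata 'generate' command into a package-neutral dict.

    The dict is an illustrative stand-in for an SDTL command, not real SDTL.
    """
    match = GENERATE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported command: {line!r}")
    variable, expression, condition = match.groups()
    return {
        "command": "compute",
        "variable": variable,
        "expression": expression,
        "condition": condition.strip(),
    }


def update_metadata(metadata, step, source):
    """Return a copy of the variable metadata with the step recorded as provenance."""
    updated = {name: dict(info) for name, info in metadata.items()}
    entry = updated.setdefault(step["variable"], {"label": "", "derivation": []})
    # Build a new list so the original metadata is never modified in place.
    entry["derivation"] = list(entry.get("derivation", [])) + [
        {"source": source, **step}
    ]
    return updated


# Hypothetical variable-level metadata standing in for the original DDI XML.
original = {"X": {"label": "Original survey item", "derivation": []}}

step = parse_stata("generate Y=9 if X>3")
revised = update_metadata(original, step, source="Stata")
print(revised["Y"]["derivation"])
```

The point of the neutral record is that the same `update_metadata` step could serve scripts from any package once a parser exists for that package.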
![Page 20: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/20.jpg)
What statistics packages should be covered?
ICPSR Downloads by Format
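| Format | All downloads | Studies with all formats |
|---|---|---|
| Delimited text | 43% | 29% |
| SPSS | 22% | 24% |
| SAS | 10% | 12% |
| Stata | 19% | 23% |
| R | 5% | 12% |
| Excel | 0% | 1% |
| Other | 0% | 0% |
| Total | 100% | 100% |
| Number | 378,007 | 154,663 |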
![Page 21: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/21.jpg)
Why do we need an SDTL?

SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.
Input Data: X = 2, 3, 4, -1

Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
Input Data: X = 2, 3, 4, -1

SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;
Input Data: X = 2, 3, 4, -1
![Page 22: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/22.jpg)
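Why do we need an SDTL?

SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.

| Input X | Output X | Y | Z |
|---|---|---|---|
| 2 | 2 | | 8 |
| 3 | 3 | | |
| 4 | 4 | 9 | |
| -1 | -1 | | |

Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3

| Input X | Output X | Y | Z |
|---|---|---|---|
| 2 | 2 | | 8 |
| 3 | 3 | | |
| 4 | 4 | 9 | |
| -1 | | 9 | |

SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;

| Input X | Output X | Y | Z |
|---|---|---|---|
| 2 | 2 | . | 8 |
| 3 | 3 | . | . |
| 4 | 4 | 9 | . |
| -1 | . | . | 8 |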
![Page 23: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/23.jpg)
What happens when a missing value is in a logical comparison?
• SPSS
– Logical expressions including a missing value are considered “Missing.” Usually, “Missing” is equivalent to “False.”
• Stata
– Missing values are treated as numbers equal to infinity. So, any number is less than a missing value.
• SAS
– Missing values are treated as numbers equal to minus infinity. So, any number is greater than a missing value.
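The three rules above can be emulated in a few lines of Python. This is a minimal sketch, not how any of the packages is implemented: `None` is an assumed marker for a missing value, `transform` is a hypothetical helper, and the SPSS rule is simplified to "a comparison with a missing value is never true."

```python
import math

MISSING = None  # assumed marker for a missing value; not any package's real representation


def transform(xs, package):
    """Emulate the three-command example (recode -1, then set Y and Z) under one package's rules."""
    if package not in ("spss", "stata", "sas"):
        raise ValueError(f"unknown package: {package!r}")
    rows = []
    for x in xs:
        is_missing = x == -1
        # SPSS only flags -1 as user-missing; Stata and SAS overwrite the stored value.
        stored = x if (package == "spss" or not is_missing) else MISSING
        if not is_missing:
            key = x
        elif package == "stata":
            key = math.inf    # missing sorts above every number
        elif package == "sas":
            key = -math.inf   # missing sorts below every number
        else:
            key = None        # SPSS: the comparison itself is missing, treated as false
        y = 9 if key is not None and key > 3 else MISSING
        z = 8 if key is not None and key < 3 else MISSING
        rows.append((stored, y, z))
    return rows


for package in ("spss", "stata", "sas"):
    # Each tuple is (X, Y, Z) after the transformation.
    print(package, transform([2, 3, 4, -1], package))
```

Only the row with the input value -1 differs across the three packages, which is the case a package-neutral description of the transformation has to record.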
![Page 24: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/24.jpg)
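Missing Values in Comparisons

SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.

| Input X | Output X | Y | Z |
|---|---|---|---|
| 2 | 2 | | 8 |
| 3 | 3 | | |
| 4 | 4 | 9 | |
| -1 | NULL | | |

Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3

| Input X | Output X | Y | Z |
|---|---|---|---|
| 2 | 2 | | 8 |
| 3 | 3 | | |
| 4 | 4 | 9 | |
| -1 | ∞ | 9 | |

SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;

| Input X | Output X | Y | Z |
|---|---|---|---|
| 2 | 2 | . | 8 |
| 3 | 3 | . | . |
| 4 | 4 | 9 | . |
| -1 | -∞ | . | 8 |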
![Page 25: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/25.jpg)
Benefits of automated metadata capture
• Metadata will be better
– All the information in the CAI can be included.
– Variable transformations can be described.
• Automation will lower costs
– Metadata will not be discarded and re-created.
• All metadata will be standardized and machine readable
– Codebooks with rich information can be rendered at will.
• If we make it easy and beneficial, researchers will use it.
![Page 26: Documenting Data Transformations](https://reader035.vdocuments.us/reader035/viewer/2022070517/58d046661a28ab8e5b8b6471/html5/thumbnails/26.jpg)
Continuous Capture of Metadata for Statistical Data (NSF ACI-1640575)
Project Partners
• Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
• Colectica
• Metadata Technology North America
• Norwegian Centre for Research Data
• General Social Survey, NORC, University of Chicago
• American National Election Study, University of Michigan