TRANSCRIPT
Replicating Results: Procedures and Pitfalls
June 1, 2005
The JMCB Data Storage and Evaluation Project
• Project summary:
– Part 1: In July 1982, the JMCB began requesting programs/data from authors
– Part 2: Attempt to replicate published results based on those submissions
• Review of results from Part 2 in Dewald, Thursby, and Anderson, "Replication in Empirical Economics: The Journal of Money, Credit and Banking Project," The American Economic Review, September 1986
The JMCB Data Storage and Evaluation Project/ Dewald et al
• The paper focuses on Part 2:
– How people responded to the request
– The quality of the data that was submitted
– The actual success (or lack thereof) of the replication efforts
The JMCB Data Storage and Evaluation Project/ Dewald et al
• Three groups:
– Group 1: Papers submitted and published prior to 1982. These authors did not know upon submission that they would subsequently be asked for programs/data.
– Group 2: Authors whose papers were accepted for publication beginning July 1982
– Group 3: Authors whose papers were under review beginning July 1982
Summary of Responses/Datasets Submitted (Dewald et al, p. 591)

                                 Group 1    Group 2    Group 3
Requests                              62         27         65
Responses:
  Total                               42         26         49
  Percent                            68%        96%        75%
  Mean response time (days)          217        125        130
Datasets Submitted              22 (35%)   21 (78%)   47 (72%)
Datasets Not Submitted                40          6         18
  Confidential Data                    2          1          0
  Lost or Destroyed Data              14          2          1
  Data Available, but Not Sent         4          2          1
  Nonrespondents                      20          1         16
Summary of Examined Datasets (Dewald et al, pp. 591-592)

                                               Group 1   Group 2   Group 3
Total Datasets Submitted                            22        20        47
Datasets Examined                                   19        14        21
No Problems                                          1         3         4
Problems, by type:
  Incomplete Submission                              6         3         5
  Sources Cited Incorrectly                          0         4         4
  Sources Cited Imprecisely                         11         7        10
  Data Transformations Described Incompletely        3         4         1
  Data Element Not Clearly Defined                   2         3         2
  Other                                              0         3         1
Total Problems                                      22        24        23
"Our findings suggest that inadvertent errors in published empirical articles are a commonplace rather than a rare occurrence." – Dewald et al, pp. 587-588
"We found that the very process of authors compiling their programs and data for submission reveals to them ambiguities, errors, and oversights which otherwise would be undetected." – Dewald et al, p. 589
Raw data to finished product
Raw data -> Analysis data -> Runs/results -> Finished product
Raw Data -> Analysis Data
• Always have two distinct data files: the raw data and the analysis data
• A program should completely re-create the analysis data from the raw data (a minimal sketch follows)
• NO interactive changes!! Final changes must go in a program!!
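For example, a minimal Stata sketch of such a program (all file and variable names, such as raw/rawdata.csv, data/analysis.dta, income, and id, are hypothetical placeholders):

* cleandata.do -- rebuilds the analysis data from the raw data on every run
clear
insheet using raw/rawdata.csv           // read the raw file; never edit it by hand
replace income = . if income < 0        // documented fix: negative incomes are coding errors
drop if missing(id)                     // documented fix: records with no id are unusable
save data/analysis.dta, replace         // analysis data is re-created, never patched in place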
Raw Data -> Analysis Data
• Document all of the following:
– Outliers
– Errors
– Missing data
– Changes to the data
• Remember to check (see the sketch below):
– Consistency across variables
– Duplicates
– Individual records, not just summary stats
– "Smell tests"
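A hedged sketch of such checks in Stata (the variables id, age, and birthyear, and the survey year 2004, are assumptions for illustration):

* hypothetical checks on the analysis data
duplicates report id                    // any duplicated records?
assert age >= 0 & age < 120             // smell test: ages must be plausible
assert age == 2004 - birthyear          // consistency across variables
list id age birthyear in 1/10           // eyeball individual records, not just summary stats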
Analysis Data -> Results
• All results should be produced by a program
• The program should use the analysis data (not the raw data)
• Have a "translation" of raw variable names -> analysis variable names -> publication variable names (a sketch follows below)
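One way to keep that translation inside the programs themselves is with Stata's rename and label variable commands; the raw names below (h00x_inc, h00x_educ) and the table references are hypothetical:

* hypothetical translation: raw name -> analysis name, publication name in the label
rename h00x_inc hh_income
label variable hh_income "Household income ('Income' in Table 2)"
rename h00x_educ educ_years
label variable educ_years "Years of education ('Education' in Table 2)"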
Analysis Data -> Results
• Document (a sketch follows):
– How were variances estimated? Why?
– What algorithms were used and why? Were results robust?
– What starting values were used? Was convergence sensitive?
– Did you perform diagnostics? Include them in programs/documentation.
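For instance, a hedged sketch of recording those choices directly in the program (the model and variable names are hypothetical, and robust standard errors stand in for whatever estimator was actually chosen):

* hypothetical estimation step; the point is that every choice is written down
* variances: Huber-White robust, because residual plots suggested heteroskedasticity
regress hh_income educ_years, robust
* robustness check: classical standard errors for comparison
regress hh_income educ_years
* diagnostic: residuals-vs-fitted plot to look for remaining structure
rvfplot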
Thinking ahead
• Delete or archive old files as you go
• Use a meaningful directory structure (/raw, /data, /programs, /logfiles, /graphs etc.)
• Use relative pathnames
• Use meaningful variable names
• Use a script to sequentially run programs
Example script to sequentially run programs

#!/bin/csh
# File location: /u/machine/username/project/scripts/myproj.csh
# Author: your name
# Date: 9/21/04
# This script runs a do-file in Stata which produces and saves a .dta file
# in the data directory. Stat/Transfer converts the .dta file to .sas7bdat
# and saves it in the data folder. The program analyze.sas is then run on
# the new SAS data file.

cd /u/machine/username/project/
set file = H00x_B    # assumed fix: $file was used below but never defined
stata -b do programs/cleandata.do
st data/$file.dta data/$file.sas7bdat
sas programs/analyze.sas
Log files
• Your log file should tell a story to the reader.
• As you print results to the log file, include words explaining the results
• Don't output everything to the log file; use quietly and noisily in a meaningful way
• Include not only what your code is doing, but also your reasoning and thought process (a minimal sketch follows)
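A minimal sketch of a self-narrating log, reusing the hypothetical file and variable names from the earlier examples:

* hypothetical logging skeleton
log using logfiles/analyze.log, replace text
display "Step 1: rebuild the analysis data from the raw data."
quietly do programs/cleandata.do        // routine cleaning output is suppressed
display "Step 2: main regression (robust SEs; see program comments for why):"
regress hh_income educ_years, robust
log close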
Project Clean-up
• Create a zip file that contains everything necessary for complete replication
• Delete/archive unused or old files
• Include any referenced files in zip
• When you have a final zip archive containing everything:
– Unpack it into its own directory and run the script
– Check that all the results match
When there are data restrictions…
• Consider releasing:
– the subset of the raw data actually used
– your analysis data as opposed to the raw data
– (at a minimum) notes on the process from raw to analysis data, PLUS everything pertaining to the data analysis
• Consider "internal" and "external" versions of your log file (a fuller sketch follows below):
– Do this via a variable at the top of your do-files:

local internal = 1
…
list if `internal' == 1
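A slightly fuller, hedged version of that switch (file and variable names are again hypothetical); set the local to 0 before producing the external log:

local internal = 1                      // set to 0 for the external version
log using logfiles/results.log, replace
regress hh_income educ_years, robust    // appears in both versions
if `internal' == 1 {
    list id hh_income in 1/20           // record-level detail stays internal only
}
log close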
Ethical Issues
• All authors are responsible for proper clean-up of the project
• This is extremely important whether or not you plan on releasing data and programs
• Motivation:
– self-interest
– honest research
– the scientific method
– allowing others to be critical of your methods/results
– furthering your field
Ethical Issues – for discussion
• What if third-party redistribution of the data is not allowed?
• What are possible solutions for releasing data while protecting your time investment in data collection?
• Is it unfair to ask people to release data after a huge time investment in the collection?