TRANSCRIPT
Replicating Results: Procedures and Pitfalls
June 1, 2005
The JMCB Data Storage and Evaluation Project
• Project summary:
– Part 1: In July 1982, the JMCB began requesting programs/data from authors
– Part 2: Attempt to replicate published results based on those submissions
• Review of results from Part 2 in Dewald, Thursby, and Anderson, "Replication in Empirical Economics: The Journal of Money, Credit and Banking Project," The American Economic Review, September 1986
The JMCB Data Storage and Evaluation Project/ Dewald et al
• The paper focuses on Part 2:
– How people responded to the request
– The quality of the data that was submitted
– The actual success (or lack thereof) of the replication efforts
The JMCB Data Storage and Evaluation Project/ Dewald et al
• Three groups:
– Group 1: Papers submitted and published prior to 1982. These authors did not know upon submission that they would subsequently be asked for programs/data.
– Group 2: Authors whose papers were accepted for publication beginning July 1982
– Group 3: Authors whose papers were under review beginning July 1982
Summary of Responses/Datasets Submitted (Dewald et al, p. 591)

                                 Group 1    Group 2    Group 3
Requests                              62         27         65
Responses:
  Total                               42         26         49
  Percent                            68%        96%        75%
  Mean response time (days)          217        125        130
Datasets Submitted              22 (35%)   21 (78%)   47 (72%)
Datasets Not Submitted                40          6         18
  Confidential Data                    2          1          0
  Lost or Destroyed Data              14          2          1
  Data Available, but Not Sent         4          2          1
  Nonrespondents                      20          1         16
Summary of Examined Datasets (Dewald et al, pp. 591-592)

                                               Group 1   Group 2   Group 3
Total Datasets Submitted                            22        20        47
Datasets Examined                                   19        14        21
No Problems                                          1         3         4
Problems, by type:
  Incomplete Submission                              6         3         5
  Sources Cited Incorrectly                          0         4         4
  Sources Cited Imprecisely                         11         7        10
  Data Transformations Described Incompletely        3         4         1
  Data Element Not Clearly Defined                   2         3         2
  Other                                              0         3         1
Total Problems                                      22        24        23
"Our findings suggest that inadvertent errors in published empirical articles are a commonplace rather than a rare occurrence." – Dewald et al, pp. 587-588
"We found that the very process of authors compiling their programs and data for submission reveals to them ambiguities, errors, and oversights which otherwise would be undetected." – Dewald et al, p. 589
Raw data to finished product
Raw data -> Analysis data -> Runs/results -> Finished product
Raw Data -> Analysis Data
• Always have two distinct data files: the raw data and the analysis data
• A program should completely re-create the analysis data from the raw data (a minimal sketch follows)
• NO interactive changes!! Final changes must go in a program!!
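For example, a minimal Stata sketch of such a program (all file and variable names, such as raw/rawdata.csv, data/analysis.dta, income, and id, are hypothetical placeholders):

* cleandata.do -- rebuilds the analysis data from the raw data on every run
clear
insheet using raw/rawdata.csv           // read the raw file; never edit it by hand
replace income = . if income < 0        // documented fix: negative incomes are coding errors
drop if missing(id)                     // documented fix: records with no id are unusable
save data/analysis.dta, replace         // analysis data is re-created, never patched in place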
Raw Data -> Analysis Data
• Document all of the following:
– Outliers
– Errors
– Missing data
– Changes to the data
• Remember to check (see the sketch below):
– Consistency across variables
– Duplicates
– Individual records, not just summary stats
– "Smell tests"
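A hedged sketch of such checks in Stata (the variables id, age, and birthyear, and the survey year 2004, are assumptions for illustration):

* hypothetical checks on the analysis data
duplicates report id                    // any duplicated records?
assert age >= 0 & age < 120             // smell test: ages must be plausible
assert age == 2004 - birthyear          // consistency across variables
list id age birthyear in 1/10           // eyeball individual records, not just summary stats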
Analysis Data -> Results
• All results should be produced by a program
• The program should use the analysis data (not the raw data)
• Have a "translation" of raw variable names -> analysis variable names -> publication variable names (a sketch follows below)
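One way to keep that translation inside the programs themselves is with Stata's rename and label variable commands; the raw names below (h00x_inc, h00x_educ) and the table references are hypothetical:

* hypothetical translation: raw name -> analysis name, publication name in the label
rename h00x_inc hh_income
label variable hh_income "Household income ('Income' in Table 2)"
rename h00x_educ educ_years
label variable educ_years "Years of education ('Education' in Table 2)"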
Analysis Data -> Results
• Document (a sketch follows):
– How were variances estimated? Why?
– What algorithms were used and why? Were results robust?
– What starting values were used? Was convergence sensitive?
– Did you perform diagnostics? Include them in programs/documentation.
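For instance, a hedged sketch of recording those choices directly in the program (the model and variable names are hypothetical, and robust standard errors stand in for whatever estimator was actually chosen):

* hypothetical estimation step; the point is that every choice is written down
* variances: Huber-White robust, because residual plots suggested heteroskedasticity
regress hh_income educ_years, robust
* robustness check: classical standard errors for comparison
regress hh_income educ_years
* diagnostic: residuals-vs-fitted plot to look for remaining structure
rvfplot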
Thinking ahead
• Delete or archive old files as you go
• Use a meaningful directory structure (/raw, /data, /programs, /logfiles, /graphs etc.)
• Use relative pathnames
• Use meaningful variable names
• Use a script to sequentially run programs
Example script to sequentially run programs

#!/bin/csh
# File location: /u/machine/username/project/scripts/myproj.csh
# Author: your name
# Date: 9/21/04
# This script runs a do-file in Stata which produces and saves a .dta file
# in the data directory. Stat/Transfer converts the .dta file to .sas7bdat
# and saves it in the data folder. The program analyze.sas is then run on
# the new SAS data file.

cd /u/machine/username/project/
set file = H00x_B    # assumed fix: $file was used below but never defined
stata -b do programs/cleandata.do
st data/$file.dta data/$file.sas7bdat
sas programs/analyze.sas
Log files
• Your log file should tell a story to the reader.
• As you print results to the log file, include words explaining the results
• Don't output everything to the log file; use quietly and noisily in a meaningful way
• Include not only what your code is doing, but also your reasoning and thought process (a minimal sketch follows)
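A minimal sketch of a self-narrating log, reusing the hypothetical file and variable names from the earlier examples:

* hypothetical logging skeleton
log using logfiles/analyze.log, replace text
display "Step 1: rebuild the analysis data from the raw data."
quietly do programs/cleandata.do        // routine cleaning output is suppressed
display "Step 2: main regression (robust SEs; see program comments for why):"
regress hh_income educ_years, robust
log close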
Project Clean-up
• Create a zip file that contains everything necessary for complete replication
• Delete/archive unused or old files
• Include any referenced files in zip
• When you have a final zip archive containing everything:
– Unpack it into its own directory and run the script
– Check that all the results match
When there are data restrictions…
• Consider releasing:
– the subset of the raw data actually used
– your analysis data as opposed to the raw data
– (at a minimum) notes on the process from raw to analysis data, PLUS everything pertaining to the data analysis
• Consider "internal" and "external" versions of your log file (a fuller sketch follows below):
– Do this via a variable at the top of your do-files:

local internal = 1
…
list if `internal' == 1
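A slightly fuller, hedged version of that switch (file and variable names are again hypothetical); set the local to 0 before producing the external log:

local internal = 1                      // set to 0 for the external version
log using logfiles/results.log, replace
regress hh_income educ_years, robust    // appears in both versions
if `internal' == 1 {
    list id hh_income in 1/20           // record-level detail stays internal only
}
log close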
Ethical Issues
• All authors are responsible for proper clean-up of the project
• This is extremely important whether or not you plan on releasing data and programs
• Motivation:
– self-interest
– honest research
– the scientific method
– allowing others to be critical of your methods/results
– furthering your field
Ethical Issues – for discussion
• What if third-party redistribution of the data is not allowed?
• What are possible solutions for releasing data while protecting your time investment in data collection?
• Is it unfair to ask people to release data after a huge time investment in the collection?