Dagstuhl Presentation 2012 - Larry Hoyle
Documenting Research Project Process for Reproducibility
Larry Hoyle
Institute for Policy & Social Research
University of Kansas
10/22/2012
The challenges
• Large (or complex) multi-disciplinary projects
– Multiple sites, data streams, standards, and practices
– Complex data preparation procedures
• Point and click software used
• Documenting as overhead
Example Project
• Farmers' land use decisions related to climate change (e.g. biofuel-related crops)
• One component of larger NSF grant
• Multiple teams, multiple universities
– The two main sites are 135 km apart
• Multi-disciplinary
– Economists, geographers, agronomists, biologists, engineers, climate scientists, anthropologist, sociologist, political scientists, urban planner, GIS experts, photographer
Example Project Data
– Develop substantial geodatabase (ARC SDE)
  • Ground cover, soils, crop statistics, facility locations (e.g. purchaser, processing plant), weather, climate, watershed and aquifer models
  • Sub-(farmer’s) field geographic level
– Climate models at different scales
– Focus groups and multi-wave survey (geocoded)
– Interviews coded in NVIVO (geocoded)
– Photographs
– Large proprietary dataset with time-limited use
Challenge - put it all together and document how it was done and how everything relates.
Other example: IASSIST posting
Spatial Aspects
• Reconciling different spatial schemes at multiple scales across time
– Raster images
– Model grids at different scales
– Weather point sources, other point locations (e.g. biorefineries)
– Political entity polygons (state, county)
– Farm field and sub-field polygons
– Attribute data at all these levels, imputed and aggregated data
• Harmonizing data from different geographic schemes
• Producing new spatial objects
– E.g. corners as separate from circle with center-pivot irrigation
New Polygons
Polygons to be extracted from remote sensing imagery
Subfield areas sometimes grow different crops (corners are 21% of the square)
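The 21% figure follows from geometry: a center-pivot circle inscribed in a square field covers π/4 of the square, so the four corners together make up 1 − π/4. A quick check, assuming an idealized circle that touches all four sides of the square (the function name is illustrative, not from the project):

```python
import math

def corner_fraction() -> float:
    """Fraction of a square field lying outside its inscribed circle."""
    # Circle area / square area = (pi * r^2) / (2r)^2 = pi / 4
    return 1.0 - math.pi / 4.0

print(f"{corner_fraction():.1%}")  # prints 21.5%
```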
Need to Capture Process Example 1
• Project member with expertise volunteered to process data to produce a spatial dataset (soils data).
• Users of the dataset discover anomalies
• Expert no longer available, can’t remember quite what he did and has no documentation (used point and click tools)
• Ouch
Process Example 2
• Qualitative analysis
– Transcription
– Multiple coders, common coding scheme
– Coding scheme evolves (capture this?)
– Training
– Paired coders code each interview
– Testing of coder reliability
• Integrate this after the fact with geodatabase
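The slides do not say which reliability measure the project used. As one common possibility, agreement between a pair of coders can be summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an assumption for illustration only; the code labels and segment data are made up.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same segments."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two equal-length, non-empty code lists")
    n = len(codes_a)
    # Share of segments where both coders chose the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical code throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical codes assigned by a pair of coders to six interview segments.
coder_1 = ["risk", "price", "risk", "water", "price", "risk"]
coder_2 = ["risk", "price", "water", "water", "price", "risk"]
print(round(cohens_kappa(coder_1, coder_2), 3))
```

Recording the coding-scheme version alongside each kappa value would be one way to "capture" how the scheme evolves.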
Point and Click
• Some tools are only point and click and don’t create a log.
– E.g. some procedures in ArcGIS
• How do you document process?
– Screen capture pasted into Word?
– Action recording software
– Discoverable? Machine actionable?
An ArcGIS process (different project)
NSFCHEMAnnualDataProcedure.docx
AnnualLinksByTime4.avi
Need Tools
• There is a need for tools built on top of standards that make it easy to capture and annotate process
Need Tools to Capture Process
One example – SAS Enterprise Guide
• Can modify nodes during development.
• Can run the process from any point
• But – overall process may involve multiple tools - in this case also R and ArcGIS. In other cases, multiple people in different settings.
Scott Long - The Workflow of Data Analysis Using Stata
http://www.indiana.edu/~jslsoc/web_workflow/wf_home.htm
Datasets – Permanent and temporary
Capturing Process as it is Being Developed
• False starts and blind alleys
– Does the whole process matter or only a process that reproduces the final result? (learn from my mistakes?)
– Description of process gets edited as it evolves
• Adding minimal overhead
– If the tool requires a lot of attention it won’t get used.
• Combining sub-processes
• Filling in pieces of overall planned project
• Parallel parts
• Time as ordinal or interval (or ratio?)
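For scripted steps (not point and click tools), one low-overhead way to capture process as it develops is to wrap each step so that every run is logged automatically, including runs that fail. The following is a minimal sketch under that assumption; `logged_step`, `PROCESS_LOG`, and `drop_ungeocoded` are hypothetical names, not part of any tool mentioned in the talk.

```python
import functools
import json
from datetime import datetime, timezone

PROCESS_LOG = []  # hypothetical in-memory log; a real tool would persist it

def logged_step(note=""):
    """Record each call of the wrapped function, including failed attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            entry = {
                "step": func.__name__,
                "note": note,
                "started": datetime.now(timezone.utc).isoformat(),
                "status": "ok",
            }
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                entry["status"] = f"failed: {exc}"  # keep blind alleys too
                raise
            finally:
                PROCESS_LOG.append(entry)  # runs on success and on failure
        return wrapper
    return decorator

@logged_step(note="drop records with no geocode")
def drop_ungeocoded(records):
    return [r for r in records if r.get("lat") is not None]

cleaned = drop_ungeocoded([{"id": 1, "lat": 38.9}, {"id": 2, "lat": None}])
print(json.dumps(PROCESS_LOG, indent=2))
```

The timestamps give an ordering of steps; whether they are treated as ordinal or interval data is left open, as on the slide.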
Tools – The Fantasy
• Annotated screen capture – works on top of any software
– Text (or audio/video?) annotation
– Dealing with IP in captured images
– Flow diagram with popups?
– Editable
– Time stamped
Sub process edited separately
Planned overall process
Persistent identifiers allow (re-)linking
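The idea that persistent identifiers allow a separately edited sub-process to be re-linked into the planned overall process can be sketched as a registry keyed by identifier. This is only an illustration: `ProcessNode`, `registry`, and the use of UUIDs as identifiers are assumptions, not a design from the talk.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ProcessNode:
    """A planned step or sub-process, addressed by a persistent identifier."""
    label: str
    pid: str = field(default_factory=lambda: str(uuid.uuid4()))
    children: list = field(default_factory=list)  # child pids, not objects

registry = {}  # pid -> ProcessNode

def register(node):
    registry[node.pid] = node
    return node

plan = register(ProcessNode("Planned overall process"))
soils = register(ProcessNode("Prepare soils layer"))
plan.children.append(soils.pid)

# Edit the sub-process separately; the plan still resolves it by pid.
registry[soils.pid] = ProcessNode("Prepare soils layer (revised)", pid=soils.pid)

for pid in plan.children:
    print(registry[pid].label)  # prints "Prepare soils layer (revised)"
```

Because the plan stores identifiers and not object references, replacing a sub-process description does not require touching the plan.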
Final thoughts
• Metadata for the audience
– Documentation for reproducibility
– Documentation in cases of disputed results
• Sometimes the researcher is the audience
– One researcher commented that having documentation at this level would be very helpful in writing methods sections of papers.
– Teaching tool - critique students' process
– Assists refining methods
– Also useful in future similar projects