Time | January 19, 2025 |
---|---|
8:00 | Breakfast |
9:00 | Introduction |
10:00 | Reproducible practices, A template README |
10:50 | Coffee break |
11:00 | Data provenance, data citations |
12:00 | Lunch Break |
2025-01-19
Part 1:
While we are assessing reproducibility of others, our work must be reproducible as well.
We will use `git`. We will show you how.
(`REPLICATION.md`)
Here are a few generic guidelines for researchers. You will be on the lookout for these things!
Structure your project
Version your project (`git`)!
Track metadata
/inputs
/outputs
/code
/paper

/datos/
    /brutos
    /limpiados
    /finales
/codigo
/articulo
It doesn’t really matter, as long as it is logical. We will get to how this translates to confidential or big data in a moment!
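As a minimal sketch of the first layout above, the directories can be created from code rather than by hand, so the structure itself is reproducible. The helper name `create_skeleton` and the project name `myproject` are assumptions made for illustration; only base R is used.

```r
# Hypothetical helper: create a project skeleton under `root`.
# The default directory names follow the first layout shown above.
create_skeleton <- function(root,
                            dirs = c("inputs", "outputs", "code", "paper")) {
  targets <- file.path(root, dirs)
  for (d in targets) {
    # recursive = TRUE also creates `root` itself if it is missing.
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(targets)
}

# Example: build the skeleton in a temporary directory and list it.
root <- file.path(tempdir(), "myproject")
create_skeleton(root)
print(list.dirs(root, full.names = FALSE, recursive = FALSE))
```

Passing a different `dirs` vector gives any other logical layout, for example the Spanish one.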
The replicator is the first (?) reader of the instructions who will need to reproduce the analysis.
It might be “Future You”!
Use programming-language specific code as much as possible
Avoid
system("unzip C:\data\myfile.zip")
or
shell unzip "C:\data\myfile.zip"
Most languages have appropriate code:
R:
unzip(zipfile, files = NULL, list = FALSE, overwrite = TRUE, junkpaths = FALSE, exdir = ".", unzip = "internal", setTimes = FALSE)
Stata:
unzipfile "zipfile.zip" [, replace]
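A minimal sketch of how the R call might be wrapped in practice, assuming the archive sits somewhere inside the project. The helper name `extract_archive` and the paths in the usage comment are placeholders, not part of any package.

```r
# Hypothetical helper: extract a zip archive with R's built-in unzip(),
# so no external "unzip" program or OS-specific shell syntax is needed.
extract_archive <- function(zipfile, exdir) {
  if (!file.exists(zipfile)) {
    stop("Archive not found: ", zipfile)
  }
  # Create the target directory if it does not exist yet.
  dir.create(exdir, recursive = TRUE, showWarnings = FALSE)
  # unzip() returns the paths of the extracted files.
  extracted <- unzip(zipfile, exdir = exdir, overwrite = TRUE)
  extracted
}

# Usage (placeholder paths):
# files <- extract_archive(file.path("data", "raw", "myfile.zip"),
#                          file.path("data", "raw"))
```

Failing early with a clear message when the archive is missing tells the replicator which input they still need to obtain.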
Use neutral pathnames (mostly forward slashes)
R: Use functions to combine paths (and/or use forward slashes), packages to make code more portable.
basepath <- rprojroot::find_root(rprojroot::has_file("README.md"))
data <- foreign::read.dta(file.path(basepath, "path", "data.dta"))
Stata: always use forward slashes, even on Windows
global data "/my/computer"
use "$data/path/data.dta"
With confidential or big data, this may no longer work:
/data/
    /raw
    /clean
    /final
/code
/article
But this might:
/project123/
    /data/
        /raw
        /clean
        /final
    /code
    /article
/confidential (read-only)
    /taxes (read-only)
    /wages (read-only)
File structure thus becomes more complex, but fundamentally not so different:
global taxdata "/confidential/taxes"
global salarydata "/confidential/wages"
global outputdata "/project/data/clean" // this is where you would write the data you create in this project
global results "/project/article" // All tables for inclusion in your paper go here
global programs "/project/code" // All programs (which you might "include") are to be found here
Or even more robust:
global basedir "/project123"
global confbase "/data/provided"
global project "$basedir/project"
global taxdata "$confbase/taxes"
global salarydata "$confbase/wages"
global outputdata "$project/data/clean" // this is where you would write the data you create in this project
global results "$project/article" // All tables for inclusion in your paper go here
global programs "$project/code" // All programs (which you might "include") are to be found here
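The same idea can be sketched in R: derive every location from two root directories, so a replicator only has to change those two values. The helper `make_paths` and the environment variable names `PROJECT_BASEDIR` and `PROJECT_CONFBASE` are assumptions made for this illustration; the default values are copied from the Stata example.

```r
# Hypothetical helper: derive every project location from two roots.
make_paths <- function(basedir, confbase) {
  project <- file.path(basedir, "project")
  list(
    taxdata    = file.path(confbase, "taxes"),
    salarydata = file.path(confbase, "wages"),
    outputdata = file.path(project, "data", "clean"), # data created in this project
    results    = file.path(project, "article"),       # tables for the paper
    programs   = file.path(project, "code")           # all programs
  )
}

# PROJECT_BASEDIR and PROJECT_CONFBASE are made-up environment variable names;
# if they are not set, the values from the Stata example are used.
paths <- make_paths(
  basedir  = Sys.getenv("PROJECT_BASEDIR",  unset = "/project123"),
  confbase = Sys.getenv("PROJECT_CONFBASE", unset = "/data/provided")
)
print(paths$outputdata)
```

Reading the roots from the environment is one possible design; editing the two defaults at the top of a single configuration script works just as well.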