Lars Vilhuber
2019-10-01
Cornell University
Consider the AEA's suggested README and the Social Science Data Editors' guidance for verification:
Old method: send the journal a ZIP file
Source: Your laptop
Destination: random file on a journal website
Questions/ What-ifs:
These are provenance questions.
These are FAIR questions.
The Census Bureau put out a blog post with data.
the replication project page: https://larsvilhuber.github.io/jobcreationblog/README.html
(Figures: original and replicated.)
“Data” in this project:
Where available:
The role of journals is to provide a permanent record of scientific knowledge.
Goal 2: Be able to curate the data and code necessary for reproducible analysis
the project page:
the code behind it: https://github.com/larsvilhuber/jobcreationblog
git
Distributed version control system, created by Linus Torvalds in 2005
Github.com
At a high level, GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code. [1]
Also a collaboration tool when multiple people (developers, researchers) collaborate in a structured fashion on text/code/programs/etc.
Both GitLab (and GitLab.com) and GitHub (and GitHub.com) are products providing Git repository hosting service. [1]
Also:
at least for academics.
Many training opportunities and tutorials out there
While these sites make it really easy to publish your code, website, etc., they are NOT archives.
GitHub Pages sites, much like private websites, can be unpublished at any time:
In fact, the entire code repository can be deleted at any time:
Goal 3: Robustness and automation - getting close to push-button reproducibility
(Advanced features of Git(hub,lab) allow us to implement and test that)
Goal 4: Correctly document reproducible research
Rather than squint at code on the screen, let's … replicate my replication. Online. Now.
We will
Requirements:
You can delete all online materials at the end of the class.
Import project from Git
Fork an existing Gitlab project:
GITLAB.com/larsvilhuber/jobcreationblog
Goal 2: Be able to curate the data and code necessary for reproducible analysis
Other cloud-based compute environments:
(replace “larsvilhuber” with your own GitLab namespace, or use your GitHub clone URL)
However, they do not solve everything…
The problem is not just in R:
packrat or checkpoint functionality
####################################
# global libraries used everywhere #
####################################
# Package lock-in (optional): pin CRAN to a dated MRAN snapshot
MRAN.snapshot <- "2019-01-01"
options(repos = c(CRAN = paste0("https://mran.revolutionanalytics.com/snapshot/", MRAN.snapshot)))

# Load package x; install it first if it is not yet available
pkgTest <- function(x)
{
  if (!require(x, character.only = TRUE))
  {
    install.packages(x, dependencies = TRUE)
    if (!require(x, character.only = TRUE)) stop("Package not found")
  }
  return("OK")
}

global.libraries <- c("dplyr", "devtools", "rprojroot", "tictoc")
results <- sapply(as.list(global.libraries), pkgTest)
// Make a path local to the project
// Also see my related config.do at
// https://gist.github.com/larsvilhuber/6bcf4ff820285a1f1b9cfff2c81ca02b
local pwd "/c/path/to/project"
capture mkdir "`pwd'/ado"
// Point Stata's ado directories at project-local folders
sysdir set PERSONAL "`pwd'/ado/personal"
sysdir set PLUS "`pwd'/ado/plus"
sysdir set SITE "`pwd'/ado/site"
/* Now install them */
/*--- SSC packages ---*/
// esttab is distributed on SSC as part of the estout package;
// someprog is a placeholder name
foreach pkg in outreg estout someprog {
    ssc install `pkg'
}
Goal 3: Robustness and automation - getting close to push-button reproducibility
By solving dependencies explicitly, robustness is improved.
By doing so with a dynamic function, automation is possible.
Goal 4: Correctly document reproducible research
Documenting dependencies is a critical part of reproducible research.
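One way to document dependencies is to record the installed version of each package next to the code. The following is a minimal sketch, not part of the original project: the helper name document_packages is hypothetical, and base packages are used only so that the example runs without installing anything.

```r
# Hypothetical helper (not from the original project): return a table of
# package names and their installed versions, suitable for a README.
document_packages <- function(pkgs) {
  versions <- vapply(
    pkgs,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  data.frame(package = pkgs, version = unname(versions),
             stringsAsFactors = FALSE)
}

# Base packages are used here so the sketch runs anywhere;
# in the project, one would pass global.libraries instead.
deps <- document_packages(c("stats", "utils"))
print(deps)

# The R version itself is also a dependency worth recording.
print(R.version.string)
```

The resulting table can be pasted into the README, or written to a file with write.csv() and committed alongside the code.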
Add text to the document:
# hidden dependency, will install packages that are needed
source("global-config.R",echo=FALSE)
and adjust global-config.R to also list knitcitations
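As a sketch, assuming global-config.R still defines global.libraries as shown earlier, the adjusted line might look like this:

```r
# In global-config.R: knitcitations added to the packages handled by pkgTest
global.libraries <- c("dplyr", "devtools", "rprojroot", "tictoc", "knitcitations")
```

The existing sapply() call over global.libraries then picks up the new package without further changes.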
Enable popups for this site:
Because we used Git(hub,lab), the changes are already under version control:
Since we used GitLab, you can compare the changes: https://gitlab.com/larsvilhuber/jobcreationblog/compare/v1.0...master?view=parallel
This is one way to recover changes, and then verbosely describe them to the editor/thesis advisor/etc.!
Data, source document, dependencies
Today:
Tomorrow: