Replication and Reproducibility in Social Sciences and Statistics: Overview and Practice

Lars Vilhuber
2019-10-01

Cornell University

Overview

  • High-level overview (60 minutes)
  • A very concrete example (remainder)

Replication and Reproducibility in Social Sciences and Statistics: Context, Concerns, and Concrete Measures

Paris presentation (alt)

DOI

Goals of this tutorial

  • Goal 1: Identify all the elements of a fully reproducible analysis
  • Goal 2: Be able to curate the data and code necessary for reproducible analysis
  • Goal 3: Robustness and automation - getting close to push-button reproducibility
  • Goal 4: Correctly document reproducible research

Requirements

Requirements

  • web browser
  • some R knowledge (not much)

Sub goals

  • show you enough of the toolkit to have you explore more
  • recognize (some) of the limitations
  • NOT make you a master of this today

Let's get started

Details: Goal 1: Elements of a fully reproducible analysis

Consider the AEA's suggested README and the Social Science Data Editors' guidance for verification:

Verification guidance

Details: Goal 1

Elements

  • Data (where possible)
  • Data provenance
  • Instructions
  • Code (always*)
  • Expected results
  • Persistence

Goal 1: Elements: Data (where possible)

  • Old method: send the journal a ZIP file

  • Source: Your laptop

  • Destination: random file on a journal website

Questions/ What-ifs:

  • the data is not on your laptop?
    • too big
    • on server
    • a database
  • the data is not yours to send
    • confidentiality
    • proprietary
    • other licensing issues

Goal 1: Elements: Data (where possible)

  • Old method: send the journal a ZIP file

  • Source: Your laptop

  • Destination: random file on a journal website

Questions/ What-ifs:

  • how did the data get to your laptop?
  • how did the data get generated?

These are provenance questions.

Goal 1: Elements: Data (where possible)

  • Old method: send the journal a ZIP file

  • Source: Your laptop

  • Destination: random file on a journal website

Questions/ What-ifs:

  • is the ZIP file complete?
  • are the ZIP file contents curated (preserved)?
  • can the data be re-used?
  • can the data be properly attributed to the creator?
  • can the data be found independently of the article?

These are FAIR questions.

FAIR Data Principles

  • Findable
  • Accessible
  • Interoperable
  • Reusable

FAIR Data Principles

The Example

The Census Bureau put out a blog post with data.

  • I attempted to replicate it
  • The replication itself should be replicable
  • Focus here: my replication of the blog post

The Context

the original page:

url

original page

We are going to focus on one figure

Original

original

Replicated

replicated

Let's start

scan

Elements of Goal 1

Elements

  • Data (where possible)
  • Data provenance
  • Instructions
  • Code (always*)
  • Expected results
  • Persistence

Here:

  • Data is small (< 75k) - ✔️
  • Data provenance - URL ✔️
  • Instructions - RMarkdown file ✔️
  • Code - Within Rmd ✔️
  • Expected results - Copy of Figure ✔️
  • Persistence - Github ✔️

Hold on

scan

Elements of Goal 1

Elements

  • Data (where possible)
  • Data provenance
  • Instructions
  • Code (always*)
  • Expected results
  • Persistence

Here:

  • Data is small (< 75k) - ✔️
  • Data provenance - URL ??
  • Instructions - RMarkdown file
  • Code - Within Rmd
  • Expected results - Copy of Figure
  • Persistence - Github ??

Data provenance

“Data” in this project:

  • the blog post
  • the underlying data

Where available:

  • URL
  • URL
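One lightweight way to make that provenance explicit is to fetch the data in code, so the source URL lives in the repository rather than only in your memory. A sketch; `fetch_data()` and the URL/path below are illustrative, not the actual locations used in the blog-post replication:

```r
# Hedged sketch: record data provenance by downloading inputs in code.
# The function name, URL, and destination path are hypothetical.
fetch_data <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}
# Example (hypothetical):
# fetch_data("https://example.com/jobcreation-data.csv", "data/jobcreation-data.csv")
```

Because the download is skipped when the file already exists, the script stays re-runnable even after the original URL goes stale.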

First problem

Safeguarding scientific output

The role of journals is to provide a permanent record of scientific knowledge.

  • how reliable is that record?
  • where are journals stored?
  • what if the information is not in a journal?

old library

Safeguarding scientific output

  • journals disappear, as do websites
  • paper journals are stored in libraries
  • e-journals are stored in a system called LOCKSS (Lots Of Copies Keep Stuff Safe)
  • data should be stored in repositories

stacks

These are still fallible

scan

What is NOT safeguarded

  • random URLs
  • Github repositories

Github not found

Solving the first snag

Solving the first snag

Exercise

Do it now

scan

Result

Goal 2: Be able to curate the data and code necessary for reproducible analysis

scan

Let's start... again

The replication of the original

What is Github?

git

Distributed version control system, created by Linus Torvalds in 2005

branching system

Github.com

At a high level, GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code. [1]

Also a collaboration tool when multiple people (developers, researchers) collaborate in a structured fashion on text/code/programs/etc.

Gitlab? Github? Git? What's up with that?

Gitlab logo Gitlab logo Gitlab logo

Both GitLab (and GitLab.com) and GitHub (and GitHub.com) are products providing Git repository hosting service. [1]

Also:

  • Bitbucket.com (despite no Git in the name)
  • All of these have free plans for
    • private (non-public) repositories
    • public repositories

at least for academics.

Training for Git

Many training opportunities and tutorials out there

What Github, Gitlab, etc. are NOT

While these sites make it really easy to publish your code/website/etc.

They are NOT archives. Github not found

Github etc. are transitory

Github pages, much like private websites, can be unpublished at any time:

Github unpublish

In fact, the entire code repository can be deleted at any time:

Github delete

Git(hub,lab) as a tool for reproducible research

Goal 3: Robustness and automation - getting close to push-button reproducibility

(Advanced features of Git(hub,lab) allow us to implement and test that)

Goal 4: Correctly document reproducible research

  • Git(hub,lab) allow us to freeze consistent versions of input data, code, and output as “releases”, “versions”, etc., thus making it easy to document your code

Git(hub,lab) as a tool for reproducible research

Goal 3: Robustness and automation - getting close to push-button reproducibility

(Advanced features of Git(hub,lab) allow us to implement and test that)

Goal 4: Correctly document reproducible research

  • (also respond to thesis advisor, referee, editor, curious journalist asking the question “what has changed”)

changes

Getting our hands dirty

Rather than squinting at code on the screen, let's … replicate my replication. Online. Now.

We will

  • Make a copy of the code repository
  • Verify that the original code runs, and make any changes necessary
  • Make a (permanent) copy of the code
  • Address any issues on the way

Requirements:

  • Rstudio.cloud account (alt: Google, Github)
  • Git*** account
    • alt Gitlab: Google/ Twitter/ Github / Bitbucket
    • alt Github: none
  • Zenodo account (alt: Github, ORCID)

You can delete all online materials at the end of the class.

Getting our hands dirty

Rather than squinting at code on the screen, let's … replicate my replication. Online. Now.

Rstudio.cloud

First task: Make a copy

Destination:

Gitlab logo Gitlab logo Gitlab logo

Using Gitlab: Two options

Import project from Git

gitlab import project

Fork an existing Gitlab project:

GITLAB.com/larsvilhuber/jobcreationblog

gitlab fork project

Your own Gitlab "jobcreationblog" repo

Goal 2: Be able to curate the data and code necessary for reproducible analysis

scan

Next step: Rstudio.cloud

Rstudio.cloud

Logging on to the Rstudio.cloud server

Rstudio.cloud login

Rstudio.cloud workspace

While you do that

Other cloud-based compute environments:

Rstudio.cloud

  • R-focused

MyBinder.org

  • Origins with Jupyter
  • Julia, Python, and R
  • different approach

https://codeocean.com

  • Software-agnostic
    • R
    • Python
    • Stata !
    • Matlab !
    • others
  • but always scripted
  • integrated versioning of the entire compute capsule

Creating a new project

Rstudio.cloud workspace

Rstudio.cloud new project

Rstudio.cloud new project from Github

Creating a new project from Gitlab

https://gitlab.com/larsvilhuber/jobcreationblog

Github

(replace “larsvilhuber” with your own Gitlab namespace, or your Github clone URL)

Gitlab clone button

Rstudio.cloud new project from Gitlab

Creating a new project from Gitlab

scan

Creating a new project from Gitlab

scan

Notes

You could have done the same thing on your laptop

  • you might not have (the same version of) Rstudio installed (free)
  • you might not have (the same version of) R installed (free)
  • you might have a Mac/ Windows/ Linux/ old / brand new machine

All of these are issues affecting computational reproducibility

However, they do not solve everything…

Open the README document

scan

A (solved) problem of dependencies

scan

Issues of dependencies (new)

You could have done the same thing on your laptop

  • you might not have (the same version of) Rstudio installed (free)
  • you might not have (the same version of) R installed (free)
  • you might have a Mac/ Windows/ Linux/ old / brand new machine
  • you might not have (the same version of) packages installed

Rstudio solves that for you

Go ahead, click on “install”

scan

Solving dependencies

The problem is not just in R:

  • SSC or Stata Journal packages in Stata
  • libraries or compilers in Fortran
  • Modules (paid!) in SPSS or SAS
  • packages in Python (and versions of Python!)

XKCD 1987

Solving dependencies (R)

  • use packrat or checkpoint functionality
  • declare dependencies explicitly [1]
####################################
# global libraries used everywhere #
####################################
# Package lock-in - optional
MRAN.snapshot <- "2019-01-01"
options(repos = c(CRAN = paste0("https://mran.revolutionanalytics.com/snapshot/", MRAN.snapshot)))

# Install a package only if it is missing, then load it; stop on failure
pkgTest <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dep = TRUE)
    if (!require(x, character.only = TRUE)) stop("Package not found")
  }
  return("OK")
}

global.libraries <- c("dplyr", "devtools", "rprojroot", "tictoc")
results <- sapply(as.list(global.libraries), pkgTest)

Solving dependencies (Stata)

  • install packages locally [1]
  • commit as part of the repository
// Make a path local to the project
// Also see my related config.do at
//   https://gist.github.com/larsvilhuber/6bcf4ff820285a1f1b9cfff2c81ca02b

local pwd "/c/path/to/project"
capture mkdir `pwd'/ado

sysdir set PERSONAL `pwd'/ado/personal
sysdir set PLUS     `pwd'/ado/plus
sysdir set SITE     `pwd'/ado/site

/* Now install them */
/*--- SSC packages (estout provides esttab; someprog is a placeholder) ---*/
foreach pkg in outreg estout someprog {
  ssc install `pkg'
}

Result

Goal 3: Robustness and automation - getting close to push-button reproducibility

By solving dependencies explicitly, robustness is improved.

By doing so with a dynamic function, automation is possible.
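To recap how the dynamic function enables automation, here is the `pkgTest()` helper from the R dependency script above as a minimal self-contained sketch, called on a package that ships with base R:

```r
# Recap of pkgTest() from the dependency script:
# install a package only if it is missing, then load it.
pkgTest <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dep = TRUE)
    if (!require(x, character.only = TRUE)) stop("Package not found")
  }
  return("OK")
}

# "tools" ships with base R, so this loads it without installing anything
pkgTest("tools")  # returns "OK"
```

Because the function is a no-op when the package is already available, the same script runs unchanged on a fresh machine and on one that is fully set up.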

Goal 4: Correctly document reproducible research

Documenting dependencies is a critical part of reproducible research.

Packages installed?

Add text to the document:

# hidden dependency, will install packages that are needed
source("global-config.R",echo=FALSE)

and adjust global-config.R to also list knitcitations
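The adjustment to global-config.R is a one-line change to the package list (a sketch; the rest of the script stays as shown earlier):

```r
# In global-config.R: add knitcitations to the list of required packages
global.libraries <- c("dplyr", "devtools", "rprojroot", "tictoc", "knitcitations")
```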

Click on “Knit”

rendering errors

Dependencies again

scan

And another problem (maybe)

Enable popups for this site:

scan

Problem solved NOW?

You should have seen a pop-up window with the compiled text

  • do the graphs look the same?
  • does the text look the same?

Success!

Question:

Are we done?

Not quite…

Important

  • record any changes (Goal 4)
  • how permanent is the data we are using? (Goal 2)
  • how permanent is my document? (Goal 2)

Useful

  • how can others easily see my latest version? (Goal 3)

Next steps

Recording and documenting changes

  • using Git**b!

Making the data permanent

  • using Zenodo again

Making the page more permanent

Making the page more accessible

Recording changes

You're almost there

Because we used Git**b, we have the changes already under control:

Rstudio top right

Commit all the changes

Rstudio git

Commit all the changes

Rstudio git boxes checked

Review the changes

Rstudio review

Push back to repository

Rstudio git push

Compare the changes: Version Control

Since we used Gitlab, you can compare the changes: https://gitlab.com/larsvilhuber/jobcreationblog/compare/v1.0...master?view=parallel

scan

This is one way to recover changes, and then verbosely describe them to the editor/thesis advisor/etc.!

Lessons learned

Goal 1: Identify all the elements of a fully reproducible analysis

Data, source document, dependencies

Goal 2: Be able to curate the data and code necessary for reproducible analysis

Today:

  • source document
  • Gitlab

Tomorrow:

  • input data
  • output document

Goal 3: Robustness and automation - getting close to push-button reproducibility

  • Rmarkdown document has code, text, and figures
  • Dependencies identified, addressed

Goal 4: Correctly document reproducible research

  • Gitlab version control to document changes
  • Documenting dependencies for clarity

Thank you for today