Replication and Reproducibility in Social Sciences and Statistics: Overview and Practice

Lars Vilhuber
2019-10-01

Cornell University

Overview

High-level overview (60:00)
A very concrete example (remainder)

Alternative title: Replication and Reproducibility in Social Sciences and Statistics: Context, Concerns, and Concrete Measures

Goals of this tutorial

Goal 1: Identify all the elements of a fully reproducible analysis
Goal 2: Be able to curate the data and code necessary for reproducible analysis
Goal 3: Robustness and automation - getting close to push-button reproducibility
Goal 4: Correctly document reproducible research

Requirements

web browser
some R knowledge (not much)

Sub goals

show you enough of the toolkit to have you explore more
recognize (some of) the limitations
NOT make you a master of this today

Let's get started

Details: Goal 1: Elements of a fully reproducible analysis

Consider the AEA's suggested README and the Social Science Data Editors' guidance for verification:

Details: Goal 1

Elements

Data (where possible)
Data provenance
Instructions
Code (always*)
Expected results
Persistence
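
As a rough illustration of the list above, the sketch below checks whether a project folder contains the elements that can be detected by file presence. The function name `check_elements` and the file and folder names (`README.md`, `data`, `code`, `results`) are assumptions made for this example, not a standard layout. Data provenance and persistence cannot be verified this way and still need to be documented by hand.

```r
# Hypothetical layout: the file and folder names below are assumptions,
# not a standard; adapt them to your own project.
check_elements <- function(project_dir) {
  expected <- c(
    instructions     = "README.md",
    data             = "data",
    code             = "code",
    expected_results = "results"
  )
  # file.exists() is TRUE for both files and directories
  present <- file.exists(file.path(project_dir, expected))
  names(present) <- names(expected)
  present
}

# Usage: build a toy project in a temporary folder and check it.
demo_dir <- file.path(tempdir(), "demo-project")
dir.create(file.path(demo_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "code"), showWarnings = FALSE)
writeLines("How to run this analysis", file.path(demo_dir, "README.md"))

status <- check_elements(demo_dir)
print(status)
```

In the toy project the `results` folder is deliberately missing, so the check flags the expected results as absent.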

Goal 1: Elements: Data (where possible)

Old method: send the journal a ZIP file
Source: Your laptop
Destination: random file on a journal website

Questions/ What-ifs:

the data is not on your laptop?
- too big
- on server
- a database
the data is not yours to send
- confidentiality
- proprietary
- other licensing issues

how did the data get to your laptop?
how did the data get generated?

These are provenance questions.

is the ZIP file complete?
are the ZIP file contents curated (preserved)?
can the data be re-used?
can the data be properly attributed to the creator?
can the data be found independently of the article?

These are FAIR questions
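
One partial answer to "is the ZIP file complete?" is to ship a manifest of file names and checksums with the deposit. The sketch below is a minimal illustration using base R only. The function names `make_manifest` and `verify_manifest` and the toy file contents are assumptions for this example, and MD5 is used only to detect accidental changes, not as a security measure.

```r
# Build a manifest of relative paths, sizes, and MD5 checksums.
make_manifest <- function(dir) {
  files <- sort(list.files(dir, recursive = TRUE))
  full <- file.path(dir, files)
  data.frame(
    file = files,
    bytes = file.size(full),
    md5 = unname(tools::md5sum(full)),
    stringsAsFactors = FALSE
  )
}

# Compare a folder against a previously saved manifest.
verify_manifest <- function(dir, manifest) {
  current <- make_manifest(dir)
  identical(current$file, manifest$file) &&
    identical(current$md5, manifest$md5)
}

# Usage with a toy deposit (made-up contents) in a temporary folder.
deposit <- file.path(tempdir(), "demo-deposit")
dir.create(deposit, showWarnings = FALSE)
writeLines("year,jobs\n2014,100", file.path(deposit, "data.csv"))
manifest <- make_manifest(deposit)

ok_before <- verify_manifest(deposit, manifest)   # nothing changed yet
writeLines("year,jobs\n2014,999", file.path(deposit, "data.csv"))
ok_after <- verify_manifest(deposit, manifest)    # file was altered
```

A manifest like this addresses completeness only. Attribution and findability still depend on depositing the data in a repository.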

FAIR Data Principles

Findable
Accessible
Interoperable
Re-Usable

The Example

The Census Bureau put out a blog post with data.

I attempted to replicate it
The replication itself should be replicable
Focus here: my replication of the blog post

http://researchmatters.blogs.census.gov/2016/12/01/how-much-do-startups-impact-employment-growth-in-the-u-s/

[Image: original page]

The Context

the original page:

[Image: original page]

the replication project page: https://larsvilhuber.github.io/jobcreationblog/README.html

[Image: replicated page]

We are going to focus on 1 figure

Original

[Image: original figure]

Replicated

[Image: replicated figure]

Let's start

Elements of Goal 1

Elements

Data (where possible)
Data provenance
Instructions
Code (always*)
Expected results
Persistence

Here:

Data is small (< 75k) - ✔️
Data provenance - URL ✔️
Instructions - RMarkdown file ✔️
Code - Within Rmd ✔️
Expected results - Copy of Figure ✔️
Persistence - Github ✔️

Hold on

Here:

Data is small (< 75k) - ✔️
Data provenance - URL ??
Instructions - RMarkdown file
Code - Within Rmd
Expected results - Copy of Figure
Persistence - Github ??

Data provenance

“Data” in this project:

the blog post
the underlying data

Where available:

First problem

the original page: http://researchmatters.blogs.census.gov/2016/12/01/how-much-do-startups-impact-employment-growth-in-the-u-s/

oops

Safeguarding scientific output

The role of journals is to provide a permanent record of scientific knowledge.

how reliable is that record?
where are journals stored?
what if the information is not in a journal?

[Image: old library]

Safeguarding scientific output

journals disappear, as do websites
paper journals are stored in libraries
e-journals are stored in a system called LOCKSS (Lots of Copies Keep Stuff Safe)
data should be stored in repositories

[Image: library stacks]

These are still fallible

What is NOT safeguarded

random URLs
Github repositories

[Image: Github not found]

Solving the first snag

We use the Internet Archive to archive websites.

Original URL: http://researchmatters.blogs.census.gov/2016/12/01/how-much-do-startups-impact-employment-growth-in-the-u-s/

Archived copy: https://web.archive.org/web/20161229210623/http://researchmatters.blogs.census.gov/2016/12/01/how-much-do-startups-impact-employment-growth-in-the-u-s/

Exercise

Pick a random website
- Stephen Fienberg - http://www.stat.cmu.edu/~fienberg/
- CIQSS - https://www.ciqss.org/a-propos
- etc.
Figure out if there is a Web Archive copy of it
- Useful: WayBack Machine Firefox Plugin
If not, get the Web Archive (aka WayBack Machine) to create a copy
Cite the Web Archive version of the page

Do it now
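
The exercise can also be scripted. The sketch below only builds the relevant addresses and does not contact the network. It assumes the Internet Archive's availability endpoint at `https://archive.org/wayback/available` and the `https://web.archive.org/web/<timestamp>/<url>` pattern seen in the archived blog post address. The function names `wayback_query` and `wayback_snapshot` are made up for this example.

```r
# Query URL for the Internet Archive availability API.
# The optional timestamp (YYYYMMDDhhmmss) asks for the closest snapshot.
wayback_query <- function(url, timestamp = NULL) {
  query <- paste0("https://archive.org/wayback/available?url=",
                  utils::URLencode(url, reserved = TRUE))
  if (!is.null(timestamp)) {
    query <- paste0(query, "&timestamp=", timestamp)
  }
  query
}

# Citable address of a specific snapshot.
wayback_snapshot <- function(url, timestamp) {
  paste0("https://web.archive.org/web/", timestamp, "/", url)
}

blog <- paste0("http://researchmatters.blogs.census.gov/2016/12/01/",
               "how-much-do-startups-impact-employment-growth-in-the-u-s/")
query_url <- wayback_query(blog)
cite_url  <- wayback_snapshot(blog, "20161229210623")
print(query_url)
print(cite_url)

# To actually ask the archive (requires internet access), something like
#   readLines(query_url, warn = FALSE)
# should return a short JSON description of the closest snapshot, if any.
```

If no snapshot exists, the Wayback Machine's "Save Page Now" form at https://web.archive.org/save can be used to request one.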

Result

Goal 2: Be able to curate the data and code necessary for reproducible analysis

Let's start... again

The replication of the original

the project page:

replicated page

the code behind it: https://github.com/larsvilhuber/jobcreationblog

Github

What is Github?

git

Distributed version control system, created by Linus Torvalds in 2005

[Image: branching system]

Github.com

At a high level, GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code. [1]

Also a collaboration tool when multiple people (developers, researchers) collaborate in a structured fashion on text/code/programs/etc.

Gitlab? Github? Git? What's up with that?

Gitlab logo

Both GitLab (and GitLab.com) and GitHub (and GitHub.com) are products providing Git repository hosting service. [1]

Also:

Bitbucket.com (despite no Git in the name)
All of these have free plans for
- private (non-public) repositories
- public repositories

at least for academics.

Training for Git

Many training opportunities and tutorials out there

What Github, Gitlab, etc. are NOT

While these sites make it really easy to publish your code/website/etc.

They are NOT archives.

[Image: Github not found]

Github etc. are transitory

Github pages, much like private websites, can be unpublished at any time:

[Image: Github unpublish]

In fact, the entire code repository can be deleted at any time:

[Image: Github delete]

Git(hub,lab) as a tool for reproducible research

Goal 3: Robustness and automation - getting close to push-button reproducibility

(Advanced features of Git(hub,lab) allow us to implement and test that)

Goal 4: Correctly document reproducible research

Git(hub,lab) allow us to freeze consistent versions of input data, code, and output as “releases”, “versions”, etc., thus making it easy to document your code

(and to respond to a thesis advisor, referee, editor, or curious journalist asking “what has changed?”)

[Image: changes]

Getting our hands dirty

Rather than squint at code on the screen, let's … replicate my replication. Online. Now.

We will

Make a copy of the code repository
Verify that the original code runs, and make any changes necessary
Make a (permanent) copy of the code
Address any issues on the way

Requirements:

Rstudio.cloud account (alt: Google, Github)
Git*** account
- alt Gitlab: Google/ Twitter/ Github / Bitbucket
- alt Github: none
Zenodo account (alt: Github, ORCID)

You can delete all online materials at the end of the class.

Go to https://rstudio.cloud

First task: Make a copy

Source: https://github.com/larsvilhuber/jobcreationblog

Github

Destination:

Using Gitlab: Two options

Option 1: Import project from Git

[Image: gitlab import project]

Option 2: Fork an existing Gitlab project: https://gitlab.com/larsvilhuber/jobcreationblog

[Image: gitlab fork project]

Your own Gitlab "jobcreationblog" repo

Goal 2: Be able to curate the data and code necessary for reproducible analysis

Next step: Rstudio.cloud

Logging on to the Rstudio.cloud server

[Image: Rstudio.cloud login]

[Image: Rstudio.cloud workspace]

While you do that

Other cloud-based compute environments:

Rstudio.cloud

R-focused

MyBinder.org

Origins with Jupyter
Julia, Python, and R
different approach

https://codeocean.com

Software-agnostic
- R
- Python
- Stata !
- Matlab !
- others
but always scripted
integrated versioning of the entire compute capsule

Creating a new project

[Image: Rstudio.cloud workspace]

[Image: Rstudio.cloud new project]

[Image: Rstudio.cloud new project from Github]

Creating a new project from Gitlab

https://gitlab.com/larsvilhuber/jobcreationblog

(replace “larsvilhuber” with your own Gitlab name space, or your Github clone URL)

[Image: Gitlab clone button]

[Image: Rstudio.cloud new project from Gitlab]

Notes

You could have done the same thing on your laptop, but:

you might not have (the same version of) Rstudio installed (free)
you might not have (the same version of) R installed (free)
you might have a Mac/ Windows/ Linux/ old / brand new machine

All of these are issues affecting computational reproducibility
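
Since the R version and operating system matter, it helps to record them alongside the results. The sketch below is one minimal way to do so with base R. The file name `environment.txt` and the choice of fields are assumptions for this example. The RStudio version is not captured here because it is not available from plain R.

```r
# Collect basic facts about the compute environment.
environment_record <- c(
  r_version = R.version.string,
  platform  = R.version$platform,
  os        = Sys.info()[["sysname"]],
  recorded  = format(Sys.Date())
)
print(environment_record)

# Save the record next to the analysis (here: a temporary folder).
record_file <- file.path(tempdir(), "environment.txt")
writeLines(paste(names(environment_record), environment_record, sep = ": "),
           record_file)
```

Committing such a file with the code gives a later replicator a starting point when results differ.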

Cloud environments like Rstudio.cloud address these issues. However, they do not solve everything…

Open the README document

A (solved) problem of dependencies

Issues of dependencies (new)

you might not have (the same version of) packages installed

Rstudio solves that for you

Go ahead, click on “install”

Solving dependencies

The problem is not just in R:

SSC or Stata Journal packages in Stata
libraries or compilers in Fortran
Modules (paid!) in SPSS or SAS
packages in Python (and versions of Python!)

Solving dependencies (R)

use packrat or checkpoint functionality
declare dependencies explicitly [1]

####################################
# global libraries used everywhere #
####################################
# Package lock-in - optional: install from a dated CRAN snapshot.
# Note: MRAN was retired in 2023; if this URL no longer resolves,
# point CRAN at another dated snapshot service.
MRAN.snapshot <- "2019-01-01"
options(repos = c(CRAN = paste0("https://mran.revolutionanalytics.com/snapshot/", MRAN.snapshot)))

# Load package x, installing it (with its dependencies) first if necessary.
pkgTest <- function(x)
{
        if (!require(x, character.only = TRUE))
        {
                install.packages(x, dependencies = TRUE)
                if (!require(x, character.only = TRUE)) stop("Package not found: ", x)
        }
        return("OK")
}

global.libraries <- c("dplyr", "devtools", "rprojroot", "tictoc")
results <- sapply(global.libraries, pkgTest)

Solving dependencies (Stata)

install packages locally [1]
commit as part of the repository

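// Make package installation local to the project
// Also see my related config.do at 
//   https://gist.github.com/larsvilhuber/6bcf4ff820285a1f1b9cfff2c81ca02b

// Adjust this path to the root of your project
local pwd "/c/path/to/project"
capture mkdir "`pwd'/ado"

// Redirect Stata's ado directories into the project folder
sysdir set PERSONAL "`pwd'/ado/personal"
sysdir set PLUS     "`pwd'/ado/plus"
sysdir set SITE     "`pwd'/ado/site"

/* Now install them */
/*--- SSC packages ---*/
// esttab is distributed as part of the estout package;
// someprog is a placeholder for any other package you need
foreach pkg in outreg estout someprog {
  ssc install `pkg'
}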

Result

Goal 3: Robustness and automation - getting close to push-button reproducibility

By solving dependencies explicitly, robustness is improved.

By doing so with a dynamic function, automation is possible.

Goal 4: Correctly document reproducible research

Documenting dependencies is a critical part of reproducible research.
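
One simple way to document dependencies is to write the installed version of each declared package to a file. The sketch below is an illustration under assumptions: the function name `package_versions` and the file name `dependencies.csv` are made up, and the example uses packages that ship with R so that it runs anywhere. In this project the vector would be the list of libraries declared in the configuration file.

```r
# Look up the installed version of each declared package (NA if missing).
package_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
}

# Packages that ship with R, used here only so the sketch runs anywhere.
declared <- c("stats", "utils", "tools")
versions <- package_versions(declared)
print(versions)

# Save the table next to the analysis (here: a temporary folder).
deps_file <- file.path(tempdir(), "dependencies.csv")
write.csv(data.frame(package = names(versions), version = versions),
          deps_file, row.names = FALSE)
```

A missing package shows up as `NA`, which is itself useful information for whoever tries to rerun the analysis.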

Packages installed?

Add text to the document:

# hidden dependency, will install packages that are needed
source("global-config.R",echo=FALSE)

and adjust global-config.R to also list knitcitations
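
The adjustment amounts to adding one name to the package vector. The sketch below assumes the vector is called `global.libraries`, as in the configuration listing shown earlier. In the actual file you would simply edit the `c(...)` line.

```r
# Vector as it appears in the configuration listing shown earlier
global.libraries <- c("dplyr", "devtools", "rprojroot", "tictoc")

# Add the package the README needs; union() avoids a duplicate entry
global.libraries <- union(global.libraries, "knitcitations")
print(global.libraries)
```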

Click on “Knit”

rendering errors

Dependencies again

And another problem (maybe)

Enable popups for this site:

Problem solved NOW?

You should have seen a pop-up window with the compiled text

do the graphs look the same?
does the text look the same?
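
Eyeballing the graphs is a start, but the numbers behind them can be compared programmatically. The sketch below uses made-up numbers as stand-ins for an archived table and a regenerated one. The tolerance of `1e-8` is an assumption and should be chosen to suit the analysis.

```r
# Toy stand-ins (made-up numbers) for an archived table and a fresh rerun.
expected    <- data.frame(year = 2012:2014,
                          share = c(0.111, 0.105, 0.102))
regenerated <- data.frame(year = 2012:2014,
                          share = c(0.111, 0.105, 0.102 + 1e-10))

# Numerical equality up to a tolerance: robust to tiny floating-point noise.
same_within_tolerance <- isTRUE(all.equal(expected, regenerated,
                                          tolerance = 1e-8))

# Exact equality: often too strict across machines and software versions.
byte_identical <- identical(expected, regenerated)

print(c(within_tolerance = same_within_tolerance, exact = byte_identical))
```

The two checks disagree here on purpose: small floating-point differences across platforms are common, so a tolerance-based comparison is usually the more informative test.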

Question:

Are we done?

Not quite…

Important

record any changes (Goal 4)
how permanent is the data we are using? (Goal 2)
how permanent is my document? (Goal 2)

Useful

how can others easily see my latest version? (Goal 3)

Next steps

Recording and documenting changes

using Git**b!

Making the data permanent

using Zenodo again

Making the page more permanent

Using Zenodo

Making the page more accessible

on Github
on Gitlab

Recording changes

You're almost there

Because we used Git**b, we have the changes already under control:

Commit all the changes

Review the changes

Push back to repository

Compare the changes: Version Control

Since we used Gitlab, you can compare the changes: https://gitlab.com/larsvilhuber/jobcreationblog/compare/v1.0...master?view=parallel

This is one way to recover changes, and then verbosely describe them to the editor/thesis advisor/etc.!

Lessons learned

Goal 1: Identify all the elements of a fully reproducible analysis

Data, source document, dependencies

Goal 2: Be able to curate the data and code necessary for reproducible analysis

Today:

source document
Gitlab

Tomorrow:

input data
output document

Goal 3: Robustness and automation - getting close to push-button reproducibility

Rmarkdown document has code, text, and figures
Dependencies identified, addressed

Goal 4: Correctly document reproducible research

Gitlab version control to document changes
Documenting dependencies for clarity

Thank you for today

Presentation: https://labordynamicsinstitute.github.io/replication-tutorial-2019

Source: https://github.com/labordynamicsinstitute/replication-tutorial-2019

Next day: here

CC-BY-4.0 Creative Commons Attribution-NonCommercial 4.0 International Public License