Replication and Reproducibility in Social Sciences and Statistics: Day 2

Lars Vilhuber
2019-10-02

Cornell University

When we stopped

Recording and documenting changes

  • using GitHub/GitLab! ✔️

Making the data permanent

  • using Zenodo again

Making the page more permanent

Making the page more accessible

How permanent is the data?

The data is obtained from a Census Bureau website.

  • The website http://www2.census.gov/ces/bds/ might be re-organized and disappear
  • The data format might change
  • The API might change
  • We only need two small chunks of code

Making the data more permanent

We will use Zenodo, but the other repositories mentioned earlier are just as good!

  • We will upload manually, but there's also an API (sketched below)
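
For reference, a minimal sketch of an API-based deposit using the httr package; the token, host, and file names below are placeholders, not part of the replication itself:

library(httr)
token <- "YOUR-SANDBOX-TOKEN"                 # personal access token (placeholder)
base  <- "https://sandbox.zenodo.org/api"     # use https://zenodo.org/api for a real deposit

# 1. Create an empty deposition
r <- POST(paste0(base, "/deposit/depositions"),
          query = list(access_token = token),
          body = "{}", content_type_json())
dep <- content(r)

# 2. Attach a data file to the deposition (file name and path are hypothetical)
POST(paste0(base, "/deposit/depositions/", dep$id, "/files"),
     query = list(access_token = token),
     body = list(name = "bds_data.csv",
                 file = upload_file("data/bds_data.csv")))

# 3. Add metadata and publish, either in the web interface or via further API calls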

[Image: Zenodo]

Getting started on Zenodo

We will NOT use the regular Zenodo; rather, we will test in the Sandbox.

https://sandbox.zenodo.org/

Check your URL bar! There's no other indication that this is not the real Zenodo!
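
One practical note: the Sandbox exposes the same REST API under its own host, so any scripted test needs to point there (the variable name is just illustrative):

# Records on the Sandbox live under a separate API host
sandbox.api <- "https://sandbox.zenodo.org/api/records/"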

Do it now

[Image: scan]

Result

[Image: Zenodo deposit]

Result

[Image: Zenodo deposit]

DOI and Linkages

[Image: Zenodo deposit]

Result

Goal 2: Be able to curate the data and code necessary for reproducible analysis

We have archived unreliably hosted data in a reliable location. ✔️

Adding configuration information

We can now use the following information to augment the replication:

# Zenodo DOI prefix
zenodo.prefix <- "10.5281/zenodo"
# Specific DOI - resolves to a fixed version
zenodo.id <- "2649598"
# We will recover the rest from the Zenodo API
zenodo.api <- "https://zenodo.org/api/records/"
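
For example, the two pieces combine into the full, citable DOI:

# Full DOI of the deposit: 10.5281/zenodo.2649598
zenodo.doi <- paste(zenodo.prefix, zenodo.id, sep = ".")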

(Behind the scenes)

We will parse the information that Zenodo gives us through an API:

https://zenodo.org/api/records/2649598

[Image: Zenodo API]

Automating the data acquisition

# Needs the rjson, tidyr, and dplyr packages
library(rjson); library(tidyr); library(dplyr)

We download the metadata from the API:

download.file(paste0(zenodo.api,zenodo.id),destfile=file.path(dataloc,"metadata.json"))

We read the JSON in:

latest <- fromJSON(file=file.path(dataloc,"metadata.json"))
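
A quick sanity check on what we retrieved (field names follow the Zenodo records API):

latest$doi              # should resolve to 10.5281/zenodo.2649598
latest$metadata$title   # title of the deposit
length(latest$files)    # number of files in this version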

We get the links to the actual CSV files (and the codebook):

file.list <- as.data.frame(latest$files) %>% select(starts_with("self")) %>% gather()

We download all the CSV files by checking whether the filename contains "csv":

for ( value in file.list$value ) {
    print(value)
    if ( grepl("csv", value) ) {
        print("Downloading...")
        file.name <- basename(value)
        download.file(value, destfile = file.path(dataloc, file.name))
    }
}
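
A simple check that the files actually arrived (optional):

# Verify that at least one CSV file now sits in dataloc
csv.files <- list.files(dataloc, pattern = "\\.csv$")
print(csv.files)
stopifnot(length(csv.files) > 0)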

Adding it to your copy of the code

You can now add this to your copy of the code:

Result

Goal 3: Robustness and automation - getting close to push-button reproducibility

Hold on

[Image: scan]

Another goal

Goal 4: Correctly document reproducible research

And: not make your collaborators mad…

Making code changes cautiously (branching)

If we want to incorporate the Zenodo data

We could

  • make all the changes right away
  • possibly mess up the live site or the latest version of the paper?
  • maybe annoy our co-authors?

But we used a version control system with branching!

We instead (the corresponding git commands are sketched below)

  • create a new branch zenodo
  • make all the changes there
  • can compare the changes to the main branch
  • consult with our co-authors before pulling the changes back into the main branch
  • our live site/paper remains valid the entire time
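
For reference, a sketch of the underlying git commands, wrapped in R system() calls so they can be run from the console; the default branch is assumed to be master and the file name is illustrative:

system("git checkout -b zenodo")            # create and switch to the new branch
# ... edit the code, then record the changes on that branch
system("git add README.Rmd")                # the file name is an assumption
system('git commit -m "Pull BDS data from Zenodo"')
system("git diff master..zenodo")           # compare against the main branch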

Do it now

[Image: scan]

Compare the changes: Version Control

Compare the changes: Version Control

We could then proceed to incorporate (pull or merge) the changes into the main repository:

[Image: scan]
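
Locally, that merge boils down to something like the following (again assuming the default branch is master; on Gitlab/Github this would usually go through a merge/pull request):

system("git checkout master")       # switch back to the main branch
system("git merge zenodo")          # bring in the reviewed changes
system("git push origin master")    # update the remote repository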

Read more about it at

Final result

The final result

  • will pull data from Zenodo
  • will reliably reproduce the graph as presented today
  • will use citable data (DOI = 10.5281/zenodo.2649598)
  • will have been achieved using replicable methods (before/after)

Lessons learned

Goal 1: Identify all the elements of a fully reproducible analysis

Data, source document, dependencies

Goal 2: Be able to curate the data and code necessary for reproducible analysis

So far:

  • source document
  • Gitlab
  • input data ✔️

Still left:

  • output document

Goal 3: Robustness and automation - getting close to push-button reproducibility

  • Rmarkdown document has code, text, and figures
  • Dependencies identified, addressed
  • Download automated ✔️

Goal 4: Correctly document reproducible research

  • Gitlab version control to document changes
  • Documenting dependencies for clarity
  • Incorporating changes in a transparent (reproducible) way ✔️

Now

[Image: scan]

Next steps

  • Making your research visible
  • Archiving your research

Making your research visible

Github Pages and Gitlab Pages are an easy way to publish project pages

  • You've already seen one: the original replication project page:

https://larsvilhuber.github.io/jobcreationblog/README.html

[Image: replicated page]

How do Github Pages work?

  • Will display “static” web pages
    • like the HTML page generated by the Knit button (see the sketch below)
  • Needs to be configured from the Settings

[Image: settings]
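
The page served by Github/Gitlab Pages is simply the knitted HTML, e.g. (file names are illustrative):

# Re-generate the static HTML that Pages will serve
rmarkdown::render("README.Rmd", output_file = "README.html")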

Now we have a web page!

Didn't we say those are not archives?

How to create an archive from your research project

[Image: zenodo_github]

Create a release

[Image: zenodo_github]

Create a release

[Image: zenodo_github]

Automatically creates an archive on Zenodo

![release](images/Github_Releases_2.png)

![release](images/Github_Zenodo_4.png)

Final result

The final result

  • pulls data from Zenodo
  • reliably reproduces the graph as presented today
  • uses citable data (DOI = 10.5281/zenodo.2649598)
  • was achieved using replicable methods (before/after is viewable)
  • is citable itself (DOI = 10.5281/zenodo.400356)
  • is accessible (https://larsvilhuber.github.io/jobcreationblog/README.html)

Lessons learned

Goal 1: Identify all the elements of a fully reproducible analysis

Data, source document, dependencies

Goal 2: Be able to curate the data and code necessary for reproducible analysis

  • source document
  • Gitlab
  • input data
  • output document ✔️

Goal 3: Robustness and automation - getting close to push-button reproducibility

  • Rmarkdown document has code, text, and figures
  • Dependencies identified, addressed
  • Download automated

Goal 4: Correctly document reproducible research

  • Gitlab version control to document changes
  • Documenting dependencies for clarity
  • Incorporating changes in a transparent (reproducible) way

Conclusion

Conclusion

Replication can be a lot of work

We've touched on

  • Replication per se
  • Replicable documents
  • Possible pitfalls of software dependencies
  • Cloud computing platforms
  • Permanence of source material (website, data) and how to solve it

[Image: project]

Conclusion

We have not covered everything

… because there can be a lot more

  • High-performance computing (length, quantity, throughput)
  • Issues with commercial (paid) software (access, permanence)
  • Data that is not public-use
  • Data in a locked room

[Image: SafePODS]

Thank you