Replication and Reproducibility in Social Sciences and Statistics: Day 2

Lars Vilhuber
2019-10-02

Cornell University

When we stopped

Recording and documenting changes

  • using GitHub/GitLab! ✔️

Making the data permanent

  • using Zenodo again

Making the page more permanent

Making the page more accessible

How permanent is the data?

The data is obtained from a Census Bureau website.

  • The website http://www2.census.gov/ces/bds/ might be re-organized and disappear
  • The data format might change
  • The API might change
  • We only need two small chunks of code

Making the data more permanent

We will use Zenodo, but the other repositories mentioned earlier are just as good!

  • We will upload manually, but there's also an API (sketched below)
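
For reference, a minimal sketch of an API-based deposit using the httr package; the token, host, and file names below are placeholders, not part of the replication itself:

library(httr)
token <- "YOUR-SANDBOX-TOKEN"                 # personal access token (placeholder)
base  <- "https://sandbox.zenodo.org/api"     # use https://zenodo.org/api for a real deposit

# 1. Create an empty deposition
r <- POST(paste0(base, "/deposit/depositions"),
          query = list(access_token = token),
          body = "{}", content_type_json())
dep <- content(r)

# 2. Attach a data file to the deposition (file name and path are hypothetical)
POST(paste0(base, "/deposit/depositions/", dep$id, "/files"),
     query = list(access_token = token),
     body = list(name = "bds_data.csv",
                 file = upload_file("data/bds_data.csv")))

# 3. Add metadata and publish, either in the web interface or via further API calls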

[Image: Zenodo]

Getting started on Zenodo

We will NOT use the regular Zenodo; rather, we will test in the Sandbox.

https://sandbox.zenodo.org/

Check your URL bar! There's no other indication that this is not the real Zenodo!
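
One practical note: the Sandbox exposes the same REST API under its own host, so any scripted test needs to point there (the variable name is just illustrative):

# Records on the Sandbox live under a separate API host
sandbox.api <- "https://sandbox.zenodo.org/api/records/"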

Do it now

[Image: scan]

Result

[Image: Zenodo deposit]

Result

[Image: Zenodo deposit]

DOI and Linkages

[Image: Zenodo deposit]

Result

Goal 2: Be able to curate the data and code necessary for reproducible analysis

We have archived unreliably hosted data in a reliable location. ✔️

Adding configuration information

We can now use the following information to augment the replication:

# Zenodo DOI prefix
zenodo.prefix <- "10.5281/zenodo"
# Specific DOI - resolves to a fixed version
zenodo.id <- "2649598"
# We will recover the rest from the Zenodo API
zenodo.api <- "https://zenodo.org/api/records/"
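
For example, the two pieces combine into the full, citable DOI:

# Full DOI of the deposit: 10.5281/zenodo.2649598
zenodo.doi <- paste(zenodo.prefix, zenodo.id, sep = ".")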

(Behind the scenes)

We will parse the information that Zenodo gives us through an API:

https://zenodo.org/api/records/2649598

[Image: Zenodo API]

Automating the data acquisition

# Needs the rjson, tidyr, and dplyr packages
library(rjson); library(tidyr); library(dplyr)

We download the metadata from the API:

download.file(paste0(zenodo.api,zenodo.id),destfile=file.path(dataloc,"metadata.json"))

We read the JSON in:

latest <- fromJSON(file=file.path(dataloc,"metadata.json"))
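
A quick sanity check on what we retrieved (field names follow the Zenodo records API):

latest$doi              # should resolve to 10.5281/zenodo.2649598
latest$metadata$title   # title of the deposit
length(latest$files)    # number of files in this version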

We get the links to the actual CSV files (and the codebook):

file.list <- as.data.frame(latest$files) %>% select(starts_with("self")) %>% gather()

We download all the CSV files by checking whether the filename contains "csv":

for ( value in file.list$value ) {
    print(value)
    if ( grepl("csv", value) ) {
        print("Downloading...")
        file.name <- basename(value)
        download.file(value, destfile = file.path(dataloc, file.name))
    }
}
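
A simple check that the files actually arrived (optional):

# Verify that at least one CSV file now sits in dataloc
csv.files <- list.files(dataloc, pattern = "\\.csv$")
print(csv.files)
stopifnot(length(csv.files) > 0)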

Adding it to your copy of the code

You can now add this to your copy of the code:

Result

Goal 3: Robustness and automation - getting close to push-button reproducibility

Hold on

[Image: scan]

Another goal

Goal 4: Correctly document reproducible research

And: not make your collaborators mad…

Making code changes cautiously (branching)

If we want to incorporate the Zenodo data

We could

  • make all the changes right away
  • possibly mess up the live site or the latest version of the paper?
  • maybe annoy our co-authors?

But we used a version control system with branching!

We instead (the corresponding git commands are sketched below)

  • create a new branch zenodo
  • make all the changes there
  • can compare the changes to the main branch
  • consult with our co-authors before pulling the changes back into the main branch
  • our live site/paper remains valid the entire time
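
For reference, a sketch of the underlying git commands, wrapped in R system() calls so they can be run from the console; the default branch is assumed to be master and the file name is illustrative:

system("git checkout -b zenodo")            # create and switch to the new branch
# ... edit the code, then record the changes on that branch
system("git add README.Rmd")                # the file name is an assumption
system('git commit -m "Pull BDS data from Zenodo"')
system("git diff master..zenodo")           # compare against the main branch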

Do it now

[Image: scan]

Compare the changes: Version Control

Compare the changes: Version Control

We could then proceed to incorporate (pull or merge) the changes into the main repository:

[Image: scan]
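
Locally, that merge boils down to something like the following (again assuming the default branch is master; on Gitlab/Github this would usually go through a merge/pull request):

system("git checkout master")       # switch back to the main branch
system("git merge zenodo")          # bring in the reviewed changes
system("git push origin master")    # update the remote repository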

Read more about it at

Final result

The final result

  • will pull data from Zenodo
  • will reliably reproduce the graph as presented today
  • will use citable data (DOI = 10.5281/zenodo.2649598)
  • will have been achieved using replicable methods (before/after)

Lessons learned

Goal 1: Identify all the elements of a fully reproducible analysis

Data, source document, dependencies

Goal 2: Be able to curate the data and code necessary for reproducible analysis

So far:

  • source document
  • Gitlab
  • input data ✔️

Still left:

  • output document

Goal 3: Robustness and automation - getting close to push-button reproducibility

  • Rmarkdown document has code, text, and figures
  • Dependencies identified, addressed
  • Download automated ✔️

Goal 4: Correctly document reproducible research

  • Gitlab version control to document changes
  • Documenting dependencies for clarity
  • Incorporating changes in a transparent (reproducible) way ✔️

Now

[Image: scan]

Next steps

  • Making your research visible
  • Archiving your research

Making your research visible

Github Pages and Gitlab Pages are an easy way to publish project pages

  • You've already seen one: the original replication project page:

https://larsvilhuber.github.io/jobcreationblog/README.html

[Image: replicated page]

How do Github Pages work?

  • Will display “static” web pages
    • like the HTML page generated by the Knit button (see the sketch below)
  • Needs to be configured from the Settings

[Image: settings]
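
The page served by Github/Gitlab Pages is simply the knitted HTML, e.g. (file names are illustrative):

# Re-generate the static HTML that Pages will serve
rmarkdown::render("README.Rmd", output_file = "README.html")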

Now we have a web page!

Didn't we say those are not archives?

How to create an archive from your research project

[Image: zenodo_github]

Create a release

[Image: zenodo_github]

Create a release

[Image: zenodo_github]

Automatically creates an archive on Zenodo

![release](images/Github_Releases_2.png)

![release](images/Github_Zenodo_4.png)

Final result

The final result

  • pulls data from Zenodo
  • reliably reproduces the graph as presented today
  • uses citable data (DOI = 10.5281/zenodo.2649598)
  • was achieved using replicable methods (before/after is viewable)
  • is citable itself (DOI = 10.5281/zenodo.400356)
  • is accessible (https://larsvilhuber.github.io/jobcreationblog/README.html)

Lessons learned

Goal 1: Identify all the elements of a fully reproducible analysis

Data, source document, dependencies

Goal 2: Be able to curate the data and code necessary for reproducible analysis

  • source document
  • Gitlab
  • input data
  • output document ✔️

Goal 3: Robustness and automation - getting close to push-button reproducibility

  • Rmarkdown document has code, text, and figures
  • Dependencies identified, addressed
  • Download automated

Goal 4: Correctly document reproducible research

  • Gitlab version control to document changes
  • Documenting dependencies for clarity
  • Incorporating changes in a transparent (reproducible) way

Conclusion

Conclusion

Replication can be a lot of work

We've touched on

  • Replication per se
  • Replicable documents
  • Possible pitfalls of software dependencies
  • Cloud computing platforms
  • Permanence of source material (website, data) and how to solve it

[Image: project]

Conclusion

We have not covered everything

… because there can be a lot more

  • High-performance computing (length, quantity, throughput)
  • Issues with commercial (paid) software (access, permanence)
  • Data that is not public-use
  • Data in a locked room

[Image: SafePODS]

Thank you