Data Sharing and Archiving for Reproducibility (RT2 2021): Part 2

Lars Vilhuber
2021-09-03

Cornell University

Previously


Survey forms - ✔️✔️
Metadata - ✔️✔️
Sample data - ✔️✔️
Actual data

[Figure: Zenodo file list]

Result

Goal 1: Be able to curate the data necessary for reproducible analysis

We have archived sample data and provenance information (survey form) in a reliable location. ✔️

Goal 2: Know when to do so

We did so at the earliest possible moment. ✔️

Next step


Collect data
Update archive


Use preserved data
Analyze data

Update the data


using the Google Sheet with updated Survey data

Publish the data


at the Zenodo deposit https://sandbox.zenodo.org/record/910136 (this one only works for the presenter)

Using the data

Adding configuration information

We can now use the following information to augment the replication:

# Zenodo DOI prefix for Sandbox
zenodo.prefix <- "10.5072/zenodo"
# Specific ID for my deposit - resolves to a latest version!
zenodo.id <- "910135"
# We will recover the rest from Zenodo API
zenodo.api <- "https://sandbox.zenodo.org/api/records/"

(Behind the scenes)

We will parse the information that Zenodo gives us through an API:

https://sandbox.zenodo.org/api/records/910135
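
As a minimal sketch of how these settings fit together, the DOI of the deposit and the API address can be assembled with `paste0`. The names `concept.doi` and `api.url` are illustrative and do not appear in the original code.

```r
# Settings as defined on the configuration slide
zenodo.prefix <- "10.5072/zenodo"
zenodo.id     <- "910135"
zenodo.api    <- "https://sandbox.zenodo.org/api/records/"

# Hypothetical variable names, for illustration only
concept.doi <- paste0(zenodo.prefix, ".", zenodo.id)  # DOI of the deposit as a whole
api.url     <- paste0(zenodo.api, zenodo.id)          # metadata endpoint for that deposit

print(concept.doi)
print(api.url)
```

Keeping the prefix and the ID separate means that only `zenodo.id` has to change when the code is pointed at a different deposit.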

Automating the data acquisition

# needs rjson, tidyr, dplyr (and readxl for read_excel below)
library(rjson); library(tidyr); library(dplyr); library(readxl)

We download the metadata from the API. (To see what it looked like before we made any changes, see this version in the GitHub repository.)

download.file(paste0(zenodo.api,zenodo.id),destfile=file.path(dataloc,"metadata.json"))

We read the JSON in:

latest <- fromJSON(file=file.path(dataloc,"metadata.json"))

We get the links to the actual XLSX files (and the Survey):

file.list <- as.data.frame(latest$files) %>% select(starts_with("self")) %>% gather()
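
The same links can also be pulled out without reshaping a data frame. The sketch below assumes that each entry of `latest$files` is a list carrying a `key` (the file name) and a `links` element with a `self` URL. That layout is an assumption about the API response, and the hand-built `mock` list with its `example.org` URLs stands in for the parsed JSON so that the example runs offline.

```r
# Hypothetical stand-in for the list returned by fromJSON()
mock <- list(
  doi = "10.5072/zenodo.910834",
  files = list(
    list(key = "browser-survey.xlsx",
         links = list(self = "https://example.org/files/browser-survey.xlsx")),
    list(key = "survey-print-version-20210902.pdf",
         links = list(self = "https://example.org/files/survey-print-version-20210902.pdf"))
  )
)

# Extract one download URL per file entry
file.urls <- vapply(mock$files,
                    function(f) f[["links"]][["self"]],
                    character(1))
print(file.urls)
```

Using `vapply` with `character(1)` makes the code fail loudly if an entry lacks exactly one `self` link, which may be preferable to silently producing a malformed list.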

Automating the data acquisition

We download all the xlsx files, checking whether each file name contains xlsx:

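# Loop over every file link in the deposit
for ( value in file.list$value ) {
    print(value)
    # Only the spreadsheet files are needed
    if ( grepl("xlsx", value) ) {
        print("Downloading...")
        file.name <- basename(value)
        # mode="wb" keeps binary files intact on Windows
        download.file(value, destfile=file.path(workpath, file.name), mode="wb")
    } else {
        print("Skipping.")
    }
}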

[1] "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/browser-survey.xlsx"
[1] "Downloading..."
[1] "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/survey-print-version-20210902.pdf"
[1] "Skipping."
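
Matching `xlsx` anywhere in the URL would also pick up a file such as `xlsx-notes.pdf`. A slightly stricter variant, sketched here as one possible alternative to the test used in the loop above, anchors the pattern to the end of the name. The helper `is_xlsx` is a made-up name, and the URLs are placeholders.

```r
# Hypothetical helper: TRUE when the name ends in ".xlsx" (case-insensitive)
is_xlsx <- function(url) grepl("\\.xlsx$", url, ignore.case = TRUE)

# Placeholder URLs, for illustration only
urls <- c("https://example.org/files/browser-survey.xlsx",
          "https://example.org/files/survey-print-version-20210902.pdf",
          "https://example.org/files/xlsx-notes.pdf")

# Keep only the spreadsheet files
to.download <- urls[is_xlsx(urls)]
print(basename(to.download))
```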

Re-use of our own archive

Now that we have downloaded our own archival version of the survey, we can use it:

browser_survey <- read_excel(file.path(workpath,"browser-survey.xlsx"))

We should also preserve WHICH version we are actually using:

# The deposit ID always points to the latest version, but we want to identify which version that is:
latest.doi <- latest$doi
latest.doi

[1] "10.5072/zenodo.910834"
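
One way to keep that version information next to the results is to write it to a small text file. This is a sketch only: the `latest` list below is a stand-in holding the two values shown on these slides, the `created` field name is an assumption about the API response, and `source.note` and the file name `data-version.txt` are illustrative.

```r
# Stand-in for the parsed metadata (values copied from the slides)
latest <- list(doi = "10.5072/zenodo.910834",
               created = "2021-09-03T17:19:46.872665+00:00")

# Build a human-readable provenance note
source.note <- paste0("Source: https://doi.org/", latest$doi,
                      ", created ", latest$created, ".")

# Save it alongside the outputs (a temporary directory is used here)
note.file <- file.path(tempdir(), "data-version.txt")
writeLines(source.note, note.file)
print(readLines(note.file))
```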

Crosstab

browser_survey %>% 
  mutate(num_tabs = as.numeric(`How many browser tabs do you have open?`)) %>%
  group_by(`What browser do you use?`) %>%
  summarize(`Mean number of tabs` = mean(num_tabs,na.rm=TRUE)) -> table
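
The `as.numeric()` conversion matters because spreadsheet answers may arrive as text. Entries that cannot be parsed become `NA` and are then dropped by `na.rm=TRUE`. A base-R sketch on made-up toy responses (not the survey data; the column names `browser` and `tabs` are illustrative) shows the effect:

```r
# Made-up toy responses, NOT the actual survey data
toy <- data.frame(browser = c("Firefox", "Firefox", "Chrome", "Chrome"),
                  tabs    = c("10", "many", "4", "6"),
                  stringsAsFactors = FALSE)

# "many" cannot be converted and becomes NA (the coercion warning is silenced)
toy$num_tabs <- suppressWarnings(as.numeric(toy$tabs))

# Mean per browser, ignoring the NA
means <- tapply(toy$num_tabs, toy$browser, mean, na.rm = TRUE)
print(means)
```

Without `na.rm = TRUE`, the mean for any browser with an unparseable answer would itself be `NA`.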

Result


Table: Browser tabs by browser type.

What browser do you use?	Mean number of tabs
Chrome	16.000
Edge	2.000
Firefox	21.875
Safari	16.250

Source: https://doi.org/10.5072/zenodo.910834,
created 2021-09-03T17:19:46.872665+00:00.

Note: DOI is fake.

Lessons learned

[Figure: cycle]

Goal 1: Be able to curate the data necessary for reproducible analysis ✔️
Goal 2: Know when to do so ✔️
Goal 3: Choose license (while respecting ethics)

[Figure: Zenodo file list]

or latest!

Next steps

Conclusion

Thank you

Presentation: https://labordynamicsinstitute.github.io/tutorial-data-sharing-archiving-2021

Source: https://github.com/labordynamicsinstitute/tutorial-data-sharing-archiving-2021