Data Sharing and Archiving for Reproducibility (RT2 2021): Part 2

Lars Vilhuber
2021-09-03

Cornell University

Previously

  • Survey forms - ✔️✔️
  • Metadata - ✔️✔️
  • Sample data - ✔️✔️
  • Actual data

[Figure: Zenodo file list]

Result

Goal 1: Be able to curate the data necessary for reproducible analysis

We have archived sample data and provenance information (survey form) in a reliable location. ✔️

Goal 2: Know when to do so

We did so at the earliest possible moment. ✔️

Next step

  • Collect data
  • Update archive

  • Use preserved data
  • Analyze data

Update the data

Publish the data

Published at the Zenodo deposit https://sandbox.zenodo.org/record/910136 (this link only works for the presenter)

Using the data

Adding configuration information

We can now use the following information to augment the replication:

# Zenodo DOI prefix for Sandbox
zenodo.prefix <- "10.5072/zenodo"
# Specific ID for my deposit - always resolves to the latest version!
zenodo.id <- "910135"
# We will recover the rest from the Zenodo API
zenodo.api <- "https://sandbox.zenodo.org/api/records/"

(Behind the scenes)

We will parse the information that Zenodo gives us through an API:

https://sandbox.zenodo.org/api/records/910135
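
As a minimal sketch, the API address above is simply the API base URL followed by the deposit ID, and the matching DOI follows Zenodo's `prefix.id` pattern. The names `record.url` and `deposit.doi` are hypothetical and only used for illustration here.

```r
# Configuration values as defined on the previous slide
zenodo.prefix <- "10.5072/zenodo"
zenodo.id     <- "910135"
zenodo.api    <- "https://sandbox.zenodo.org/api/records/"

# Hypothetical helper names: URL of the API record and DOI of the deposit
record.url  <- paste0(zenodo.api, zenodo.id)
deposit.doi <- paste(zenodo.prefix, zenodo.id, sep = ".")

print(record.url)
print(deposit.doi)
```

Keeping the three pieces separate means that only `zenodo.id` needs to change when the same code is pointed at a different deposit.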

[Figure: Zenodo API response]

Automating the data acquisition

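# needs rjson, tidyr, dplyr (and readxl later, for read_excel)
library(rjson); library(tidyr); library(dplyr); library(readxl)

We download the metadata from the API (to see what it looked like before we made any changes, see the earlier version in the GitHub repository):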

download.file(paste0(zenodo.api,zenodo.id),destfile=file.path(dataloc,"metadata.json"))

We read the JSON in:

latest <- fromJSON(file=file.path(dataloc,"metadata.json"))

We get the links to the actual XLSX files (and the Survey):

file.list <- as.data.frame(latest$files) %>% select(starts_with("self")) %>% gather()
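
The same links can also be pulled out without reshaping a data frame. The sketch below uses base R only and a hand-built stand-in for the parsed record; it assumes that each entry of `files` carries its download address in `links$self`, which is what the pipeline above relies on. The `key` field and the name `file.urls` are assumptions made for illustration.

```r
# Hand-built stand-in for the parsed JSON record (structure is an assumption)
latest <- list(
  doi = "10.5072/zenodo.910834",
  files = list(
    list(key = "browser-survey.xlsx",
         links = list(self = "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/browser-survey.xlsx")),
    list(key = "survey-print-version-20210902.pdf",
         links = list(self = "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/survey-print-version-20210902.pdf"))
  )
)

# One download URL per file entry; vapply() fails loudly if an entry has no link
file.urls <- vapply(latest$files, function(f) f$links$self, character(1))
print(file.urls)
```

This variant does not depend on how nested list names are flattened into column names, at the cost of assuming the nesting explicitly.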

Automating the data acquisition

We download all the XLSX files, by checking whether the file name ends in .xlsx:

for ( value in file.list$value ) {
    print(value)
    # Only download Excel files; skip everything else (e.g., the PDF survey form)
    if ( grepl("\\.xlsx$",value) ) {
        print("Downloading...")
        file.name <- basename(value)
        # mode="wb" keeps the binary XLSX file intact on Windows
        download.file(value,destfile=file.path(workpath,file.name),mode="wb")
    } else {
        print("Skipping.")
    }
}
[1] "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/browser-survey.xlsx"
[1] "Downloading..."
[1] "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/survey-print-version-20210902.pdf"
[1] "Skipping."
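
The selection step can be tried without touching the network. The following sketch applies the file-name test to the two URLs printed above and shows which local file names would result; `file.urls`, `to.download`, and `local.names` are hypothetical names used only here.

```r
# The two links printed by the loop above
file.urls <- c(
  "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/browser-survey.xlsx",
  "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/survey-print-version-20210902.pdf"
)

# Keep only links whose name ends in .xlsx (case-insensitive)
to.download <- file.urls[grepl("\\.xlsx$", file.urls, ignore.case = TRUE)]

# Local file names that the downloads would be saved under
local.names <- basename(to.download)
print(local.names)
```

Anchoring the pattern at the end of the name avoids accidentally matching a file such as a PDF that merely has "xlsx" somewhere in its name.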

Re-use of our own archive

Now that we have downloaded our own archival version of the survey, we can use it:

browser_survey <- read_excel(file.path(workpath,"browser-survey.xlsx"))

We should also preserve WHICH version we are actually using:

# The deposit ID always points to the latest version, but we want to identify which version that is:
latest.doi <- latest$doi
latest.doi
[1] "10.5072/zenodo.910834"
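
One way to preserve that information is to write it next to the analysis outputs. The sketch below is an illustration under assumptions: it uses a hand-built stand-in for `latest`, assumes the record's timestamp is available as `latest$created`, and writes to a temporary directory; `source.note` and `provenance.file` are hypothetical names.

```r
# Stand-in for the parsed record; the `created` field name is an assumption
latest <- list(
  doi     = "10.5072/zenodo.910834",
  created = "2021-09-03T17:19:46.872665+00:00"
)

# Build a human-readable source note from the version-specific DOI
source.note <- sprintf("Source: https://doi.org/%s, created %s.",
                       latest$doi, latest$created)

# Store the note in a small text file (tempdir() is used here for illustration)
provenance.file <- file.path(tempdir(), "data-version.txt")
writeLines(source.note, provenance.file)

print(readLines(provenance.file))
```

In a real project the file would live in the project's output folder, so that every table can be traced back to the exact version of the data it was built from.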

Crosstab

browser_survey %>% 
  mutate(num_tabs = as.numeric(`How many browser tabs do you have open?`)) %>%
  group_by(`What browser do you use?`) %>%
  summarize(`Mean number of tabs` = mean(num_tabs,na.rm=TRUE)) -> table
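
The same computation can be cross-checked in base R. The sketch below uses a tiny made-up data set, not the survey responses, and the column names `browser` and `tabs` are hypothetical. It mirrors the two key steps: converting the free-text answer to a number (non-numeric answers become `NA`) and averaging by browser while ignoring missing values.

```r
# Made-up toy responses, for illustration only
toy <- data.frame(
  browser = c("Firefox", "Firefox", "Chrome", "Chrome", "Edge"),
  tabs    = c("10", "30", "16", "many", "2"),
  stringsAsFactors = FALSE
)

# Non-numeric answers such as "many" become NA (the warning is expected)
toy$num_tabs <- suppressWarnings(as.numeric(toy$tabs))

# Mean number of tabs per browser, ignoring missing values
mean.tabs <- tapply(toy$num_tabs, toy$browser, mean, na.rm = TRUE)
print(mean.tabs)
```

Dropping `na.rm = TRUE` would turn the mean for any browser with an unparseable answer into `NA`, which is why the original pipeline sets it.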

Result

Table: Browser tabs by browser type.

What browser do you use?    Mean number of tabs
Chrome                      16.000
Edge                         2.000
Firefox                     21.875
Safari                      16.250

Source: https://doi.org/10.5072/zenodo.910834,
created 2021-09-03T17:19:46.872665+00:00.

Note: This DOI was issued by the Zenodo sandbox; it is a test DOI and does not resolve.

Lessons learned

  • Goal 1: Be able to curate the data necessary for reproducible analysis ✔️
  • Goal 2: Know when to do so ✔️
  • Goal 3: Choose license (while respecting ethics)

[Figure: Zenodo file list]

or latest!

Next steps

Conclusion

Thank you