Lars Vilhuber
2021-09-03
Cornell University
Goal 1: Be able to curate the data necessary for reproducible analysis
We have archived sample data and provenance information (survey form) in a reliable location. ✔️
Goal 2: Know when to do so
We did so at the earliest possible moment. ✔️
using the Google Sheet with updated Survey data
at the Zenodo deposit https://sandbox.zenodo.org/record/910136 (this one only works for the presenter)
We can now use the following information to augment the replication:
# Zenodo DOI prefix for Sandbox
zenodo.prefix <- "10.5072/zenodo"
# Specific ID for my deposit - resolves to the latest version!
zenodo.id <- "910135"
# We will recover the rest from Zenodo API
zenodo.api <- "https://sandbox.zenodo.org/api/records/"
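As a small illustration of how these three values fit together, the sketch below assembles the API URL and the DOI of the deposit. The names `api.url` and `deposit.doi` are introduced here for illustration only, and the "prefix, dot, ID" DOI pattern is an assumption inferred from the DOI printed later in this walkthrough.

```r
zenodo.prefix <- "10.5072/zenodo"
zenodo.id <- "910135"
zenodo.api <- "https://sandbox.zenodo.org/api/records/"

# URL queried for the metadata (same expression as in the download step)
api.url <- paste0(zenodo.api, zenodo.id)

# DOI of the deposit: prefix and ID joined by a dot (assumed pattern)
deposit.doi <- paste(zenodo.prefix, zenodo.id, sep = ".")

print(api.url)
print(deposit.doi)
```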
We will parse the information that Zenodo gives us through an API:
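# needs rjson, tidyr, dplyr (and readxl for read_excel further down)
library(rjson); library(tidyr); library(dplyr); library(readxl)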
We download the metadata from the API (to see what this looked like before we made any changes, see this version in the GitHub repository):
download.file(paste0(zenodo.api,zenodo.id),destfile=file.path(dataloc,"metadata.json"))
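Because the deposit ID resolves to the latest version, re-running the download later may return different metadata. One possible safeguard, sketched here under that assumption, is to download only when no local copy exists. The helper `fetch_once` is hypothetical and not part of the original code; the demonstration runs offline because the destination file already exists.

```r
# Hypothetical helper: download a file only if no local copy exists yet,
# so that an archived copy is never silently overwritten.
fetch_once <- function(url, dest) {
  if (file.exists(dest)) {
    message("Keeping existing copy: ", dest)
    return(invisible(FALSE))
  }
  download.file(url, destfile = dest, mode = "wb")
  invisible(TRUE)
}

# Offline demonstration: the destination already exists, so nothing is fetched.
demo.dest <- tempfile(fileext = ".json")
writeLines("{}", demo.dest)
downloaded <- fetch_once("https://sandbox.zenodo.org/api/records/910135", demo.dest)
print(downloaded)
```

The return value tells the caller whether a fresh download happened, which can be logged alongside the analysis.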
We read the JSON in:
latest <- fromJSON(file=file.path(dataloc,"metadata.json"))
We get the links to the actual XLSX files (and the Survey):
file.list <- as.data.frame(latest$files) %>% select(starts_with("self")) %>% gather()
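For readers who prefer to see the structure being navigated, here is a base-R sketch of the same extraction. The list `latest.demo` is a hand-built, hypothetical stand-in for the parsed metadata, containing only the fields used here; the `key` field and the exact nesting are assumptions, and the two URLs are the ones printed in the download step of this walkthrough.

```r
# Hypothetical stand-in for the parsed metadata: only the fields used here.
bucket <- "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65"
latest.demo <- list(
  doi = "10.5072/zenodo.910834",
  files = list(
    list(key = "browser-survey.xlsx",
         links = list(self = paste0(bucket, "/browser-survey.xlsx"))),
    list(key = "survey-print-version-20210902.pdf",
         links = list(self = paste0(bucket, "/survey-print-version-20210902.pdf")))
  )
)

# Pull the "self" link out of every file entry.
file.urls <- vapply(latest.demo$files, function(f) f$links$self, character(1))
print(basename(file.urls))
```

Compared with flattening to a data frame, `vapply` fails loudly if an entry has no single `self` link, which may be preferable when the API response changes.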
We download all the xlsx files, by checking whether the filename has xlsx in it:
for ( value in file.list$value ) {
  print(value)
  if ( grepl("xlsx", value) ) {
    print("Downloading...")
    file.name <- basename(value)
    # mode="wb" keeps the binary XLSX intact on Windows
    download.file(value, destfile=file.path(workpath, file.name), mode="wb")
  } else {
    print("Skipping.")
  }
}
[1] "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/browser-survey.xlsx"
[1] "Downloading..."
[1] "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65/survey-print-version-20210902.pdf"
[1] "Skipping."
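The test above matches "xlsx" anywhere in the URL. A stricter variant, sketched below, matches only a trailing `.xlsx` extension. The third URL is an invented example added purely to show the difference; it is not part of the deposit.

```r
bucket <- "https://sandbox.zenodo.org/api/files/b31d6ec7-fb8a-43a0-9bcd-a6e3fadedc65"
urls <- c(
  paste0(bucket, "/browser-survey.xlsx"),
  paste0(bucket, "/survey-print-version-20210902.pdf"),
  paste0(bucket, "/notes-on-xlsx-format.pdf")  # hypothetical file, for illustration
)

# Match ".xlsx" only at the end of the name, ignoring case.
is.xlsx <- grepl("\\.xlsx$", urls, ignore.case = TRUE)
xlsx.urls <- urls[is.xlsx]
print(basename(xlsx.urls))

# The looser test would also pick up the hypothetical PDF.
loose <- grepl("xlsx", urls)
print(sum(loose))
```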
Now that we have downloaded our own archival version of the survey, we can use it:
browser_survey <- read_excel(file.path(workpath,"browser-survey.xlsx"))
We should also preserve WHICH version we are actually using:
# The deposit ID always points to the latest version, but we want to identify which version that is:
latest.doi <- latest$doi
latest.doi
[1] "10.5072/zenodo.910834"
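One way to preserve this information alongside the analysis is to write it to a small text file. This is only a sketch: the file name `data-version.txt` is hypothetical, a temporary directory is used so that the code runs anywhere, and the DOI and creation timestamp are the values reported in this walkthrough.

```r
# Values as reported in this walkthrough.
latest.doi <- "10.5072/zenodo.910834"
created <- "2021-09-03T17:19:46.872665+00:00"

# Hypothetical file name; tempdir() keeps the sketch self-contained.
provenance.file <- file.path(tempdir(), "data-version.txt")
writeLines(c(paste("doi:", latest.doi),
             paste("created:", created)),
           provenance.file)

# Read it back to confirm what was recorded.
recorded <- readLines(provenance.file)
print(recorded)
```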
browser_survey %>%
  mutate(num_tabs = as.numeric(`How many browser tabs do you have open?`)) %>%
  group_by(`What browser do you use?`) %>%
  summarize(`Mean number of tabs` = mean(num_tabs, na.rm=TRUE)) -> table
Table: Browser tabs by browser type.
| What browser do you use? | Mean number of tabs |
|---|---|
| Chrome | 16.000 |
| Edge | 2.000 |
| Firefox | 21.875 |
| Safari | 16.250 |
Source: https://doi.org/10.5072/zenodo.910834, created 2021-09-03T17:19:46.872665+00:00.
Note: The DOI is fake: it was issued by the Zenodo Sandbox and does not resolve.
or latest!
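The summary step relies on two details: free-text answers are coerced to numbers, and answers that cannot be coerced become `NA` and are dropped by `na.rm=TRUE`. The toy example below illustrates this with base R (`tapply` instead of the dplyr pipeline). The data are invented for illustration and are unrelated to the actual survey responses.

```r
# Invented toy responses, not the real survey data.
toy <- data.frame(
  browser = c("Chrome", "Chrome", "Firefox", "Firefox"),
  tabs    = c("10", "too many", "4", "6"),
  stringsAsFactors = FALSE
)

# Non-numeric answers become NA (the coercion warning is suppressed here).
toy$num_tabs <- suppressWarnings(as.numeric(toy$tabs))

# Mean per browser, ignoring the NA produced above.
means <- tapply(toy$num_tabs, toy$browser, mean, na.rm = TRUE)
print(means)
```

Without `na.rm = TRUE`, the Chrome mean in this toy example would be `NA`, because one of its two answers could not be converted.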