Preserving Survey Data

Lars Vilhuber

Laurel Krovetz

2026-06-14

labordynamicsinstitute.github.io/tutorial-preserving-survey/presentation/

The problem of credibility

How can we know that a data source is reliably obtained?

Consider the case of Gino

Francesca Gino

The case of Gino

Francesca Gino was a tenured professor at Harvard Business School, writing on honesty (!)

The case of Gino

Several articles were investigated by third parties (Data Colada, in particular ¹), and found to be problematic

The case of Gino

At least one of them had manipulated data AFTER it had been collected, BEFORE it had been analyzed.

A generic survey workflow

Generic survey processing

Requiring transparency in academia

Where we are headed

Modern verification processes

Verifying transparency in academia

Verification by journals

Provision (publication of materials) provides transparency
Verification (running the analysis again - computational reproducibility) compensates for mistrust/absence of trust

Which journals

American Economic Association (8)
Econometric Society (3)
Canadian Journal of Economics (1)
Royal Economic Society (2)
Western Economic Association International (1)
European Economic Association (1)
Review of Economic Studies (1)
Journal of the European Economic Association (1)
Journal of Political Economy (3)
American Journal of Political Science (1)
American Political Science Review (1)

Verification by others

Pre-publication: cascad

Post-publication: Data Colada, Institute for Replication

Verification by institutions

World Bank

Outline of the tutorial

Basic

how to process,
de-identify, and
analyze survey data
publish data

Expansion

how to download automatically
how to preserve automatically
how to do so in a credible and transparent fashion.

In a nutshell

We’ll use an API to retrieve the data, show you how to clean and strip the data of confidential information and non-consenting responses, and use another API to preserve the data.

Goals

Create a survey (in Qualtrics) for data collection.
Load the latest responses from the server (using an API)
Clean and process the data to remove non-public data automatically.
Preserve shareable data in a trusted repository
Later, publish those data with a credible record of when they were first preserved!

Some notes on Qualtrics

Create a survey (in Qualtrics) for data collection.
Load the latest responses from the server using an API

There is not much in this tutorial that requires Qualtrics.

You could do this with SurveyCTO, LimeSurvey, or any other system that has an API.
You could do this with Google Forms, if you have linked it to a Google Sheet.
You could do this with a lab experiment system that stores data in an SQL database

Some notes on removal of PII

Clean and process the data to remove non-public data automatically.

It is important to remove any PII or confidential information as soon as possible.

Some notes on removal of PII

That may not always be feasible. For instance, if you need geolocation to merge in contextual data, or compute distances, then some data processing may unavoidably require access to sensitive data.
But any data that is not needed should be removed early on.
This is not irreversible: if you later find that you need more data elements, you can always re-process the raw data, stored on Qualtrics, until your IRB requires that you delete those data.

Preservation

Preservation != publication, != sharing
In fact, preservation may mean: not very accessible at all!
Preservation is intended to maintain data for tens, even hundreds of years
Preservation may involve curation: active transformation of the data for improved accessibility

Test

Who has a GitHub account?

Test

Who has a GitHub account?
How long does it take you to delete your entire GitHub repository, forever?

Demonstrating the core steps in R

The core steps

We will first walk through the core steps you can do by hand, in R:

Collect — create the survey and gather responses (in Qualtrics)
Download — export the responses from the web interface
Analyze — load, clean, and process the data in R
Publish — save the cleaned data, ready to share

This is not a full Qualtrics tutorial!

Creating a survey in Qualtrics

You’ll typically have access to a Qualtrics account through your university or organization. Then it is easy to construct a survey using the web tool.

Side-note: Survey definition from Qualtrics

You should not forget to preserve your survey definition!

Export the survey as a .qsf file (under Tools in Qualtrics) to save and transfer the survey structure; it doubles as a backup survey template.
You can also export the survey as a Word document (also under Tools) to get a well-formatted, human-readable copy.

Side-note to side-note: Confidential data in survey definitions!

It is possible that your survey definition itself contains information that you are not allowed to publish:

You might be running the survey with a firm, and the firm does not want to be identified
You are asking questions about specific products, and the product names are confidential

It is actually hard to de-identify a qsf file. We will not try to do this here, but you should be aware of this issue.

Take our survey

Survey responses in Qualtrics

Responses can be easily checked at a glance in the Data and Analytics tab. 🔒

Downloading data

You can download data directly from this page

If you do this only once, downloading manually is fine.
Do it 2-3 times, you may want to program it!

Download options

You can download data directly from this page

Do it 2-3 times, you may want to program it!

Downloaded data

The data downloaded depends on parameters chosen. For instance, downloading as CSV with default settings yields

StartDate,EndDate,Status,Progress,Duration (in seconds),Finished,RecordedDate,ResponseId,DistributionChannel,UserLanguage,consent,age_1,gender,education,num_tabs_1,name_confidential,number_confidential
Start Date,End Date,Response Type,Progress,Duration (in seconds),Finished,Recorded Date,Response ID,Distribution Channel,User Language,"This brief survey will be used as a demonstration of how to collect data, clean the data and remove any confidential information, and publish the data. The information collected is entirely anonymous. It will be used as part of the tutorial for educational purposes. By continuing, you agree that the data you enter will be stored and used for these purposes. You do not need to fill out this information in order to participate in the tutorial. At any point you can choose to stop participating in the survey or not answer any question. Do you consent to participating in this survey?",What is your age? - Age (years),What is your gender?,What is your highest completed level of education?,"On your computer currently, how many open browser tabs do you have? - Number of tabs","This question will ask you for a piece of confidential information. Do not respond with a true answer, but instead make up a response. Question: what is your name?","This question will ask you for a piece of confidential information. Do not respond with a true answer, but instead make up a response. Question: what is your phone number?"
"{""ImportId"":""startDate"",""timeZone"":""America/New_York""}","{""ImportId"":""endDate"",""timeZone"":""America/New_York""}","{""ImportId"":""status""}","{""ImportId"":""progress""}","{""ImportId"":""duration""}","{""ImportId"":""finished""}","{""ImportId"":""recordedDate"",""timeZone"":""America/New_York""}","{""ImportId"":""_recordId""}","{""ImportId"":""distributionChannel""}","{""ImportId"":""userLanguage""}","{""ImportId"":""QID1""}","{""ImportId"":""QID2_1""}","{""ImportId"":""QID3""}","{""ImportId"":""QID4""}","{""ImportId"":""QID5_1""}","{""ImportId"":""QID6_TEXT""}","{""ImportId"":""QID7_TEXT""}"
2025-07-01 11:13:44,2025-07-01 11:14:18,IP Address,100,34,True,2025-07-01 11:14:19,R_5rYfeErcBsS3nsJ,anonymous,EN,Yes,24,Female,Master's degree,3,Harry Potter,555-555-5555
2025-07-01 11:23:01,2025-07-01 11:23:28,IP Address,100,26,True,2025-07-01 11:23:28,R_5rHTV2kfYGjFPep,anonymous,EN,No,21,Male,Bachelor's degree,11,Ronald Weasley,555-555-5555

Loading the downloaded data into R

You downloaded the responses from the Qualtrics web interface (previous slide). Now read that exported file into R.

datesuffix <- "June+16,+2026_15.35"
fileprefix <- "Testing+preservation_"
filename <- paste0(fileprefix, datesuffix, ".csv")

Some data org hygiene

We want to be careful about managing our data structure:³

# Path to the file you downloaded from Qualtrics
datapath <- here::here("data")
rawdatapath <- file.path(datapath, "raw-confidential")
confdatapath <- file.path(datapath, "confidential")
cleandatapath <- file.path(datapath, "clean")
metadatapath <- file.path(datapath, "metadata")

Minor thing

Let’s ensure that these paths all exist!

for (path in list(rawdatapath, confdatapath, cleandatapath, metadatapath)) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
    message("Created directory: ", path)
  } else {
    message("Directory already exists: ", path)
  }
}

Directory already exists: /home/runner/work/tutorial-preserving-survey/tutorial-preserving-survey/data/raw-confidential

Created directory: /home/runner/work/tutorial-preserving-survey/tutorial-preserving-survey/data/confidential

Created directory: /home/runner/work/tutorial-preserving-survey/tutorial-preserving-survey/data/clean

Created directory: /home/runner/work/tutorial-preserving-survey/tutorial-preserving-survey/data/metadata

Loading the downloaded data into R

Any CSV reader works too, with some adjustments.

library(readr)
# skip the header row and the two Qualtrics metadata rows
data.raw <- read_csv(file.path(rawdatapath, filename), skip = 3,
            col_names = FALSE)
# read the header separately to get column names
header <- read.csv(file.path(rawdatapath, filename), 
            nrows=0)
colnames(data.raw) = colnames(header)
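The second row of a Qualtrics CSV export holds the question text, which is worth keeping as a simple codebook rather than discarding. A minimal base-R sketch, assuming the three-row header layout shown earlier; the helper name read_qualtrics_codebook and the toy file are illustrative and not part of the tutorial's code:

```r
# Hypothetical helper: build a codebook (variable name + question text)
# from the first two rows of a Qualtrics-style CSV export.
read_qualtrics_codebook <- function(path) {
  # Read the first two rows as plain text, without treating either as a header
  top <- read.csv(path, header = FALSE, nrows = 2,
                  stringsAsFactors = FALSE, colClasses = "character")
  data.frame(variable = unlist(top[1, ], use.names = FALSE),
             question = unlist(top[2, ], use.names = FALSE),
             stringsAsFactors = FALSE)
}

# Toy file mimicking the export layout (made-up content)
toy <- tempfile(fileext = ".csv")
writeLines(c(
  'ResponseId,consent,age_1',
  'Response ID,"Do you consent to participating?",What is your age? - Age (years)',
  '"{""ImportId"":""_recordId""}","{""ImportId"":""QID1""}","{""ImportId"":""QID2_1""}"',
  'R_001,Yes,24'
), toy)

codebook <- read_qualtrics_codebook(toy)
print(codebook)
```

The resulting table could be written to the metadata folder with write.csv(), provided the question text itself contains nothing confidential.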

Loading the downloaded data into R

head(data.raw)

# A tibble: 6 × 17
  StartDate           EndDate             Status  Progress Duration..in.seconds.
  <dttm>              <dttm>              <chr>      <dbl>                 <dbl>
1 2025-07-01 11:13:44 2025-07-01 11:14:18 IP Add…      100                    34
2 2025-07-01 11:23:01 2025-07-01 11:23:28 IP Add…      100                    26
3 2025-07-01 11:26:40 2025-07-01 11:26:40 Survey…      100                     0
4 2025-07-01 11:27:12 2025-07-01 11:27:12 Survey…      100                     0
5 2025-07-01 11:30:26 2025-07-01 11:30:47 IP Add…      100                    21
6 2025-12-19 12:20:44 2025-12-19 12:21:19 IP Add…      100                    35
# ℹ 12 more variables: Finished <lgl>, RecordedDate <dttm>, ResponseId <chr>,
#   DistributionChannel <chr>, UserLanguage <chr>, consent <chr>, age_1 <dbl>,
#   gender <chr>, education <chr>, num_tabs_1 <dbl>, name_confidential <chr>,
#   number_confidential <chr>

Loading with `qualtRics` package

The qualtRics⁴ package can read a Qualtrics CSV export directly with read_survey():

library(qualtRics)

data.raw <- read_survey(file.path(rawdatapath, filename))

Loading with `qualtRics` package

head(data.raw)

# A tibble: 6 × 17
  StartDate           EndDate             Status Progress Duration (in seconds…¹
  <dttm>              <dttm>              <chr>     <dbl>                  <dbl>
1 2025-07-01 11:13:44 2025-07-01 11:14:18 IP Ad…      100                     34
2 2025-07-01 11:23:01 2025-07-01 11:23:28 IP Ad…      100                     26
3 2025-07-01 11:26:40 2025-07-01 11:26:40 Surve…      100                      0
4 2025-07-01 11:27:12 2025-07-01 11:27:12 Surve…      100                      0
5 2025-07-01 11:30:26 2025-07-01 11:30:47 IP Ad…      100                     21
6 2025-12-19 12:20:44 2025-12-19 12:21:19 IP Ad…      100                     35
# ℹ abbreviated name: ¹`Duration (in seconds)`
# ℹ 12 more variables: Finished <lgl>, RecordedDate <dttm>, ResponseId <chr>,
#   DistributionChannel <chr>, UserLanguage <chr>, consent <chr>, age_1 <dbl>,
#   gender <chr>, education <chr>, num_tabs_1 <dbl>, name_confidential <chr>,
#   number_confidential <chr>

Cleaning data

We filter the data to only include those who consented
We remove survey preview responses
(Optionally) remove responses that took place outside the relevant window.
Remove confidential data (variables name_confidential and number_confidential in our survey, for example).

library(dplyr)
# keep consenting, non-preview responses within the time window
data.confidential <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
           ResponseId,consent,age_1,gender,education,
           num_tabs_1,name_confidential,number_confidential)
# drop the confidential variables for the publishable version
data.clean <- data.confidential |>
  select(-name_confidential, -number_confidential)

Cleaning data by selection

We could also simply not select the confidential data if we don’t actually need it.

 data.clean <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
           ResponseId,consent,age_1,gender,education,num_tabs_1)
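Before anything is written to the publishable folder, a cheap automated check can confirm that no confidential column slipped through. A minimal base-R sketch; the function name assert_no_confidential and the toy rows are assumptions made for illustration:

```r
# Hypothetical guard: stop if any column flagged as confidential is still
# present in a data frame that is about to be published.
assert_no_confidential <- function(df,
                                   confidential = c("name_confidential",
                                                    "number_confidential")) {
  leaked <- intersect(names(df), confidential)
  if (length(leaked) > 0) {
    stop("Confidential columns still present: ",
         paste(leaked, collapse = ", "))
  }
  invisible(df)
}

# Made-up rows for illustration only
toy.raw <- data.frame(
  ResponseId = c("R_001", "R_002"),
  consent = c("Yes", "No"),
  age_1 = c(24, 21),
  name_confidential = c("Harry Potter", "Ronald Weasley"),
  stringsAsFactors = FALSE
)
toy.clean <- toy.raw[toy.raw$consent == "Yes",
                     c("ResponseId", "consent", "age_1")]

assert_no_confidential(toy.clean)   # passes silently
# The raw data fails the check; capture the error instead of stopping
leak_caught <- inherits(try(assert_no_confidential(toy.raw), silent = TRUE),
                        "try-error")
```

Calling such a guard right before write.csv() or saveRDS() makes the pipeline fail loudly instead of publishing by accident.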

Using confidential data

We could also (hypothetically) immediately compute variables that rely on confidential data.

# not run
 data.clean <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
    ResponseId,consent,age_1,gender,education,num_tabs_1,
           gps_lat, gps_lon) |>
    mutate(distance = compute_distance_from_cornell(
    gps_lat,gps_lon,precision="100m")) |>
    select(-gps_lat, -gps_lon)
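The function compute_distance_from_cornell() above is hypothetical. One way such a helper might look is a haversine distance rounded to a coarse precision. This is a sketch under assumptions: the reference coordinates are only approximate, the precision is passed as a number of metres rather than the string "100m", and all names are illustrative:

```r
# Great-circle (haversine) distance in metres between two points.
haversine_m <- function(lat1, lon1, lat2, lon2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Hypothetical stand-in for compute_distance_from_cornell():
# distance to an approximate reference location, rounded to the nearest
# multiple of `precision_m` to coarsen the published value.
distance_from_cornell <- function(gps_lat, gps_lon, precision_m = 100,
                                  ref_lat = 42.45, ref_lon = -76.48) {
  d <- haversine_m(gps_lat, gps_lon, ref_lat, ref_lon)
  round(d / precision_m) * precision_m
}

# Two made-up locations: the reference point itself and a distant one
distance_from_cornell(c(42.45, 40.71), c(-76.48, -74.01))
```

Only the rounded distance is kept in the clean data; the coordinates themselves are dropped, as in the pipeline above.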

Saving confidential and clean data

save the confidential data to a clearly marked folder (J-PAL policy: an encrypted volume)
save the cleaned publishable data to a well-defined folder.

# save confidential data NOT for publishing, if needed
write.csv(data.confidential, file.path(confdatapath,"confidential_data.csv"), 
        row.names = FALSE)
saveRDS(data.confidential, file.path(confdatapath,"confidential_data.rds"))
# saving clean data for publishing
write.csv(data.clean, file.path(cleandatapath,"clean_data.csv"), 
         row.names = FALSE)

Descriptive statistics

Now you are ready to use your cleaned data for reproducible analyses!

Stepping it up

Stepping up the process

API,
checksums,
trusted systems
preservation

Credibility of the data flow

What could go wrong?

How to CERTIFY the full process?

Taking it a step further

Survey tool provider (Qualtrics, etc.) exports data, posts checksum
Survey tool provider exports data only to institution directly into trusted repository, researchers obtain data from there (with privacy protections)
Has been discussed by authors behind Data Colada
Don’t hold your breath…

Using automation

APIs

An API (Application Programming Interface) is a mechanism that enables two software components to communicate with each other
APIs can be used to request data or services and get responses without needing to know how the other program works internally
We will use APIs to streamline and automate the processing

Loading data from Qualtrics using an API

We need to know a few things:

the URL we want to use, defined by a generic part, and a survey specific part
these are public - no need for secrecy

# qualtrics URL components
QUALTRICS_FULL_URL <- "first part of survey URL"

QUALTRICS_SURVEY <- "second part of survey URL, usually starts with SV"

Loading data from Qualtrics using an API

We may want to limit the responses we download programmatically. This is not part of the API, but it is good programming practice.

library(lubridate)
# Keep only responses in the desired window of time
QUALTRICS_STIME <- ymd_hms("2025-07-01 00:00:01")
QUALTRICS_ETIME <- ymd_hms("2025-08-26 23:59:00")

Fetching the data with the API

The API call replaces the manual download from before:

  data.raw <- fetch_survey(surveyID = QUALTRICS_SURVEY, 
                           verbose = TRUE)

BUT: Privacy! Confidentiality!

Can anybody just download these data?

NO!

We need to authenticate, but not by entering a password manually.

That’s where the API token comes in.

Fetching the data with the API

We need to set an API token, then we can download this.

if (Sys.getenv("QUALTRICS_API_KEY") != "") {
  data.raw <- fetch_survey(surveyID = QUALTRICS_SURVEY, verbose = TRUE) 
} else {
  stop("Please set the QUALTRICS_API_KEY environment 
  variable to your API key.")
}

The rest of the pipeline is unchanged

data.raw now comes from the API instead of a downloaded file — but the cleaning and saving steps from before are exactly the same:

clean_data <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
    ResponseId,consent,age_1,gender,education,num_tabs_1)

write.csv(clean_data, file.path(cleandatapath,"clean_data.csv"), 
row.names = FALSE)

Not unique to Qualtrics

And of course this works just fine in Python (and, via Python, in Stata!)

Qualtrics and API tokens.

An API token is assigned to your Qualtrics account. Where do you find it?

Setting API tokens

Not specific to the Qualtrics API!

Set it manually:

Sys.setenv(QUALTRICS_API_KEY = "ab7ece8b")

Set it using environment variables stored outside your code (e.g., in .Renviron file) - good for testing

# This is .Renviron
QUALTRICS_API_KEY="ab7ece8b"

Setting API tokens

We want to automate on cloud servers!

Push these “secrets” to GitHub Secrets and load them in GitHub Actions

Using API tokens

Now we need to make it available to our code (regardless of where it comes from)

# Here environment variables are read from .Renviron
QUALTRICS_API_KEY <- Sys.getenv("QUALTRICS_API_KEY")
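Since the workflow relies on several environment variables, it can help to check them all in one place and fail early with a clear message. A minimal base-R sketch; the helper name require_env and the variable DEMO_API_KEY are made up for illustration:

```r
# Hypothetical helper: read a set of environment variables and fail early,
# naming every variable that is not set, before any API call is attempted.
require_env <- function(vars) {
  values <- Sys.getenv(vars, unset = "", names = TRUE)
  absent <- vars[values == ""]
  if (length(absent) > 0) {
    stop("Please set these environment variables: ",
         paste(absent, collapse = ", "))
  }
  as.list(values)
}

# Illustration with a dummy value; real tokens belong in .Renviron
# or in a secret store, never in the code itself
Sys.setenv(DEMO_API_KEY = "not-a-real-token")
secrets <- require_env("DEMO_API_KEY")
```

In the tutorial's setting, the call might be require_env(c("QUALTRICS_API_KEY", "QUALTRICS_BASE_URL")) at the top of the script.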

Full code

Now this works both locally and on cloud servers without any manual interaction!

QUALTRICS_FULL_URL <- "first part of survey URL"
QUALTRICS_SURVEY <- "second part of survey URL, usually starts with SV"

if (Sys.getenv("QUALTRICS_API_KEY") != "") {
  data.raw <- fetch_survey(surveyID = QUALTRICS_SURVEY, verbose = TRUE) 
} else {
  stop("Please set the QUALTRICS_API_KEY environment 
  variable to your API key.")
}

Side-note: Qualtrics API credentials

Qualtrics API credentials cannot be restricted to a single survey.

Traditional Static API Tokens (`X-API-TOKEN`)

The Limit: You can only have one active static API token per user account at a time.
The Catch: If you go to your account settings and generate a new token for a second application, it will immediately overwrite and invalidate the old token, breaking your first application.

OAuth 2.0 Client Credentials

Separate, independent credentials for different applications:

Account Settings > Qualtrics IDs > OAuth Client Manager.
Click Create Client to generate unique sets of Client ID and Client Secret credentials.
You can create multiple clients for different applications
To revoke access for one app, you can delete its specific client without affecting the others.

BUT: the API key can access ALL your surveys

Regardless of method, the API key can access ALL of your surveys!
There is NO way to restrict that natively.

That’s a problem.

Workaround: “Service Account”

You can create a specific user, say jpal-survey-user
Requires support from system administrator
Once the user is created, proceed as before, but when logged in as jpal-survey-user!
As YOU, share only the specific survey with jpal-survey-user and give it only the permissions it needs

It’s a bit more complicated…

Secrets

Secrets (GitHub version)

You will want to keep API keys safe using GitHub Secrets.
Secrets allow you to store sensitive information in the repository environment.
Use the secret as an environment variable in the GitHub workflow file.

Storing secrets in `.Renviron` locally

You already have a .Renviron for local development:

QUALTRICS_API_KEY='something here'

Do not publish this file!
Do not commit it to GitHub!⁵
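One way to reduce the risk of committing the file by accident is to make sure it is listed in .gitignore. A minimal base-R sketch, demonstrated on a throwaway directory; the helper name ensure_gitignored is an assumption made for illustration:

```r
# Hypothetical helper: make sure a file name is listed in .gitignore,
# appending it if necessary. Returns TRUE if the entry had to be added.
ensure_gitignored <- function(entry = ".Renviron", repo = ".") {
  gitignore <- file.path(repo, ".gitignore")
  current <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  if (entry %in% trimws(current)) {
    return(invisible(FALSE))
  }
  writeLines(c(current, entry), gitignore)
  invisible(TRUE)
}

# Try it in a throwaway directory rather than a real repository
demo.repo <- tempfile("demo-repo")
dir.create(demo.repo)
added.first <- ensure_gitignored(".Renviron", demo.repo)   # adds the entry
added.again <- ensure_gitignored(".Renviron", demo.repo)   # already listed
```

Note that .gitignore only prevents future additions: a file that was already committed stays in the repository history and its token should be treated as compromised.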

Storing secrets in GitHub

Enter them manually in the GitHub web interface
Use the .Renviron file to set the GitHub Actions secrets with the GitHub CLI:

gh secret set -f .Renviron


✓ Set Actions secret DATAVERSE_TOKEN for labordynamicsinstitute/tutorial-preserving-survey
✓ Set Actions secret QUALTRICS_BASE_URL for labordynamicsinstitute/tutorial-preserving-survey
✓ Set Actions secret DATAVERSE_SERVER for labordynamicsinstitute/tutorial-preserving-survey
✓ Set Actions secret QUALTRICS_API_KEY for labordynamicsinstitute/tutorial-preserving-survey
✓ Set Actions secret DATAVERSE_DATASET_DOI for labordynamicsinstitute/tutorial-preserving-survey

Using secrets in GitHub Actions

In GitHub workflows, set your environment variables:

echo "QUALTRICS_API_KEY=${{ secrets.QUALTRICS_API_KEY }}" >> $GITHUB_ENV

Using secrets in GitHub Actions

R code does not need to be adapted!

# Same R code as before!
if (Sys.getenv("QUALTRICS_API_KEY") != "") {
  data.raw <- fetch_survey(surveyID = QUALTRICS_SURVEY, verbose = TRUE) 
} else {
  stop("Please set the QUALTRICS_API_KEY environment 
  variable to your API key.")
}

Checksums

Checksums can be used to demonstrate consistency

A checksum is a single value calculated from a data file that can be used to verify the integrity of the file.
The sha256 algorithm is commonly used for this purpose.

Checksums can be used to demonstrate consistency

In R, the package digest⁶ is available.

library(digest)
# Calculate the checksum of an R object (default algorithm: MD5)
digest(trees)

[1] "370a7132861fb520bd721d9bcbe008a4"

digest(trees,algo="sha256")

[1] "70823d6c0cebdc582b388f3ec56930bd2ec5dd176272032d949b93319d74d17b"
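The digest() calls above hash an R object, not a file. The sketch below, using made-up toy data, illustrates two points: a single changed value yields a different checksum, and the checksum of an object differs from the checksum of a file containing the same data (file = TRUE hashes the bytes on disk):

```r
library(digest)

# Checksum of an R object: digest() serializes the object, then hashes it
toy <- data.frame(id = 1:3, tabs = c(3, 11, 27))
sum.object <- digest(toy, algo = "sha256")

# Changing a single value changes the checksum
toy.edited <- toy
toy.edited$tabs[2] <- 12
sum.edited <- digest(toy.edited, algo = "sha256")

# Checksum of a file: file = TRUE hashes the bytes on disk, so the value
# can be compared with tools outside R that hash files
toy.file <- tempfile(fileext = ".csv")
write.csv(toy, toy.file, row.names = FALSE)
sum.file <- digest(toy.file, algo = "sha256", file = TRUE)

sum.object == sum.edited   # FALSE
sum.object == sum.file     # FALSE: object and file checksums differ
```

Which of the two to record depends on what needs to be verified later: the in-memory data as R sees it, or the exact file that was deposited.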

Adding checksums to the data download

if (Sys.getenv("QUALTRICS_API_KEY") != "") {
  data.raw <- suppressMessages(fetch_survey(surveyID = QUALTRICS_SURVEY, verbose = FALSE))
  data.raw.sha256 <- digest::digest(data.raw, algo = "sha256")
  message("Checksum of the downloaded data: ", data.raw.sha256)
  # Write checksum to a file
  writeLines(data.raw.sha256, file.path(metadatapath, "data.raw.sha256"))
} else {
  stop("Please set the QUALTRICS_API_KEY environment 
  variable to your API key.")
}

Checksum of the downloaded data: a0e2146acc752debff66a670fe654c7618c45bacf6b8c634e58a21d5999fd222

How does that help?

Subsequent downloads can verify that the download is the same as originally downloaded!

# Read the original checksum from file
original.sha256 <- readLines(file.path(metadatapath, "data.raw.sha256"))
message("Original checksum from file: ", original.sha256)

Original checksum from file: a0e2146acc752debff66a670fe654c7618c45bacf6b8c634e58a21d5999fd222

# Redownload data
if (Sys.getenv("QUALTRICS_API_KEY") != "") {
  data.raw <- suppressMessages(fetch_survey(surveyID = QUALTRICS_SURVEY, verbose = FALSE))
  data.raw.sha256 <- digest::digest(data.raw, algo = "sha256")
  message("Checksum of the downloaded data: ", data.raw.sha256)
}

Checksum of the downloaded data: a0e2146acc752debff66a670fe654c7618c45bacf6b8c634e58a21d5999fd222

How does that help?

Subsequent downloads can verify that the download is the same as originally downloaded!

# Compare the checksums
if (original.sha256 == data.raw.sha256) {         
  message("Checksums match! Data integrity verified.")
} else {
  warning("Checksums do NOT match! Data may have changed or been altered.")
}

Checksums match! Data integrity verified.

Preservation

Credibility of Survey Data

You run a study using the PSID. Do you trust the downloaded data?

Credibility of Government Data

You run a study using the PSID. Do you trust the downloaded data?
You use unemployment data for Angola through World Bank Data Bank. Do you trust the downloaded data?

Credibility of Government Data

You run a study using the PSID. Do you trust the downloaded data?
You use unemployment data for Argentina through World Bank Data Bank. Do you trust the downloaded data?

Credibility of Government Data

You run a study using the PSID. Do you trust the downloaded data?
You use unemployment data for United States through World Bank Data Bank. Do you trust the downloaded data?

Credibility of Researcher provided data

You run a study using the PSID. Do you trust the downloaded data?
You use inflation data for Argentina through a research deposit on Dataverse. Do you trust the downloaded data?

Let’s make you become a mini-PSID

Timing

Once you have registered your analysis plan - should the processing and analysis really change?

Once you have collected the data - is it really going to change?

Cycle

Modified Data and Workflow

Let’s consider the preservation part separately:

With reuse

Modified Data and Workflow

Proposal:

Preserve as you go

Modified

Modified Data and Workflow

Proposal:

Preserve as you go
Use what you preserve

Modified

Modified Data and Workflow

Proposal:

Preserve as you go
Use what you preserve

Note: Doubtful ethics of others…

I don’t want to be scooped!

Thus, I’m not going to publish my raw data just yet!

What is preservation

Preservation

Preservation != publication, != sharing
In fact, preservation may mean: not very accessible at all!
Preservation is intended to maintain data for tens, even hundreds of years
Preservation may involve curation: active transformation of the data for improved accessibility

Stacks

What is publication

Publication typically involves making information about the data, as well as the data themselves, available to others.

Publication can initially mean that only metadata (information about the data) is published
In some cases, it may be that only metadata is ever published
But the metadata will point to how to access the data, how long the data will be preserved, and other salient facts

This all seems so complicated

I need to preserve my data for decades!
I need to manage the application process for decades!
Where do I get that DOI thing?
How do I get Google to index my data?

Options for Preservation (1)

Trusted Repositories

Journals and institutions have assessed a number of trusted repositories:

CoreTrustSeal has a certification process
re3data.org lists research data repositories
Nature, F1000Research, and PLOS have lists of trusted repositories.
Always check with your journal for specific restrictions or suggestions.

Options for Preservation (2)

Trusted Repositories

These generally include at least the following:
- Dryad Digital Repository
- figshare
- Harvard Dataverse
- ICPSR and openICPSR
- Open Science Framework
- Zenodo
- Country or region-specific repositories (that nevertheless generally accept depositors from anywhere): GESIS (Germany), Swedish National Data Service (SND), EASY (Netherlands), CSIRO (Australia), etc.
Many universities have formal document repositories that may be able to assume such a role; talk to your (data) librarian

What are NOT options for preservation

GitHub, GitLab, Bitbucket, etc.
Dropbox, Box.com, Google Drive, etc.
Your personal website
Your university’s departmental website

404

404-gh

Options for Preservation

In one of my day jobs:

openicpsr

Options for Preservation with API

Dataverse https://demo.dataverse.org/dataverse/larstest

Also Zenodo https://zenodo.org

Getting started on Dataverse

We will NOT use the regular Dataverse; rather, we will test on the demo site.

This also works with Zenodo: https://sandbox.zenodo.org/
Check your URL bar! There’s often no other indication that this is not the real Zenodo or Dataverse!

A tutorial of sorts

Demo Dataverse for Lars https://demo.dataverse.org/dataverse/larstest

Remember the API tokens?

QUALTRICS_API_KEY='something here'
QUALTRICS_BASE_URL='url goes here'
DATAVERSE_TOKEN='token goes here'
DATAVERSE_SERVER='https://demo.dataverse.org'
DATAVERSE_DATASET_DOI='doi goes here'

We’re going to need the last three here!

Getting your API keys from Dataverse

Adding your API key to your `.Renviron`

QUALTRICS_API_KEY='something here'
QUALTRICS_BASE_URL='url goes here'
DATAVERSE_TOKEN='token goes here'
DATAVERSE_SERVER='https://demo.dataverse.org'
DATAVERSE_DATASET_DOI='doi goes here'

We need a “container”

Dataverse calls this a “dataset”.
A “dataset” can hold multiple files.
While this can be created via the API, I suggest doing it manually (once per project)

Fill in metadata

Uploading data to Dataverse

You could upload data manually, but this is about automation!
Now that the “container” is ready, we can upload data to it via the API.

Getting the Identifiers

QUALTRICS_API_KEY='something here'
QUALTRICS_BASE_URL='url goes here'
DATAVERSE_TOKEN='token goes here'
DATAVERSE_SERVER='https://demo.dataverse.org'
DATAVERSE_DATASET_DOI='doi goes here'

Additional controls

Licensing and permissions

Terms: For people who wish to download published files
Permissions: For fine-grained access over who can do what before publication

Permissions

You can designate

who can upload
who can edit metadata
who can publish

Terms and Licenses

Licenses are broad permissions on how to re-use
Often CC-BY, see https://creativecommons.org/cc-licenses/
Terms are more restrictive. Do third-party data users
- need to contact somebody?
- need to sign a data use agreement?
- need IRB approval?

Terms and Licenses

You can define custom terms (instead of a standard license)
Strongly suggest talking with University Counsel!

Back to practical matters

Uploading data to Dataverse via API

From terminal, with Python

python3 -m venv venv-dv
source venv-dv/bin/activate
source .Renviron
git clone https://github.com/larsvilhuber/dataverse-uploader
pip install -r dataverse-uploader/requirements.txt
python3 dataverse-uploader/dataverse.py \
   $DATAVERSE_TOKEN $DATAVERSE_SERVER \
   $DATAVERSE_DATASET_DOI . -d data/metadata

Results locally

Connecting to Dataverse server: https://demo.dataverse.org
Dataset DOI: doi:10.70122/FK2/EMAWKA
Dataset ID: 2695588
Found 0 existing files in dataset
Deleting 0 existing files...

Paths to upload: ['./data/metadata']

Scanning path: ./data/metadata
  Directory: ./data/metadata (contains 1 files)
    Uploading [1]: ./data/metadata/data.raw.sha256
      Filename: data.raw.sha256
      Directory label: './data/metadata'
      Response status: 200

Total files uploaded: 1

Done!

Results remotely

Filename is preserved
Pathname is preserved! data/metadata
MD5 checksum is also present! (less useful for the checksum file!)
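For those who prefer to stay in R, the same kind of upload might be sketched with the httr package against the Dataverse native API. This is not the uploader used above, and it rests on assumptions: the function names are made up, httr must be installed, and the three DATAVERSE_* environment variables must be set. The actual request is left commented out:

```r
# Sketch only: upload one file to a Dataverse dataset through the native API.
# Build the "add file" endpoint for a dataset identified by its DOI.
dataverse_add_url <- function(server, doi) {
  paste0(sub("/+$", "", server),
         "/api/datasets/:persistentId/add?persistentId=", doi)
}

# Hypothetical wrapper around the upload request (requires httr when called)
dataverse_upload <- function(path, directory_label,
                             server = Sys.getenv("DATAVERSE_SERVER"),
                             doi = Sys.getenv("DATAVERSE_DATASET_DOI"),
                             token = Sys.getenv("DATAVERSE_TOKEN")) {
  stopifnot(file.exists(path), nzchar(server), nzchar(doi), nzchar(token))
  # directoryLabel keeps the folder structure (e.g. data/metadata) in the deposit
  json <- sprintf('{"directoryLabel":"%s"}', directory_label)
  httr::POST(dataverse_add_url(server, doi),
             httr::add_headers("X-Dataverse-key" = token),
             body = list(file = httr::upload_file(path), jsonData = json),
             encode = "multipart")
}

# Not run here: requires a valid token and an existing dataset
# resp <- dataverse_upload("data/metadata/data.raw.sha256", "data/metadata")
# httr::status_code(resp)

# The endpoint for the demo dataset shown in the log above
dataverse_add_url("https://demo.dataverse.org/", "doi:10.70122/FK2/EMAWKA")
```

Unlike the Python uploader, this sketch handles a single file and does not delete or replace files already in the dataset.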

Automatically from GitHub Actions

Code for uploading automatically

Voilà!

We have a workflow that can automatically download from Qualtrics, and in the same move, upload to Dataverse!

Possible improvements:

immediately publish (no human intervention)
upload the download logs and checksums together with the data
run on a schedule

Putting it all together

We already downloaded from the API, and have created a checksum for the raw data.
Let’s clean the data, and save local copies
Then upload the publishable data to Dataverse, and add metadata.

Cleaning the data

data.confidential <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
           ResponseId,consent,age_1,gender,education,
           num_tabs_1,name_confidential,number_confidential)
data.clean <- data.confidential |>
  select(-name_confidential, -number_confidential)

Save files

# save files in their locations
data.confidential.file <-
    file.path(confdatapath,"confidential_data.rds")
data.clean.file <-
    file.path(cleandatapath,"clean_data.rds")
saveRDS(data.confidential, 
        data.confidential.file)
saveRDS(data.clean, 
        data.clean.file)

… and create checksums

# Calculate checksums for the saved files
confidential_checksum <- 
     digest::digest(data.confidential.file, 
                    algo = "sha256", 
                    file = TRUE)
clean_checksum <- 
     digest::digest(data.clean.file, 
                    algo = "sha256", 
                    file = TRUE)
# Write checksums to files
writeLines(confidential_checksum, 
           file.path(metadatapath, "data.confidential.sha256"))
writeLines(clean_checksum, 
           file.path(metadatapath, "data.clean.sha256"))
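
A checksum is only useful if somebody checks it. Here is a minimal Python sketch of how a third party could verify a file against its recorded digest. It assumes the `.sha256` file holds just the hex digest on one line, as `writeLines()` produces above; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file's bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_checksum(data_file, checksum_file):
    """True if data_file matches the digest stored in checksum_file.

    Assumes checksum_file contains only the hex digest (plus a newline).
    """
    recorded = Path(checksum_file).read_text().strip().lower()
    return sha256_of_file(data_file) == recorded
```

For example, `verify_checksum("data/clean/clean_data.rds", "data/metadata/data.clean.sha256")` should return `True` as long as the file has not changed since the checksum was written.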

Analysis

So here are the results so far (2026-06-18):

gender	Frequency	Percent
Male	5	45.45
Female	5	45.45
NA	1	9.09

By Education

education	Frequency	Percent
Secondary or less	1	9.09
Master’s degree	5	45.45
Professional or doctoral degree	4	36.36
NA	1	9.09

Age

Statistic	Value
Count	10.00
Mean	33.40
Median	28.50
Min	25.00
Max	62.00
Std. Dev.	11.48

Number of tabs open

Statistic	Value
Count	10.00
Mean	15.80
Median	16.00
Min	2.00
Max	27.00
Std. Dev.	9.19

State of the data directory

fs::dir_tree(datapath)

/home/runner/work/tutorial-preserving-survey/tutorial-preserving-survey/data
├── clean
│   └── clean_data.rds
├── confidential
│   └── confidential_data.rds
├── metadata
│   ├── data.clean.sha256
│   ├── data.confidential.sha256
│   └── data.raw.sha256
├── raw-confidential
│   ├── README.md
│   └── Testing+preservation_June+16,+2026_15.35.csv
├── tutorial-survey.csv
└── tutorial-survey.rds

Uploading to Dataverse

# System setup
# Need: apt install python3.10-venv
python3 -m venv venv-dv
source venv-dv/bin/activate
git clone https://github.com/larsvilhuber/dataverse-uploader
pip install -r dataverse-uploader/requirements.txt

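# Do the uploads
# First call: upload the checksums. As in the log shown earlier,
# the uploader appears to delete existing files first by default.
python3 dataverse-uploader/dataverse.py \
   "$DATAVERSE_TOKEN" "$DATAVERSE_SERVER" \
   "$DATAVERSE_DATASET_DOI" . \
   -d data/metadata
# Second call: add the clean data, keeping the files already uploaded
python3 dataverse-uploader/dataverse.py \
   "$DATAVERSE_TOKEN" "$DATAVERSE_SERVER" \
   "$DATAVERSE_DATASET_DOI" . \
   -d data/clean \
   --remove false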

What is next in this space?

Using 3rd-party trusted systems

A sketch: Transparency Certified

https://transparency-certified.github.io/

Work in progress

Working with cascad, several INEXDA members, World Bank, various RDCs
Relying on external certification of data inputs (data catalogs with metadata, checksums)

Work in progress

SIVACOR: Scalable Infrastructure for Validation of Computational Social Science Research⁷

Does it prevent all fraud?

Does not prevent all fraud

Toronto researcher loses Ph.D.

MIT student makes up firm data

Back to Gino

A transparent, automated pipeline makes it much harder to manipulate data after collection, before analysis — exactly the Gino failure mode.
But it does not prevent fabricating data at the source, or other forms of misconduct.
Transparency and preservation raise the cost of fraud and the odds of detection — they are not a silver bullet.

The end! Thanks for your attention.

Footnotes

1. https://datacolada.org/109, https://datacolada.org/110, https://datacolada.org/111, https://datacolada.org/112, https://datacolada.org/114, https://datacolada.org/118
2. Jones, M. (2024). Introducing Reproducible Research Standards at the World Bank. Harvard Data Science Review, 6(4). https://doi.org/10.1162/99608f92.21328ce3
3. See my tutorial on handling of confidential data and reproducibility
4. Ginn J, O’Brien J, Silge J (2024). qualtRics: Download ‘Qualtrics’ Survey Data. R package version 3.2.1, https://github.com/ropensci/qualtRics, https://docs.ropensci.org/qualtRics/.
5. Add .Renviron to your .gitignore file to prevent it from being tracked by Git and accidentally pushed to GitHub.
6. Eddelbuettel D (2024). digest: Create Compact Hash Digests of R Objects. R package version 0.6.37, https://dirk.eddelbuettel.com/code/digest.html, https://github.com/eddelbuettel/digest.