Preserving Survey Data

Lars Vilhuber
Laurel Krovetz

2026-01-14

Quick tutorial on preserving survey data

  • In this presentation we’ll show you how to process and preserve survey data, in an automated and transparent fashion.

  • We’ll use an API to retrieve the data, show you how to clean and strip the data of confidential information and non-consenting responses, and use another API to preserve the data.

Quick tutorial on preserving survey data

We’ve created a short survey for demonstration that you can fill out if you want to contribute your own responses.

Take our survey

Goals

Some notes

There is not much in this tutorial that requires Qualtrics.

  • You could do this with SurveyCTO, LimeSurvey, or any other system that has an API.
  • You could do this with Google Forms, if you have linked it to a Google Sheet.
  • You could do this with a lab experiment system that stores data in an SQL database.

It is important to remove any PII or confidential information as soon as possible.

  • That may not always be feasible. For instance, if you need geolocation to merge in contextual data, or compute distances, then some data processing may unavoidably require access to sensitive data.
  • But any data that is not needed should be removed early on.
  • This is not irreversible: if you later find that you need more data elements, you can always re-process the raw data, stored on Qualtrics, until your IRB requires that you delete those data.

It is important to distinguish

  • preserving data from
  • publishing data, and possibly
  • sharing data with collaborators

Preservation

  • Preservation != publication, != sharing
  • In fact, preservation may mean: not very accessible at all!
  • Preservation is intended to maintain data for tens, even hundreds of years
    • Preservation may involve curation: active transformation of the data for improved accessibility

Opening up technical possibilities

How can we know that a data source is reliably obtained?

Consider the case of Gino

Francesca Gino

The case of Gino

  • Francesca Gino was a tenured professor at Harvard Business School, writing on honesty (!)

The case of Gino

  • Several articles were investigated by third parties (Data Colada in particular [1]), and found to be problematic

Data manipulated

The case of Gino

  • At least one of them had manipulated data AFTER it had been collected, BEFORE it had been analyzed.

Data manipulation

Results of manipulation

Generic survey processing

Generic survey processing

Generic survey processing

Generic survey processing

Requiring transparency in academia

Generic survey processing

Verifying transparency in academia

Generic survey processing

Verification by journals

  • Provision (publication of materials) provides transparency
  • Verification (running the analysis again - computational reproducibility) compensates for mistrust/absence of trust

Which journals again

Verification by others

cascad

I4R

Verification by institutions

World Bank RRR

Taking it a step further

Survey flow

Taking it a step further

  • Has been discussed by authors behind Data Colada
  • Survey tool provider (Qualtrics, etc.) exports data, posts checksum (the checksum idea is sketched below)
  • Survey tool provider exports data directly into a trusted repository at the institution; researchers obtain the data from there (with privacy protections)
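
A minimal sketch of the checksum idea in R, assuming the raw export has been saved to a file (the path and file names are illustrative); base R's tools::md5sum() computes the fingerprint that could be posted or deposited:

# Assumption: the raw Qualtrics export was saved to this (illustrative) path
raw_file <- file.path("data", "raw", "qualtrics_export.csv")

# Compute an MD5 fingerprint of the file with base R (tools package)
checksum <- tools::md5sum(raw_file)
print(checksum)

# Record the checksum alongside the data so third parties can verify the file later
writeLines(paste(checksum, basename(raw_file)),
           file.path("data", "raw", "CHECKSUMS.md5"))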

Does not prevent all fraud

How to document the full process?

Survey flow

A sketch: Transparency Certified

https://transparency-certified.github.io/

Transparency Certified

Work in progress

  • Working with cascad, several INEXDA members, and others
  • Relying on external certification of data inputs (data catalogs with metadata, checksums)

Creating a survey in Qualtrics

Creating a survey

This is not a full Qualtrics tutorial!

Creating a survey in Qualtrics

You’ll typically have access to a Qualtrics account through your university or organization. Then it is easy to construct a survey using the web tool.

Survey responses in Qualtrics

Responses can be easily checked at a glance in the Data & Analysis tab.

  • You can download data directly from this page
  • But we will NOT do that here; instead, we will use a programmatic approach.

Why?

  • If you do this only once, downloading manually is fine.
  • Do it 2-3 times, you want to program it!

Side-note: Survey definition from Qualtrics

You should not forget to preserve your survey definition!

  • Download a .qsf file (via Tools in Qualtrics) to save and transfer the survey structure and to keep a backup survey template (a programmatic complement is sketched below).
  • You can also export the survey as a Word document (via Tools in Qualtrics) if you want a well-formatted, human-readable version.
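
A related sketch: the qualtRics package can also retrieve the survey's metadata and question definitions through the API, which complements (but does not replace) the .qsf export:

library(qualtRics)

# Fetch survey metadata and question definitions via the API.
# QUALTRICS_SURVEY is the survey ID (defined in the API section below).
survey_meta <- metadata(QUALTRICS_SURVEY)
str(survey_meta, max.level = 1)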

Side-note to side-note: Confidential data in survey definitions!

It is possible that your survey definition itself contains information that you are not allowed to publish:

  • You might be running the survey with a firm, and the firm does not want to be identified
  • You are asking questions about specific products, and the product names are confidential

It is actually hard to de-identify a qsf file. We will not try to do this here, but you should be aware of this issue.

Process workflow

Generic survey processing

Certified survey processing

Processing data in R

Using an API to obtain data

Using APIs

  • An API is a mechanism that enables two software components to communicate with each other
  • APIs can be used to request data or services and get responses without needing to know how the other program works internally (see the sketch below)
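
For illustration, here is what a raw call to such an API could look like from R, using the httr package (a sketch: the /API/v3/surveys endpoint and X-API-TOKEN header follow the Qualtrics v3 REST API; the qualtRics package used later wraps calls like this):

library(httr)

# Assumptions: QUALTRICS_BASE_URL holds your datacenter host, e.g.
# "yourdatacenter.qualtrics.com", and QUALTRICS_API_KEY holds your token
resp <- GET(
  url = paste0("https://", Sys.getenv("QUALTRICS_BASE_URL"), "/API/v3/surveys"),
  add_headers(`X-API-TOKEN` = Sys.getenv("QUALTRICS_API_KEY"))
)

# The JSON response lists the surveys available to this account
str(content(resp, as = "parsed"), max.level = 2)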

Loading data from Qualtrics using an API

To always analyze the most up-to-date survey responses, we load the data directly from the web using the Qualtrics API. We need a few pieces of information.

These parts are public. In fact, the window of time may be important for credibility.

# qualtrics URL components
QUALTRICS_FULL_URL <- "first part of survey URL"

QUALTRICS_SURVEY <- "second part of survey URL, usually starts with SV"

# Keep only responses in the desired window of time (ymd_hms is from lubridate)
library(lubridate)
QUALTRICS_STIME <- ymd_hms("2025-07-01 00:00:01")
QUALTRICS_ETIME <- ymd_hms("2025-08-26 23:59:00")

Loading data from Qualtrics using API

An API token is assigned to your Qualtrics account and is used to request data from a survey.

However, this token is a secret, and you don't want it appearing in your published code!

Solutions

  • Set it manually:
Sys.setenv(QUALTRICS_API_KEY = "your-token")
  • Set it using environment variables stored outside your code (e.g., in an .Renviron file)
    • That is how we do it for this presentation during development!
QUALTRICS_API_KEY="your-token"
  • We can also push these “secrets” to GitHub Secrets and load them in GitHub Actions
    • This is how we do it in this tutorial - see source code!
# Here environment variables are read from .Renviron
QUALTRICS_API_KEY <- Sys.getenv("QUALTRICS_API_KEY")

Loading data from Qualtrics using API

After setting the API token, we use it to pull the data from the survey server

library(qualtRics)

if (Sys.getenv("QUALTRICS_API_KEY") != "") {
  data.raw <- fetch_survey(surveyID = QUALTRICS_SURVEY, verbose = TRUE)
} else {
  stop("Please set the QUALTRICS_API_KEY environment variable to your API key.")
}
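
The qualtRics package reads these environment variables; if they are not picked up automatically, its credential helper can register them for the session (a sketch; QUALTRICS_BASE_URL is assumed to hold your datacenter URL, as in the .Renviron example later):

library(qualtRics)

# Register the Qualtrics credentials for this session (assumes the
# QUALTRICS_API_KEY and QUALTRICS_BASE_URL environment variables are set)
qualtrics_api_credentials(
  api_key  = Sys.getenv("QUALTRICS_API_KEY"),
  base_url = Sys.getenv("QUALTRICS_BASE_URL")
)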

Cleaning data

  • We filter the data to only include those who consented
  • We remove survey preview responses
  • (Optionally) remove responses that took place outside the relevant window.
  • Remove confidential data (variables name_confidential and number_confidential in our survey, for example).
library(dplyr)

data <- data.raw |>
  filter(consent == "Yes") |>
  filter(Status != "Survey Preview") |>
  filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
  select(StartDate, EndDate, Status, Finished, RecordedDate,
         ResponseId, consent, age_1, gender, education,
         num_tabs_1, name_confidential, number_confidential)

clean_data <- data |>
  select(-name_confidential, -number_confidential)

Cleaning data by selection

We could also simply not ever select the confidential data if we don’t actually need it.

 clean_data <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
    ResponseId,consent,age_1,gender,education,num_tabs_1)

Using confidential data

We could also (hypothetically) immediately compute variables that rely on confidential data.

 clean_data <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
    ResponseId,consent,age_1,gender,education,num_tabs_1,
           gps_lat, gps_lon) |>
    mutate(distance = compute_distance_from_cornell(
    gps_lat,gps_lon,precision="100m")) |>
    select(-gps_lat, -gps_lon)  
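
The compute_distance_from_cornell() call above is a placeholder, not a real package function. A minimal sketch of what such a helper could look like, using the haversine formula (the Cornell coordinates and the handling of precision are assumptions for illustration):

# Hypothetical helper: great-circle distance (in km) from Cornell's campus.
compute_distance_from_cornell <- function(lat, lon, precision = "100m") {
  cornell_lat <- 42.447    # approximate campus coordinates
  cornell_lon <- -76.483
  to_rad <- function(deg) deg * pi / 180
  dlat <- to_rad(lat - cornell_lat)
  dlon <- to_rad(lon - cornell_lon)
  a <- sin(dlat / 2)^2 +
       cos(to_rad(cornell_lat)) * cos(to_rad(lat)) * sin(dlon / 2)^2
  d_km <- 6371 * 2 * asin(pmin(1, sqrt(a)))
  # "precision" is illustrative: 100m corresponds to one decimal place in km
  if (precision == "100m") round(d_km, 1) else d_km
}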

Saving confidential and clean data

Lastly, we save the confidential cleaned data (if we need it) to a folder that is not published, and the cleaned, publishable data to the relevant public folder.

# save confidential data NOT for publishing, if needed
write.csv(data, file.path(confdata, "confidential_data.csv"),
          row.names = FALSE)

# saving clean data for publishing
write.csv(clean_data, file.path(publicdata, "clean_data.csv"),
          row.names = FALSE)

Descriptive statistics

Now you are ready to publish and use your cleaned data for reproducible analyses!

Survey results

Now, let’s go look at our survey results.

QUALTRICS_STIME=2026-01-12 00:00:01
QUALTRICS_ETIME=2026-02-14 23:59:00


── Column specification ────────────────────────────────────────────────────────
cols(
  StartDate = col_datetime(format = ""),
  EndDate = col_datetime(format = ""),
  Status = col_character(),
  Progress = col_double(),
  `Duration (in seconds)` = col_double(),
  Finished = col_logical(),
  RecordedDate = col_datetime(format = ""),
  ResponseId = col_character(),
  DistributionChannel = col_character(),
  UserLanguage = col_character(),
  consent = col_character(),
  age_1 = col_double(),
  gender = col_character(),
  education = col_character(),
  num_tabs_1 = col_double(),
  name_confidential = col_character(),
  number_confidential = col_character()
)
'StartDate', 'EndDate', and 'RecordedDate' were converted without a specific timezone
• To set a timezone, visit https://www.qualtrics.com/support/survey-platform/managing-your-account/
• Timezone information is under 'User Settings'
• See https://api.qualtrics.com/instructions/docs/Instructions/dates-and-times.md for more
gender                            Frequency  Percent
Male                                      5       50
Female                                    5       50

education                         Frequency  Percent
Secondary or less                         1       10
Master’s degree                           5       50
Professional or doctoral degree           4       40

Age

Statistic    Value
Count         9.00
Mean         34.11
Median       30.00
Min          25.00
Max          62.00
Std. Dev.    11.94

Number of tabs open

Statistic    Value
Count        10.00
Mean         15.80
Median       16.00
Min           2.00
Max          27.00
Std. Dev.     9.19

Keeping your secrets secret

Secrets

  • You will want to keep your API key safe using GitHub secrets.

  • Secrets allow you to store sensitive information in your repository environment. You create secrets to use in GitHub Actions workflows.

  • To make a secret available to an action, you must set the secret as an environment variable in your GitHub workflow file.

Storing secrets in .Renviron locally

You can store your Qualtrics secrets in an .Renviron file, kept in the root of your project, that contains the following information (fill in the true values):

QUALTRICS_API_KEY='something here'
QUALTRICS_BASE_URL='url goes here'
DATAVERSE_TOKEN='token goes here'
DATAVERSE_SERVER='https://demo.dataverse.org'
DATAVERSE_DATASET_DOI='doi goes here'

Do not publish this file!

Storing secrets in GitHub

You can use the .Renviron file to set the GitHub Actions secrets with

gh secret set -f .Renviron

instead of using the web interface! (You need the GitHub CLI)

Storing secrets in GitHub

Then in GitHub workflows you can set your environment variables to be used in your code, such as the API key:

echo "QUALTRICS_API_KEY=${{ secrets.QUALTRICS_API_KEY }}" >> $GITHUB_ENV

In your publishable R code, you can then simply refer to the environment variable QUALTRICS_API_KEY, without ever giving away your API token.

You can check that the API key is set using the following code in R:

message(Sys.getenv("QUALTRICS_API_KEY"))

But don’t include this in your published output (i.e., slides like these!)
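
A safer alternative (a minimal sketch) only reports whether the key is set, without echoing its value, and can therefore remain in rendered output:

# Check that the key is set without printing its value
if (nchar(Sys.getenv("QUALTRICS_API_KEY")) == 0) {
  warning("QUALTRICS_API_KEY is not set.")
} else {
  message("QUALTRICS_API_KEY is set.")
}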

Archiving

Overview

Timing

Consider the following questions:

Once you have collected the data - is it really going to change?

Once you have registered your analysis plan - should the processing and analysis really change?

Cycle

Modified Data and Workflow

Let’s consider the preservation part separately:

With reuse

Modified Data and Workflow

Preserve as you go

Modified

Note: Doubtful ethics of others…

I don’t want to be scooped!

Thus, I’m not going to publish my raw data just yet!

What is preservation

Preservation

  • Preservation != publication, != sharing
  • In fact, preservation may mean: not very accessible at all!
  • Preservation is intended to maintain data for tens, even hundreds of years
    • Preservation may involve curation: active transformation of the data for improved accessibility

Stacks

What is publication

Publication typically involves making information about the data, as well as the data themselves, available to others.

  • Publication can initially mean that only metadata (information about the data) is published
  • In some cases, it may be that only metadata is ever published
  • But the metadata will point to how to access the data, how long the data will be preserved, and other salient facts

This all seems so complicated

  • I need to preserve my data for decades!
  • I need to manage the application process for decades!
  • Where do I get that DOI thing?
  • How do I get Google to index my data?

Let’s start

scan

Options for Preservation (1)

Trusted Repositories

Journals and institutions have assessed a number of trusted repositories:

Options for Preservation (2)

Trusted Repositories

What are NOT options for preservation

  • GitHub, GitLab, Bitbucket, etc.
  • Dropbox, Box.com, Google Drive, etc.
  • Your personal website
  • Your university’s departmental website

404

404-gh

Options for Preservation

Here: Demo Dataverse for Lars https://demo.dataverse.org/dataverse/larstest

In one of my day jobs:

openicpsr

Getting started on Dataverse

We will NOT use the regular Dataverse; rather, we will test on the demo site.

  • This also works with Zenodo: https://sandbox.zenodo.org/
  • Check your URL bar! There’s often no other indication that this is not the real Zenodo or Dataverse!

A tutorial of sorts

Remember the API tokens?

QUALTRICS_API_KEY='something here'
QUALTRICS_BASE_URL='url goes here'
DATAVERSE_TOKEN='token goes here'
DATAVERSE_SERVER='https://demo.dataverse.org'
DATAVERSE_DATASET_DOI='doi goes here'

We’re going to need the last three here!

Uploading data to Dataverse

  • From terminal:
# create and activate a Python virtual environment
python3 -m venv venv-dv
source venv-dv/bin/activate
# read the Dataverse credentials from .Renviron into the shell
source .Renviron
# get the uploader script and its dependencies
git clone https://github.com/larsvilhuber/dataverse-uploader
pip install -r dataverse-uploader/requirements.txt
# upload to the Dataverse dataset (see the uploader's README for argument details)
python3 dataverse-uploader/dataverse.py \
   $DATAVERSE_TOKEN $DATAVERSE_SERVER \
   $DATAVERSE_DATASET_DOI . -d data
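
As an alternative sketch from within R, the dataverse package's add_dataset_file() can deposit individual files into the same dataset (this assumes the DATAVERSE_* environment variables from the .Renviron file above, and that publicdata points to the folder with the cleaned data):

library(dataverse)

# Upload the cleaned, publishable file to the (demo) Dataverse dataset.
# Note: the dataverse package expects the server host name,
# e.g. "demo.dataverse.org", so we strip any https:// prefix.
add_dataset_file(
  file    = file.path(publicdata, "clean_data.csv"),
  dataset = Sys.getenv("DATAVERSE_DATASET_DOI"),
  key     = Sys.getenv("DATAVERSE_TOKEN"),
  server  = sub("^https?://", "", Sys.getenv("DATAVERSE_SERVER"))
)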

Automatically from Github Actions

Voilà!

We have a workflow that can automatically download from Qualtrics, and in the same move, upload to Dataverse!

Possible improvements:

The end! Thanks for your attention.

Footnotes

  1. https://datacolada.org/109, https://datacolada.org/110, https://datacolada.org/111, https://datacolada.org/112, https://datacolada.org/114, https://datacolada.org/118

  2. Jones, M. (2024). Introducing Reproducible Research Standards at the World Bank. Harvard Data Science Review, 6(4). https://doi.org/10.1162/99608f92.21328ce3