
2026-01-14
In this presentation we’ll show you how to process and preserve survey data in an automated and transparent fashion.
We’ll use an API to retrieve the data, show you how to clean and strip the data of confidential information and non-consenting responses, and use another API to preserve the data.
We’ve created a short survey for demonstration that you can fill out if you want to contribute your own responses.

There is not much in this tutorial that requires Qualtrics.
It is important to remove any PII or confidential information as soon as possible.
It is important to distinguish
Francesca Gino











Toronto researcher loses Ph.D.

MIT student makes up firm data


You’ll typically have access to a Qualtrics account through your university or organization. Then it is easy to construct a survey using the web tool.
Responses can be easily checked at a glance in the Data and Analytics tab.
Why?
You should not forget to preserve your survey definition!
A .qsf file lets you save and transfer the survey structure, giving you a backup survey template. Export it as .qsf under Tools in Qualtrics; choose this option to get a well-formatted document. Note that your survey definition itself may contain information that you are not allowed to publish:
It is actually hard to de-identify a .qsf file. We will not try to do this here, but you should be aware of this issue.


In order to always be analyzing the most up-to-date survey responses, load the data directly from the web using the Qualtrics API. We need a few pieces of information.
These parts are public. In fact, the window of time may be important for credibility.
An API token is assigned to your Qualtrics account and is used to request data from a survey.
However, this token is your secret token and you don’t want this appearing in your published code!
Solutions
Keep the token out of your code, for example by storing it in an .Renviron file.
After setting the API token, we use it to pull the data from the survey server
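A minimal sketch of this pull, assuming the qualtRics package; the survey ID is a placeholder, and credentials are read from environment variables rather than hard-coded:

```r
library(qualtRics)

# Register credentials from environment variables (e.g. set via .Renviron),
# so the secret token never appears in the published code
qualtrics_api_credentials(
  api_key  = Sys.getenv("QUALTRICS_API_KEY"),
  base_url = Sys.getenv("QUALTRICS_BASE_URL")
)

# Pull the latest responses for the survey (placeholder ID);
# force_request = TRUE bypasses the local cache
data.raw <- fetch_survey(surveyID = "SV_xxxxxxxxxxxxxxx", force_request = TRUE)
```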
We then drop non-consenting and preview responses, restrict to the collection window, and select only the variables we need, giving confidential variables recognizable names (name_confidential and number_confidential in our survey, for example).

```r
data <- data.raw |>
  filter(consent == "Yes") |>
  filter(Status != "Survey Preview") |>
  filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
  select(StartDate, EndDate, Status, Finished, RecordedDate,
         ResponseId, consent, age_1, gender, education,
         num_tabs_1, name_confidential, number_confidential)
```
```r
clean_data <- data |>
  select(-name_confidential, -number_confidential)
```

We could also simply never select the confidential data if we don’t actually need it.
We could also (hypothetically) immediately compute variables that rely on confidential data:
```r
clean_data <- data.raw |>
  filter(consent == "Yes") |>
  filter(Status != "Survey Preview") |>
  filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
  select(StartDate, EndDate, Status, Finished, RecordedDate,
         ResponseId, consent, age_1, gender, education, num_tabs_1,
         gps_lat, gps_lon) |>
  mutate(distance = compute_distance_from_cornell(
    gps_lat, gps_lon, precision = "100m")) |>
  select(-gps_lat, -gps_lon)
```

Lastly, we need to save the confidential cleaned data that we don’t publish to one folder (if we need it), and save the cleaned, publishable data to the relevant folder.
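A sketch of that save step; the folder names here are assumptions:

```r
library(readr)

# Confidential version stays in a folder that is never published
write_csv(data, "data-confidential/survey_confidential.csv")

# Cleaned, publishable version goes to the public data folder
write_csv(clean_data, "data/survey_clean.csv")
```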
Now you are ready to publish and use your cleaned data for reproducible analyses!
Now, let’s go look at our survey results.
```
QUALTRICS_STIME=2026-01-12 00:00:01
QUALTRICS_ETIME=2026-02-14 23:59:00
```
```
── Column specification ────────────────────────────────────────────────────────
cols(
  StartDate = col_datetime(format = ""),
  EndDate = col_datetime(format = ""),
  Status = col_character(),
  Progress = col_double(),
  `Duration (in seconds)` = col_double(),
  Finished = col_logical(),
  RecordedDate = col_datetime(format = ""),
  ResponseId = col_character(),
  DistributionChannel = col_character(),
  UserLanguage = col_character(),
  consent = col_character(),
  age_1 = col_double(),
  gender = col_character(),
  education = col_character(),
  num_tabs_1 = col_double(),
  name_confidential = col_character(),
  number_confidential = col_character()
)
```
'StartDate', 'EndDate', and 'RecordedDate' were converted without a specific timezone
• To set a timezone, visit https://www.qualtrics.com/support/survey-platform/managing-your-account/
• Timezone information is under 'User Settings'
• See https://api.qualtrics.com/instructions/docs/Instructions/dates-and-times.md for more
| gender | Frequency | Percent |
|---|---|---|
| Male | 5 | 50 |
| Female | 5 | 50 |

| education | Frequency | Percent |
|---|---|---|
| Secondary or less | 1 | 10 |
| Master’s degree | 5 | 50 |
| Professional or doctoral degree | 4 | 40 |

| Statistic (age_1) | Value |
|---|---|
| Count | 9.00 |
| Mean | 34.11 |
| Median | 30.00 |
| Min | 25.00 |
| Max | 62.00 |
| Std. Dev. | 11.94 |

| Statistic (num_tabs_1) | Value |
|---|---|
| Count | 10.00 |
| Mean | 15.80 |
| Median | 16.00 |
| Min | 2.00 |
| Max | 27.00 |
| Std. Dev. | 9.19 |
You will want to keep your API key safe using GitHub secrets.
Secrets allow you to store sensitive information in your repository environment. You create secrets to use in GitHub Actions workflows.
To make a secret available to an action, you must set the secret as an environment variable in your GitHub workflow file.
.Renviron locally

You can store your Qualtrics secrets in an .Renviron file that you keep in the root of your project, containing the following information (fill in the true values):
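A template for that file; the API key and base URL shown here are placeholders:

```
QUALTRICS_API_KEY=your-secret-api-token
QUALTRICS_BASE_URL=yourorganization.qualtrics.com
QUALTRICS_STIME=2026-01-12 00:00:01
QUALTRICS_ETIME=2026-02-14 23:59:00
```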
Do not publish this file!
You can use the .Renviron file to set the GitHub Actions secrets from the command line instead of using the web interface! (You need the GitHub CLI.)
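For example, using the GitHub CLI's gh secret set command with its --env-file option:

```shell
# Load secret names and values from the dotenv-formatted .Renviron file
gh secret set --env-file .Renviron
```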
Then in GitHub workflows you can set your environment variables to be used in your code, such as the API key:
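A sketch of such a workflow step; the step name and script path are assumptions:

```yaml
# .github/workflows/fetch-survey.yml (fragment)
- name: Process survey data
  run: Rscript process_survey.R
  env:
    QUALTRICS_API_KEY: ${{ secrets.QUALTRICS_API_KEY }}
    QUALTRICS_BASE_URL: ${{ secrets.QUALTRICS_BASE_URL }}
```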
So in your publishable R code you can simply read the key with Sys.getenv("QUALTRICS_API_KEY"), and the token itself never appears in your code.
You can check that the API key is set using the following code in R:
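A check along these lines; nzchar returns FALSE for an unset or empty variable, without printing the key itself:

```r
# TRUE if the API key environment variable is set and non-empty;
# this never reveals the key's value
nzchar(Sys.getenv("QUALTRICS_API_KEY"))
```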
But don’t include this in your published output (i.e., slides like these!)
Consider the following questions:
Once you have collected the data - is it really going to change?
Once you have registered your analysis plan - should the processing and analysis really change?

Let’s consider the preservation part separately:

Preserve as you go

I don’t want to be scooped!
Thus, I’m not going to publish my raw data just yet!

Publication typically involves making information about the data, as well as the data themselves, available to others.

Trusted Repositories
Journals and institutions have assessed a number of trusted repositories:
These generally include at least the following:
Many universities have formal document repositories that may be able to assume such a role; talk to your (data) librarian


Here: Demo Dataverse for Lars https://demo.dataverse.org/dataverse/larstest

In one of my day jobs:

We will NOT use the regular Dataverse; rather, we will test on the demo site.
We’re going to need the last three here!
We have a workflow that can automatically download from Qualtrics, and in the same move, upload to Dataverse!
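A minimal sketch of the upload half, assuming the dataverse R package and the demo server; the dataset DOI is a placeholder:

```r
library(dataverse)

# Point the package at the demo server; DATAVERSE_KEY is assumed to be
# set in the environment (e.g. as a GitHub Actions secret)
Sys.setenv(DATAVERSE_SERVER = "demo.dataverse.org")

# Add the cleaned, publishable file to an existing dataset (placeholder DOI)
add_dataset_file(
  file    = "data/survey_clean.csv",
  dataset = "doi:10.70122/FK2/XXXXXX"
)
```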
Possible improvements:
https://datacolada.org/109, https://datacolada.org/110, https://datacolada.org/111, https://datacolada.org/112, https://datacolada.org/114, https://datacolada.org/118
Jones, M. (2024). Introducing Reproducible Research Standards at the World Bank. Harvard Data Science Review, 6(4). https://doi.org/10.1162/99608f92.21328ce3
