Preserving Survey Data

Lars Vilhuber
Laurel Krovetz

2026-01-14

Quick tutorial on preserving survey data

  • In this presentation we’ll show you how to process and preserve survey data, in an automated and transparent fashion.

  • We’ll use an API to retrieve the data, show you how to clean and strip the data of confidential information and non-consenting responses, and use another API to preserve the data.

Quick tutorial on preserving survey data

We’ve created a short survey for demonstration that you can fill out if you want to contribute your own responses.

Take our survey

Goals

Some notes

There is not much in this tutorial that requires Qualtrics.

  • You could do this with SurveyCTO, LimeSurvey, or any other system that has an API.
  • You could do this with Google Forms, if you have linked it to a Google Sheet.
  • You could do this with a lab experiment system that stores data in an SQL database.

It is important to remove any PII or confidential information as soon as possible.

  • That may not always be feasible. For instance, if you need geolocation to merge in contextual data, or compute distances, then some data processing may unavoidably require access to sensitive data.
  • But any data that is not needed should be removed early on.
  • This is not irreversible: if you later find that you need more data elements, you can always re-process the raw data, stored on Qualtrics, until your IRB requires that you delete those data.

It is important to distinguish

  • preserving data from
  • publishing data, and possibly
  • sharing data with collaborators

Preservation

  • Preservation != publication, != sharing
  • In fact, preservation may mean: not very accessible at all!
  • Preservation is intended to maintain data for tens, even hundreds of years
    • Preservation may involve curation: active transformation of the data for improved accessibility

Opening up technical possibilities

How can we know that a data source is reliably obtained?

Consider the case of Gino

Francesca Gino

The case of Gino

  • Francesca Gino was a tenured professor at Harvard Business School, writing on honesty (!)

The case of Gino

  • Several articles were investigated by third parties (Data Colada in particular [1]), and found to be problematic

Data manipulated

The case of Gino

  • At least one of them had manipulated data AFTER it had been collected, BEFORE it had been analyzed.

Data manipulation

Results of manipulation

Generic survey processing

Generic survey processing

Generic survey processing

Generic survey processing

Requiring transparency in academia

Generic survey processing

Verifying transparency in academia

Generic survey processing

Verification by journals

  • Provision (publication of materials) provides transparency
  • Verification (running the analysis again - computational reproducibility) compensates for mistrust/absence of trust

Which journals again

Verification by others

cascad

I4R

Verification by institutions

World Bank RRR

Taking it a step further

Survey flow

Taking it a step further

  • Has been discussed by authors behind Data Colada
  • Survey tool provider (Qualtrics, etc.) exports data, posts checksum (the checksum idea is sketched below)
  • Survey tool provider exports data directly into a trusted repository at the institution; researchers obtain the data from there (with privacy protections)
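
A minimal sketch of the checksum idea in R, assuming the raw export has been saved to a file (the path and file names are illustrative); base R's tools::md5sum() computes the fingerprint that could be posted or deposited:

# Assumption: the raw Qualtrics export was saved to this (illustrative) path
raw_file <- file.path("data", "raw", "qualtrics_export.csv")

# Compute an MD5 fingerprint of the file with base R (tools package)
checksum <- tools::md5sum(raw_file)
print(checksum)

# Record the checksum alongside the data so third parties can verify the file later
writeLines(paste(checksum, basename(raw_file)),
           file.path("data", "raw", "CHECKSUMS.md5"))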

Does not prevent all fraud

How to document the full process?

Survey flow

A sketch: Transparency Certified

https://transparency-certified.github.io/

Transparency Certified

Work in progress

  • Working with cascad, several INEXDA members, and others
  • Relying on external certification of data inputs (data catalogs with metadata, checksums)

Creating a survey in Qualtrics

Creating a survey

This is not a full Qualtrics tutorial!

Creating a survey in Qualtrics

You’ll typically have access to a Qualtrics account through your university or organization. Then it is easy to construct a survey using the web tool.

Survey responses in Qualtrics

Responses can be easily checked at a glance in the Data & Analysis tab.

  • You can download data directly from this page
  • But we will NOT do that here; instead, we will use a programmatic approach.

Why?

  • If you do this only once, downloading manually is fine.
  • Do it 2-3 times, you want to program it!

Side-note: Survey definition from Qualtrics

You should not forget to preserve your survey definition!

  • Download a .qsf file (via Tools in Qualtrics) to save and transfer the survey structure and to keep a backup survey template (a programmatic complement is sketched below).
  • You can also export the survey as a Word document (via Tools in Qualtrics) if you want a well-formatted, human-readable version.
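
A related sketch: the qualtRics package can also retrieve the survey's metadata and question definitions through the API, which complements (but does not replace) the .qsf export:

library(qualtRics)

# Fetch survey metadata and question definitions via the API.
# QUALTRICS_SURVEY is the survey ID (defined in the API section below).
survey_meta <- metadata(QUALTRICS_SURVEY)
str(survey_meta, max.level = 1)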

Side-note to side-note: Confidential data in survey definitions!

It is possible that your survey definition itself contains information that you are not allowed to publish:

  • You might be running the survey with a firm, and the firm does not want to be identified
  • You are asking questions about specific products, and the product names are confidential

It is actually hard to de-identify a qsf file. We will not try to do this here, but you should be aware of this issue.

Process workflow

Generic survey processing

Certified survey processing

Processing data in R

Using an API to obtain data

Using APIs

  • An API is a mechanism that enables two software components to communicate with each other
  • APIs can be used to request data or services and get responses without needing to know how the other program works internally (see the sketch below)
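
For illustration, here is what a raw call to such an API could look like from R, using the httr package (a sketch: the /API/v3/surveys endpoint and X-API-TOKEN header follow the Qualtrics v3 REST API; the qualtRics package used later wraps calls like this):

library(httr)

# Assumptions: QUALTRICS_BASE_URL holds your datacenter host, e.g.
# "yourdatacenter.qualtrics.com", and QUALTRICS_API_KEY holds your token
resp <- GET(
  url = paste0("https://", Sys.getenv("QUALTRICS_BASE_URL"), "/API/v3/surveys"),
  add_headers(`X-API-TOKEN` = Sys.getenv("QUALTRICS_API_KEY"))
)

# The JSON response lists the surveys available to this account
str(content(resp, as = "parsed"), max.level = 2)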

Loading data from Qualtrics using an API

To always analyze the most up-to-date survey responses, we load the data directly from the web using the Qualtrics API. We need a few pieces of information.

These parts are public. In fact, the window of time may be important for credibility.

# qualtrics URL components
QUALTRICS_FULL_URL <- "first part of survey URL"

QUALTRICS_SURVEY <- "second part of survey URL, usually starts with SV"

# Keep only responses in the desired window of time (ymd_hms is from lubridate)
library(lubridate)
QUALTRICS_STIME <- ymd_hms("2025-07-01 00:00:01")
QUALTRICS_ETIME <- ymd_hms("2025-08-26 23:59:00")

Loading data from Qualtrics using API

An API token is assigned to your Qualtrics account and is used to request data from a survey.

However, this token is a secret, and you don't want it appearing in your published code!

Solutions

  • Set it manually:
Sys.setenv(QUALTRICS_API_KEY = "your-token")
  • Set it using environment variables stored outside your code (e.g., in an .Renviron file)
    • That is how we do it for this presentation during development!
QUALTRICS_API_KEY="your-token"
  • We can also push these “secrets” to GitHub Secrets and load them in GitHub Actions
    • This is how we do it in this tutorial - see source code!
# Here environment variables are read from .Renviron
QUALTRICS_API_KEY <- Sys.getenv("QUALTRICS_API_KEY")

Loading data from Qualtrics using API

After setting the API token, we use it to pull the data from the survey server

library(qualtRics)

if (Sys.getenv("QUALTRICS_API_KEY") != "") {
  data.raw <- fetch_survey(surveyID = QUALTRICS_SURVEY, verbose = TRUE)
} else {
  stop("Please set the QUALTRICS_API_KEY environment variable to your API key.")
}
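
The qualtRics package reads these environment variables; if they are not picked up automatically, its credential helper can register them for the session (a sketch; QUALTRICS_BASE_URL is assumed to hold your datacenter URL, as in the .Renviron example later):

library(qualtRics)

# Register the Qualtrics credentials for this session (assumes the
# QUALTRICS_API_KEY and QUALTRICS_BASE_URL environment variables are set)
qualtrics_api_credentials(
  api_key  = Sys.getenv("QUALTRICS_API_KEY"),
  base_url = Sys.getenv("QUALTRICS_BASE_URL")
)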

Cleaning data

  • We filter the data to only include those who consented
  • We remove survey preview responses
  • (Optionally) remove responses that took place outside the relevant window.
  • Remove confidential data (variables name_confidential and number_confidential in our survey, for example).
library(dplyr)

data <- data.raw |>
  filter(consent == "Yes") |>
  filter(Status != "Survey Preview") |>
  filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
  select(StartDate, EndDate, Status, Finished, RecordedDate,
         ResponseId, consent, age_1, gender, education,
         num_tabs_1, name_confidential, number_confidential)

clean_data <- data |>
  select(-name_confidential, -number_confidential)

Cleaning data by selection

We could also simply not ever select the confidential data if we don’t actually need it.

 clean_data <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
    ResponseId,consent,age_1,gender,education,num_tabs_1)

Using confidential data

We could also (hypothetically) immediately compute variables that rely on confidential data.

 clean_data <- data.raw |>
    filter(consent == "Yes") |>
    filter(Status != "Survey Preview") |>
    filter(StartDate > QUALTRICS_STIME & EndDate < QUALTRICS_ETIME) |>
    select(StartDate,EndDate,Status,Finished,RecordedDate,
    ResponseId,consent,age_1,gender,education,num_tabs_1,
           gps_lat, gps_lon) |>
    mutate(distance = compute_distance_from_cornell(
    gps_lat,gps_lon,precision="100m")) |>
    select(-gps_lat, -gps_lon)  
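
The compute_distance_from_cornell() call above is a placeholder, not a real package function. A minimal sketch of what such a helper could look like, using the haversine formula (the Cornell coordinates and the handling of precision are assumptions for illustration):

# Hypothetical helper: great-circle distance (in km) from Cornell's campus.
compute_distance_from_cornell <- function(lat, lon, precision = "100m") {
  cornell_lat <- 42.447    # approximate campus coordinates
  cornell_lon <- -76.483
  to_rad <- function(deg) deg * pi / 180
  dlat <- to_rad(lat - cornell_lat)
  dlon <- to_rad(lon - cornell_lon)
  a <- sin(dlat / 2)^2 +
       cos(to_rad(cornell_lat)) * cos(to_rad(lat)) * sin(dlon / 2)^2
  d_km <- 6371 * 2 * asin(pmin(1, sqrt(a)))
  # "precision" is illustrative: 100m corresponds to one decimal place in km
  if (precision == "100m") round(d_km, 1) else d_km
}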

Saving confidential and clean data

Lastly, we save the confidential cleaned data (if we need it) to a folder that is not published, and the cleaned, publishable data to the relevant public folder.

# save confidential data NOT for publishing, if needed
write.csv(data, file.path(confdata, "confidential_data.csv"),
          row.names = FALSE)

# saving clean data for publishing
write.csv(clean_data, file.path(publicdata, "clean_data.csv"),
          row.names = FALSE)

Descriptive statistics

Now you are ready to publish and use your cleaned data for reproducible analyses!

Survey results

Now, let’s go look at our survey results.

QUALTRICS_STIME=2026-01-12 00:00:01
QUALTRICS_ETIME=2026-02-14 23:59:00


── Column specification ────────────────────────────────────────────────────────
cols(
  StartDate = col_datetime(format = ""),
  EndDate = col_datetime(format = ""),
  Status = col_character(),
  Progress = col_double(),
  `Duration (in seconds)` = col_double(),
  Finished = col_logical(),
  RecordedDate = col_datetime(format = ""),
  ResponseId = col_character(),
  DistributionChannel = col_character(),
  UserLanguage = col_character(),
  consent = col_character(),
  age_1 = col_double(),
  gender = col_character(),
  education = col_character(),
  num_tabs_1 = col_double(),
  name_confidential = col_character(),
  number_confidential = col_character()
)
'StartDate', 'EndDate', and 'RecordedDate' were converted without a specific timezone
• To set a timezone, visit https://www.qualtrics.com/support/survey-platform/managing-your-account/
• Timezone information is under 'User Settings'
• See https://api.qualtrics.com/instructions/docs/Instructions/dates-and-times.md for more
gender                            Frequency  Percent
Male                                      5       50
Female                                    5       50

education                         Frequency  Percent
Secondary or less                         1       10
Master’s degree                           5       50
Professional or doctoral degree           4       40

Age

Statistic    Value
Count         9.00
Mean         34.11
Median       30.00
Min          25.00
Max          62.00
Std. Dev.    11.94

Number of tabs open

Statistic    Value
Count        10.00
Mean         15.80
Median       16.00
Min           2.00
Max          27.00
Std. Dev.     9.19

Keeping your secrets secret

Secrets

  • You will want to keep your API key safe using GitHub secrets.

  • Secrets allow you to store sensitive information in your repository environment. You create secrets to use in GitHub Actions workflows.

  • To make a secret available to an action, you must set the secret as an environment variable in your GitHub workflow file.

Storing secrets in .Renviron locally

You can store your Qualtrics secrets in an .Renviron file, kept in the root of your project, that contains the following information (fill in the true values):

QUALTRICS_API_KEY='something here'
QUALTRICS_BASE_URL='url goes here'
DATAVERSE_TOKEN='token goes here'
DATAVERSE_SERVER='https://demo.dataverse.org'
DATAVERSE_DATASET_DOI='doi goes here'

Do not publish this file!

Storing secrets in GitHub

You can use the .Renviron file to set the GitHub Actions secrets with

gh secret set -f .Renviron

instead of using the web interface! (You need the GitHub CLI)

Storing secrets in GitHub

Then in GitHub workflows you can set your environment variables to be used in your code, such as the API key:

echo "QUALTRICS_API_KEY=${{ secrets.QUALTRICS_API_KEY }}" >> $GITHUB_ENV

In your publishable R code, you can then simply refer to the environment variable QUALTRICS_API_KEY, without ever giving away your API token.

You can check that the API key is set using the following code in R:

message(Sys.getenv("QUALTRICS_API_KEY"))

But don’t include this in your published output (i.e., slides like these!)
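
A safer alternative (a minimal sketch) only reports whether the key is set, without echoing its value, and can therefore remain in rendered output:

# Check that the key is set without printing its value
if (nchar(Sys.getenv("QUALTRICS_API_KEY")) == 0) {
  warning("QUALTRICS_API_KEY is not set.")
} else {
  message("QUALTRICS_API_KEY is set.")
}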

Archiving

Overview

Timing

Consider the following questions:

Once you have collected the data - is it really going to change?

Once you have registered your analysis plan - should the processing and analysis really change?

Cycle

Modified Data and Workflow

Let’s consider the preservation part separately:

With reuse

Modified Data and Workflow

Preserve as you go

Modified

Note: Doubtful ethics of others…

I don’t want to be scooped!

Thus, I’m not going to publish my raw data just yet!

What is preservation

Preservation

  • Preservation != publication, != sharing
  • In fact, preservation may mean: not very accessible at all!
  • Preservation is intended to maintain data for tens, even hundreds of years
    • Preservation may involve curation: active transformation of the data for improved accessibility

Stacks

What is publication

Publication typically involves making information about the data, as well as the data themselves, available to others.

  • Publication can initially mean that only metadata (information about the data) is published
  • In some cases, it may be that only metadata is ever published
  • But the metadata will point to how to access the data, how long the data will be preserved, and other salient facts

This all seems so complicated

  • I need to preserve my data for decades!
  • I need to manage the application process for decades!
  • Where do I get that DOI thing?
  • How do I get Google to index my data?

Let’s start

scan

Options for Preservation (1)

Trusted Repositories

Journals and institutions have assessed a number of trusted repositories:

Options for Preservation (2)

Trusted Repositories

What are NOT options for preservation

  • GitHub, GitLab, Bitbucket, etc.
  • Dropbox, Box.com, Google Drive, etc.
  • Your personal website
  • Your university’s departmental website

404

404-gh

Options for Preservation

Here: Demo Dataverse for Lars https://demo.dataverse.org/dataverse/larstest

In one of my day jobs:

openicpsr

Getting started on Dataverse

We will NOT use the regular Dataverse; rather, we will test on the demo site.

  • This also works with Zenodo: https://sandbox.zenodo.org/
  • Check your URL bar! There’s often no other indication that this is not the real Zenodo or Dataverse!

A tutorial of sorts

Remember the API tokens?

QUALTRICS_API_KEY='something here'
QUALTRICS_BASE_URL='url goes here'
DATAVERSE_TOKEN='token goes here'
DATAVERSE_SERVER='https://demo.dataverse.org'
DATAVERSE_DATASET_DOI='doi goes here'

We’re going to need the last three here!

Uploading data to Dataverse

  • From terminal:
# create and activate a Python virtual environment
python3 -m venv venv-dv
source venv-dv/bin/activate
# read the Dataverse credentials from .Renviron into the shell
source .Renviron
# get the uploader script and its dependencies
git clone https://github.com/larsvilhuber/dataverse-uploader
pip install -r dataverse-uploader/requirements.txt
# upload to the Dataverse dataset (see the uploader's README for argument details)
python3 dataverse-uploader/dataverse.py \
   $DATAVERSE_TOKEN $DATAVERSE_SERVER \
   $DATAVERSE_DATASET_DOI . -d data
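
As an alternative sketch from within R, the dataverse package's add_dataset_file() can deposit individual files into the same dataset (this assumes the DATAVERSE_* environment variables from the .Renviron file above, and that publicdata points to the folder with the cleaned data):

library(dataverse)

# Upload the cleaned, publishable file to the (demo) Dataverse dataset.
# Note: the dataverse package expects the server host name,
# e.g. "demo.dataverse.org", so we strip any https:// prefix.
add_dataset_file(
  file    = file.path(publicdata, "clean_data.csv"),
  dataset = Sys.getenv("DATAVERSE_DATASET_DOI"),
  key     = Sys.getenv("DATAVERSE_TOKEN"),
  server  = sub("^https?://", "", Sys.getenv("DATAVERSE_SERVER"))
)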

Automatically from Github Actions

Voilà!

We have a workflow that can automatically download from Qualtrics, and in the same move, upload to Dataverse!

Possible improvements:

The end! Thanks for your attention.

Footnotes

  1. https://datacolada.org/109, https://datacolada.org/110, https://datacolada.org/111, https://datacolada.org/112, https://datacolada.org/114, https://datacolada.org/118

  2. Jones, M. (2024). Introducing Reproducible Research Standards at the World Bank. Harvard Data Science Review, 6(4). https://doi.org/10.1162/99608f92.21328ce3