Data Sharing and Archiving for Reproducibility (Cornell 2024)

Lars Vilhuber
2024-11-13

Cornell University

Overview

  • The Data Lifecycle (abbreviated) (15:00)
  • Data Sharing via Archives: Hands-on (20:00)
  • What if Data are Sensitive? (15:00)
  • Licensing for Ethical Data Sharing (10:00)

We need Data!

Background

Goals of this tutorial

  • Goal 1: Be able to curate the data necessary for reproducible analysis
  • Goal 2: Know when to do so
  • Goal 3: Choose license (while respecting ethics)

Goal 1: Elements: Data (where possible)

  • Old method: send the journal a ZIP file

  • Source: Your laptop

  • Destination: random file on a journal website

Questions / What-ifs:

  • the data are not on your laptop?
    • too big
    • on server
    • a database
  • the data are not yours to send
    • confidentiality
    • proprietary
    • other licensing issues

  • how did the data get to your laptop?
  • how did the data get generated?

These are provenance questions.

  • is the ZIP file complete?
  • are the ZIP file contents curated (preserved)?
  • can the data be re-used?
  • can the data be properly attributed to the creator?
  • can the data be found independently of the article?

These are FAIR questions.
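One way to make "is the ZIP file complete?" answerable is to ship a checksum manifest with the files. The sketch below is only an illustration, not a method prescribed by this tutorial: the function names `sha256_of`, `build_manifest`, and `check_manifest` are made up for the example, and it assumes the replication package sits in one directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents have changed."""
    problems = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {name}")
    return problems
```

Build the manifest when you assemble the package, store it next to the data, and anyone who later receives the files can run `check_manifest` to see whether something was lost or altered.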

FAIR Data Principles

  • Findable
  • Accessible
  • Interoperable
  • Re-Usable

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

The point of FAIR principles

“Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.”

“FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.”

(Wilkinson et al., 2016)

The Data Lifecycle

  • Amorphous thing… no clear consensus
  • Might involve destruction
    • What is the value of data?
    • Who decides on the value of data?

Industry-proposed data lifecycle:

plot of chunk destruction1

… which might really be a line

plot of chunk destruction1-linear

A difficult question

National Academies Life-Cycle

National Academies of Sciences, Engineering, and Medicine, Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs, Washington, DC: The National Academies Press, 2020. https://dx.doi.org/10.17226/25639

The Data Lifecycle

  • Amorphous thing… no clear consensus
  • Might involve destruction
  • Might involve re-use
    • which starts the cycle anew

plot of chunk cycle1-2

Consider the following questions

Once you have collected the data

  • is it really going to change?

Once you have registered your analysis plan

  • should the processing and analysis really change?

Cycle

Modified Data and Workflow

Let's consider the preservation part separately:

plot of chunk cycle1-archive

Preserve as you go

Modified

Improved preservation and consistency

  • Use your own archives!
  • Ability to share earlier (multi-paper projects!)

plot of chunk cycle3

Doubtful ethics of others...

I don’t want to be scooped!

Thus, I’m not going to publish my raw data just yet!

What is preservation?

  • Preservation != publication
  • In fact, preservation may mean: not very accessible at all!
  • Preservation is intended to maintain data for tens, even hundreds of years
    • Preservation may involve curation: active transformation of the data for improved accessibility

Stacks

What is publication?

Publication typically involves making information about the data, as well as the data themselves, available to others.

  • Publication can initially mean that only metadata (information about the data) is published
  • In some cases, it may be that only metadata is ever published
  • But the metadata will point to how to access the data, how long the data will be preserved, and other salient facts

FAIR Principles

To be Findable:

  • F1. (meta)data are assigned a globally unique and eternally persistent identifier.
  • F2. data are described with rich metadata.
  • F3. (meta)data are registered or indexed in a searchable resource.
  • F4. metadata specify the data identifier.

To be Accessible:

  • A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
  • A2. metadata are accessible, even when the data are no longer available.
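A DOI is one example of an identifier that is retrievable over a standard protocol (HTTPS). The sketch below builds, but does not send, a request for machine-readable metadata about the FAIR paper's DOI. It assumes the DOI's registration agency supports content negotiation for the CSL JSON media type, which should be checked against that agency's documentation; the function name `metadata_request` is made up for this example.

```python
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"


def metadata_request(doi: str) -> Request:
    """Build (but do not send) an HTTP request for a DOI's metadata."""
    return Request(
        DOI_RESOLVER + doi,
        # Ask for structured metadata rather than the landing page.
        # Assumption: the registration agency honors this media type.
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


req = metadata_request("10.1038/sdata.2016.18")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the lookup. The point is that the same identifier serves humans (a landing page) and machines (structured metadata).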

To be Interoperable:

  • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • I2. (meta)data use vocabularies that follow FAIR principles.
  • I3. (meta)data include qualified references to other (meta)data.

To be Re-usable:

  • R1. (meta)data have a plurality of accurate and relevant attributes.
  • R1.1. (meta)data are released with a clear and accessible data usage license.
  • R1.2. (meta)data are associated with their provenance.
  • R1.3. (meta)data meet domain-relevant community standards.
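As a rough illustration, a deposit's metadata can be checked against a handful of these principles before publishing. The field names below are hypothetical and do not follow any particular repository's schema; a real check would use the schema of the chosen archive.

```python
# Hypothetical field names, each tied to the principle that motivates it.
REQUIRED_FIELDS = {
    "identifier": "F1: persistent identifier",
    "title": "F2: rich metadata",
    "description": "F2: rich metadata",
    "license": "R1.1: usage license",
    "provenance": "R1.2: provenance",
}


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    return [
        f"{field} ({why})"
        for field, why in REQUIRED_FIELDS.items()
        if not record.get(field)
    ]


# Example record: everything filled in except the license.
record = {
    "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
    "title": "Browser tabs survey",
    "description": "Responses to a one-question classroom survey.",
    "provenance": "Collected via an online form during the tutorial.",
}
print(missing_fields(record))
```

A check like this does not make data FAIR, but it catches the most common omission: a deposit with no stated license.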

Some examples

FAIR Metadata

Interoperable: Structured metadata about the data

icpsr metadata

Accessible: Structured metadata about the deposit

icpsr DC metadata

Findable: persistent identifier, indexed

icpsr metadata

Re-usable: License permits it!

(this was actually hidden in the metadata)

FAIR Metadata when data are not shareable

Access conditions involve an application process.

iab Access metadata

But information ABOUT the access process (=metadata) is available.

What kind of access conditions?

In decreasing order of “freely available”

  • Freely available (waive copyright)
  • Attribution requested (e.g., citation): for instance, “CC-BY”
  • Available only to university researchers
  • Available after embargo
  • Available after application process handled by a data provider
    • Only checks for legal compliance
  • Available with permission of the original researchers
    • Checks for why you use it
  • Only available if you are called “Lars”

This all seems so complicated

  • I need to preserve my data for decades!
  • I need to manage the application process for decades!
  • Where do I get that DOI thing?
  • How do I get Google to index my data?

Let's start

scan

Toy Example

How many browser tabs do you have open?

https://forms.gle/FEpF9RVq56XmesWF9

Survey

  • Survey forms
  • Metadata
  • Sample data
  • Actual data

Safeguarding scientific output

The role of journals is to provide a permanent record of scientific knowledge.

  • how reliable is that record?
  • where are journals stored?
  • what if the information is not in a journal?

old library

Safeguarding scientific output

  • journals disappear, as do websites
  • paper journals are stored in libraries
  • e-journals are stored in a system called LOCKSS (Lots of Copies Keep Stuff Safe)
  • data should be stored in repositories

stacks

These are still fallible

scan

Options for Preservation

Trusted Repositories

Journals and institutions have assessed a number of trusted repositories:

What are NOT options for preservation

  • GitHub, GitLab, Bitbucket, etc.
  • Dropbox, Box.com, Google Drive, etc.
  • Your personal website
  • Your university's departmental website

404

404-gh

Options for Preservation

Here: Sandbox for Zenodo

zenodo

In one of my day jobs:

openicpsr

Getting started on Zenodo

We will NOT use the regular Zenodo; rather, we will test in the Sandbox.

https://sandbox.zenodo.org/

Check your URL bar! There's no other indication that this is not the real Zenodo!
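The same caution applies if you script a deposit instead of using the web form. The sketch below builds, but does not send, a request to create a draft deposit, and refuses any host other than the sandbox. The endpoint path, the metadata field names, and the bearer-token header reflect my understanding of Zenodo's deposit API and should be checked against the current documentation; the token, title, and creator name are placeholders.

```python
import json
from urllib.parse import urlparse
from urllib.request import Request

# Assumed endpoint of the sandbox deposit API; verify before use.
SANDBOX_API = "https://sandbox.zenodo.org/api/deposit/depositions"


def new_deposit_request(token: str, title: str,
                        api_url: str = SANDBOX_API) -> Request:
    """Build (but do not send) a request that would create a draft deposit."""
    host = urlparse(api_url).hostname
    if host != "sandbox.zenodo.org":
        # The scripted equivalent of checking your URL bar.
        raise ValueError(f"refusing to use non-sandbox host: {host}")
    payload = {
        "metadata": {
            "title": title,
            "upload_type": "dataset",
            "description": "Test deposit created during a tutorial.",
            "creators": [{"name": "Doe, Jane"}],  # placeholder creator
        }
    }
    return Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Sandbox accounts and tokens are separate from those on the real Zenodo, so a token created for one will not work on the other.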

Survey: Phase 1

node1

  • Survey forms - ✔️
  • Metadata - ✔️
  • Sample data - ✔️
  • Actual data

Since I have already defined the survey, and created some test data, I can … publish it!

  • I could embargo it so that my survey respondents don't see the full survey (randomization)
  • I could publish it so that I can show I'm serious to my survey respondents!
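If the deposit is scripted, the embargo choice becomes two metadata fields. The field names `access_right` and `embargo_date` below are assumptions based on my understanding of Zenodo's deposit metadata, and `embargo_fields` is a helper made up for this sketch.

```python
from datetime import date
from typing import Optional


def embargo_fields(release: date, today: Optional[date] = None) -> dict:
    """Return metadata fields that keep files closed until `release`."""
    today = today or date.today()
    if release <= today:
        raise ValueError("embargo date must be in the future")
    return {
        "access_right": "embargoed",           # assumed field name
        "embargo_date": release.isoformat(),   # assumed field name
    }


# Example with fixed dates so the result does not depend on the clock.
fields = embargo_fields(date(2025, 6, 30), today=date(2024, 11, 13))
print(fields)
```

Under an embargo the metadata are public immediately, while the files become available only on the release date.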

Do it now

scan

zenodo filelist

Survey: Phase 1

node1

  • Survey forms - ✔️✔️
  • Metadata - ✔️✔️
  • Sample data - ✔️✔️
  • Actual data

zenodo filelist

An aside

Goal: Robustness and automation - getting close to push-button reproducibility

  • (Advanced features of GitHub and GitLab allow us to implement and test that)

Goal: Correctly document reproducible research

  • (also respond to a thesis advisor, referee, editor, or curious journalist asking the question “what has changed?”)

changes

Next step

License