2023-09-26

Follow-up

Overview

  • Data Sharing (publishing) via Archives: Hands-on (20:00)
  • What if Data are Sensitive? (15:00)
  • Licensing for Ethical Data Sharing (10:00)

Goals of this tutorial

  • Goal 1: Be able to curate the data and code necessary for reproducible analysis
  • Goal 2: Know when to do so
  • Goal 3: Choose a license (while respecting ethics)

Background

Goal 1

Elements: Data (where possible)

  • Old method: send the journal a ZIP file

  • Source: Your laptop

  • Destination: random file on a journal website

Questions/ What-ifs:

  • the data is not on your laptop?
    • too big
    • on server
    • a database
  • the data is not yours to send
    • confidentiality
    • proprietary
    • other licensing issues

Questions/ What-ifs:

  • how did the data get to your laptop?
  • how did the data get generated?

These are provenance questions.

Questions/ What-ifs:

  • is the ZIP file complete?
  • are the ZIP file contents curated (preserved)?
  • can the data be re-used?
  • can the data be properly attributed to the creator?
  • can the data be found independently of the article?

These are FAIR questions.

FAIR Data Principles

  • Findable
  • Accessible
  • Interoperable
  • Re-Usable

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

The point of FAIR principles

“Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.”

“FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.”

(Wilkinson et al., 2016)

The Data Lifecycle

  • Amorphous thing… no clear consensus
  • Might involve destruction
    • What is the value of data?
    • Who decides on the value of data?

Industry-proposed data lifecycle:

[Figure: Data Lifecycle]

… which might really be a line

[Figure: Linear cycle]

A difficult question

[Figure: National Academies Life-Cycle]

National Academies of Sciences, Engineering, and Medicine. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press, 2020. https://dx.doi.org/10.17226/25639

The Data Lifecycle

  • Amorphous thing… no clear consensus
  • Might involve destruction
  • Might involve re-use
    • which starts the cycle anew

[Figure: Reuse again]

Timing

Consider the following questions:

Once you have collected the data, is it really going to change?

Once you have registered your analysis plan, should the processing and analysis really change?

[Figure: Cycle]

Modified Data and Workflow

Let’s consider the preservation part separately:

[Figure: With reuse]

Modified Data and Workflow

Preserve as you go

[Figure: Modified]

Note: Doubtful ethics of others…

I don’t want to be scooped!

Thus, I’m not going to publish my raw data just yet!

What is preservation

Preservation

  • Preservation != publication, != sharing
  • In fact, preservation may mean: not very accessible at all!
  • Preservation is intended to maintain data for tens, even hundreds of years
    • Preservation may involve curation: active transformation of the data for improved accessibility

[Figure: Stacks]

What is publication

Publication typically involves making information about the data, as well as the data themselves, available to others.

  • Publication can initially mean that only metadata (information about the data) is published
  • In some cases, it may be that only metadata is ever published
  • But the metadata will point to how to access the data, how long the data will be preserved, and other salient facts

FAIR Principles

To be Findable:

  • F1. (meta)data are assigned a globally unique and eternally persistent identifier.
  • F2. data are described with rich metadata.
  • F3. (meta)data are registered or indexed in a searchable resource.
  • F4. metadata specify the data identifier.

To be Accessible:

  • A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
  • A2. metadata are accessible, even when the data are no longer available.
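
As a minimal sketch of A1, assume the identifier is a DOI and the standardized protocol is HTTPS through the public resolver at doi.org. The function names are hypothetical, and the `Accept` value is the CSL JSON media type used in DOI content negotiation; whether a particular registration agency honors it is an assumption. The request is only built here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable HTTPS URL."""
    return DOI_RESOLVER + quote(doi, safe="/")


def metadata_request(doi: str) -> Request:
    """Build (but do not send) a request for machine-readable metadata."""
    return Request(
        doi_to_url(doi),
        # Assumption: the resolver supports content negotiation for this type.
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


# The DOI of the FAIR paper cited earlier in these slides.
req = metadata_request("10.1038/sdata.2016.18")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual lookup; that step is left out so the sketch runs offline.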

To be Interoperable:

  • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • I2. (meta)data use vocabularies that follow FAIR principles.
  • I3. (meta)data include qualified references to other (meta)data.

To be Re-usable:

  • R1. (meta)data have a plurality of accurate and relevant attributes.
  • R1.1. (meta)data are released with a clear and accessible data usage license.
  • R1.2. (meta)data are associated with their provenance.
  • R1.3. (meta)data meet domain-relevant community standards.
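
One way to make these principles concrete is a small metadata record plus a completeness check. This is a rough sketch only: the field names, the mapping from fields to principles, and the example DOI are all hypothetical, not a standard schema.

```python
# Hypothetical field names, loosely mapped to the principles listed above.
REQUIRED_FIELDS = {
    "identifier": "F1: globally unique, persistent identifier",
    "title": "F2: rich descriptive metadata",
    "description": "F2: rich descriptive metadata",
    "license": "R1.1: clear data usage license",
    "provenance": "R1.2: where the data came from",
}


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty, sorted by name."""
    return sorted(name for name in REQUIRED_FIELDS if not record.get(name))


record = {
    "identifier": "10.1234/example-doi",  # hypothetical DOI
    "title": "Survey responses, wave 1",
    "description": "Cleaned survey data used in the analysis.",
    "license": "",  # not chosen yet
}

for name in missing_fields(record):
    print(f"missing {name} ({REQUIRED_FIELDS[name]})")
```

A record like this can be published even when the data themselves cannot be shared.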

FAIR Metadata when data are not shareable

What kind of access conditions?

In decreasing order of “freely available”

  • Freely available (waive copyright)
  • Attribution requested (e.g., citation): for instance, “CC-BY”
  • Available only to university researchers
  • Available after embargo
  • Available after application process handled by a data provider
    • Only checks for legal compliance
  • Available with permission of the original researchers
    • Checks for why you use it
  • Only available if you are called “Lars”
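
The ordering above can be written down as an ordered enumeration, so that access conditions can be compared by machines as well as by people. A minimal sketch; the level names are made up for illustration.

```python
from enum import IntEnum


class Access(IntEnum):
    """Access conditions, from most to least freely available (hypothetical names)."""

    FREELY_AVAILABLE = 0
    ATTRIBUTION_REQUESTED = 1
    UNIVERSITY_RESEARCHERS_ONLY = 2
    AFTER_EMBARGO = 3
    PROVIDER_APPLICATION = 4
    RESEARCHER_PERMISSION = 5
    ONLY_IF_NAMED_LARS = 6


def is_more_open(a: Access, b: Access) -> bool:
    """True if condition `a` is more freely available than condition `b`."""
    return a < b


datasets = {
    "survey": Access.AFTER_EMBARGO,
    "admin-records": Access.PROVIDER_APPLICATION,
    "replication-code": Access.ATTRIBUTION_REQUESTED,
}

# List the datasets from most to least open.
for name, level in sorted(datasets.items(), key=lambda item: item[1]):
    print(name, level.name)
```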

This all seems so complicated

  • I need to preserve my data for decades!
  • I need to manage the application process for decades!
  • Where do I get that DOI thing?
  • How do I get Google to index my data?

Let’s start

[Figure: scan]

Imperfect example

  • Original data
  • Analysis data
  • Analysis code
  • Analysis output
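
Before depositing these four elements, it can help to record a checksum manifest, so that "is the package complete?" has a checkable answer later. A minimal sketch, assuming a hypothetical folder layout; the file names are invented and the demo runs in a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under `root` (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: original data, analysis data, code, output.
    for folder in ("data/original", "data/analysis", "code", "output"):
        (root / folder).mkdir(parents=True)
    (root / "data/original/survey.csv").write_text("id,value\n1,3.5\n")
    (root / "code/analysis.py").write_text("print('hello')\n")
    for name, checksum in build_manifest(root).items():
        print(checksum[:12], name)
```

Saving the manifest alongside the deposit lets anyone re-run the same function and compare.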

Options for Preservation (1)

Trusted Repositories

Journals and institutions have assessed a number of trusted repositories:

  • CoreTrustSeal has a certification process
  • re3data.org lists research data repositories
  • Nature, F1000Research, and PLOS have lists of trusted repositories.
  • Always check with your journal for specific restrictions or suggestions.

Options for Preservation (2)

What are NOT options for preservation

  • Github, Gitlab, Bitbucket, etc.
  • Dropbox, Box.com, Google Drive, etc.
  • Your personal website
  • Your university’s departmental website

[Figure: 404]

[Figure: 404-gh]

Options for Preservation

Here: Sandbox for Zenodo

[Figure: Zenodo]

In one of my day jobs:

[Figure: openICPSR]

Getting started on Zenodo

We will NOT use the regular Zenodo; rather, we will test in the Sandbox.

https://sandbox.zenodo.org/

Check your URL bar! There’s no other indication that this is not the real Zenodo!
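
If you script against Zenodo rather than click through it, the same check can be automated. A minimal sketch that refuses to prepare a deposit unless the host is the Sandbox. The function names and the token are hypothetical; the `/api/deposit/depositions` path and bearer-token header follow my understanding of Zenodo's REST API and should be checked against its documentation. Nothing is sent over the network.

```python
from urllib.parse import urlsplit
from urllib.request import Request

SANDBOX_HOST = "sandbox.zenodo.org"


def is_sandbox(url: str) -> bool:
    """True only if the URL points at the Zenodo Sandbox host."""
    return urlsplit(url).hostname == SANDBOX_HOST


def new_deposition_request(base_url: str, token: str) -> Request:
    """Build (but do not send) a request that would create an empty deposit."""
    if not is_sandbox(base_url):
        raise ValueError(f"refusing to test against {base_url!r}: not the Sandbox")
    return Request(
        base_url.rstrip("/") + "/api/deposit/depositions",  # assumed endpoint
        data=b"{}",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = new_deposition_request("https://sandbox.zenodo.org/", "MY-SANDBOX-TOKEN")
print(req.get_method(), req.full_url)
```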

Do it now
