Data Sharing and Archiving for Reproducibility (RT2 2021)

Lars Vilhuber
2021-09-03

Cornell University

Overview

  • The Data Lifecycle (abbreviated) (15:00)
  • Data Sharing via Archives: Hands-on (20:00)
  • What if Data are Sensitive? (15:00)
  • Licensing for Ethical Data Sharing (10:00)

We need Data!

Background

Goals of this tutorial

  • Goal 1: Be able to curate the data necessary for reproducible analysis
  • Goal 2: Know when to do so
  • Goal 3: Choose a license (while respecting ethics)

Goal 1: Elements: Data (where possible)

  • Old method: send the journal a ZIP file

  • Source: Your laptop

  • Destination: random file on a journal website

Questions/What-ifs:

  • the data is not on your laptop?
    • too big
    • on server
    • a database
  • the data is not yours to send
    • confidentiality
    • proprietary
    • other licensing issues

Questions/What-ifs:

  • how did the data get to your laptop?
  • how did the data get generated?

These are provenance questions.
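
One way to make the provenance questions answerable later is to write down, at the moment a file arrives, where it came from and what it contained. The following is a minimal sketch, not a standard: the function names `provenance_record` and `append_record`, the log file name `provenance.jsonl`, and the field names are all hypothetical choices made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(path, source, note=""):
    """Describe where a data file came from and what it contained on arrival."""
    data = Path(path).read_bytes()
    return {
        "file": Path(path).name,
        "source": source,  # URL, agency, or person the file was obtained from
        "retrieved": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the contents
        "bytes": len(data),
        "note": note,
    }


def append_record(record, log_path="provenance.jsonl"):
    """Append one JSON record per line, so the log grows with each acquisition."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Usage might look like `append_record(provenance_record("survey.csv", source="https://example.org/survey.csv"))`, where the file name and URL are placeholders. A record like this does not answer how the data were generated; that still has to be documented by whoever created them.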

Questions/What-ifs:

  • is the ZIP file complete?
  • are the ZIP file contents curated (preserved)?
  • can the data be re-used?
  • can the data be properly attributed to the creator?
  • can the data be found independently of the article?

These are FAIR questions.
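
The first question (is the ZIP file complete?) is the only one of these that a short script can check directly. The following is a minimal sketch, assuming the author kept a manifest of file names and SHA-256 checksums when the archive was built. The function names `zip_manifest` and `missing_or_changed` are hypothetical. The sketch says nothing about findability, attribution, or re-use, which depend on where and how the data are deposited.

```python
import hashlib
import zipfile


def zip_manifest(zip_path):
    """Return {member name: SHA-256 hex digest} for every file in the archive."""
    manifest = {}
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories carry no content to checksum
            manifest[info.filename] = hashlib.sha256(zf.read(info)).hexdigest()
    return manifest


def missing_or_changed(expected, actual):
    """Compare two manifests; list files that are absent or whose contents differ."""
    missing = sorted(name for name in expected if name not in actual)
    changed = sorted(
        name for name in expected
        if name in actual and expected[name] != actual[name]
    )
    return missing, changed
```

Here `expected` would be the manifest saved when the archive was created and `actual` the manifest computed from the copy that was received. Two empty lists mean every listed file is present and unchanged.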

FAIR Data Principles

  • Findable
  • Accessible
  • Interoperable
  • Reusable

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

The point of FAIR principles

“Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.”

“FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.”

(Wilkinson et al., 2016)

The Data Lifecycle

  • Amorphous concept, with no clear consensus
  • Might involve destruction
    • What is the value of data?
    • Who decides on the value of data?

Industry-proposed data lifecycle: