Reproducibility, Replicability and Open Science at the Canadian Research Data Centre Network

Authors
Affiliations

Grant Gibson

Assistant Director: Research and Evaluation, Canadian Research Data Centre Network, McMaster University

S. Martin Taylor

Former Executive Director, Canadian Research Data Centre Network, McMaster University

Abstract

As an organization that exists at the nexus of academic research and public policy, the Canadian Research Data Centre Network (CRDCN) is committed to enabling projects at the forefront of the many disciplines represented among our users and to supporting robust analyses of public policy that can be used to improve outcomes for Canadians. In that regard, reproducibility and replicability figure as key criteria for assessing strength of research evidence. The CRDCN in partnership with Statistics Canada continues to advocate for and advance the ideal of open science within the constraints of providing access to confidential microdata.

Context

The CRDCN is a network of over 40 Canadian universities that supports access to de-identified microdata from Statistics Canada at one of 33 university-based secure facilities (not dissimilar from the US RDC facilities). The data in the secure facilities are very detailed and include: question-by-question response data from nearly all social and industry surveys conducted by Statistics Canada; governmental program data from a wide variety of sources (Education, Health and Employment being the most prominent); administrative data from the Canada Revenue Agency and Immigration, Refugees and Citizenship Canada; and a large set of linkage keys through which many of these files can be used together. Academic researchers with suitable research questions can generally access the data at no cost, following review of a research proposal and completion of a security clearance. Other researchers can similarly access the data on a cost-recovery basis. Core funding for the facilities and a small central staff is provided by federal grants, Statistics Canada, provincial support, and the partner universities. CRDCN researchers represent more than 25 disciplines, but are concentrated in economics, sociology, epidemiology, and public health.

The infrastructure that the CRDCN operates is currently undergoing a major transition. Whereas we now have site-specific servers that must be updated and maintained locally, two advanced research computing facilities have been built at the University of Waterloo in Ontario and Simon Fraser University in British Columbia. Following security checks and technical pilots they will serve as the backbone computational and storage infrastructure for a new national IT platform. With this transition, the computing environment will be standardized for all users, meaning that software and storage will be managed in a way that is the same regardless of where the user is accessing the data. This environment will also allow users working on most data files to work remotely should they so choose.

As a network dedicated to providing access to confidential microdata in a way that ensures the confidentiality of the respondents, CRDCN is cognizant of our role as it relates to the open science movement. While open data is an important pillar of open science, there are other elements that are not affected by the sensitivity of the data being used. To ensure the credibility and strength of network research, we continue to push for more open methods and tools and for greater data accessibility within the confines of the legislation which governs access. The value of the confidential microdata accessible through CRDCN for answering both scientific and policy questions is unparalleled, and despite not being fully open these data remain an important tool for rigorous and robust analyses.

For several reasons the CRDCN is a fertile test bed for providing strong evidence to advance knowledge and inform policy. Foremost is the intrinsic advantage of microdata – de-identified individual records from surveys and administrative data – which are not subject to the various biases inherent in aggregate records. Second, the breadth and depth of the repository afford a very diverse set of opportunities for analysis over time, space, and content domain. Third, several of the surveys are conducted over repeated cycles, thereby enabling time series analysis. Fourth, the number of linked data files – both between surveys and between surveys and administrative data – further extends the scope for analyses which map on to the complex processes typical of many of the economic, social and health systems that are primary foci of investigation. Fifth is the depth and breadth of expertise represented in the 2400 CRDCN researchers drawn from 42 partner and affiliated institutions across Canada, embracing a wide range of disciplines in the social and health sciences. Underpinning these strengths is the intrinsic robustness of Statistics Canada's microdata collection, curation, and management, which provides a strong foundation for conducting reproducibility and replicability analyses.

Despite these advantages of the CRDCN repository as a fertile test bed, replicability and reproducibility studies are overwhelmingly the exception rather than the rule and, in the case of the latter, conspicuous by their total absence from the Network's extensive bibliography of over 6000 publications. So what are the challenges and disincentives that explain this apparent mismatch between the in-principle potential and the in-practice outcome? And how might those challenges be addressed?

Reproducibility and Replicability in a Secure Environment

Security provisions under the federal Statistics Act in Canada govern access to the confidential data made available through the CRDCN. As such, they cannot be open data, but this does not mean they cannot be FAIR (Wilkinson et al. 2016). Being able to find, access and use the data is, of course, a fundamental and necessary precondition to being able to either replicate or reproduce a study. Core functions of the CRDCN-StatCan partnership are to enable findability, accessibility, interoperability, and reuse given the constraints prescribed by the Statistics Act. In the context of confidential data this constitutes the most important activity, since without it no other steps towards reproduction or replication can proceed.

The code used to conduct the analysis becomes the second critical piece, and it is here that our current efforts to foster reproducibility are targeted. Inside the secure facilities, statistical information compiled from the raw data is vetted by a staff member before it is released, to ensure that there is no residual disclosure that might reveal an individual's information and that the release satisfies the guidelines for statistical information releases for that dataset. Part of this process involves providing supporting documentation, which most often includes the statistical code files used to create the output. This means that, for most projects, the process of building a set of code files that can recreate output is automatic, and researchers have only to request vetting of their already existing code.

Building on this operational advantage to support reproducibility analysis has been the subject of many discussions by a CRDCN working group. While the policies that relate to vetting favour reproducibility, several current policies do not. Most critically, updates to datasets (as a result of rebasing or errors discovered later) are uploaded without preserving a way to use the original data. Therefore, while a researcher might know that the data needed to reproduce a study have been updated, the original version cannot be requested (this does not preclude the possibility of verifying that the original study is replicable with the updated data). Next, there is no mechanism by which code can be shared between projects.

The computational environment inside the secure facilities is tightly controlled, and there is no way to request or move code between research contracts (even for the same researcher). The implication is that, to re-use code, it must be written in such a way that it can also pass the vetting process for release from the facility. While this type of coding is consistent with best practices, many code files will not meet this standard. If the standard is met, the code can be vetted and released at any time during the five-year archival period.

Reproducing an RDC study would follow a three-part process. First, the reproducer would gather the code, either through contact with the author or by accessing code that the author had released as part of a replication package. Next, they would apply for access to the data, including writing a full research proposal. Finally, they would undergo a security screening process with fingerprinting, and pay cost-recovery fees if they were not part of the network. None of the current incentive structures in academia for reproduction or replication accommodates this level of required effort. Indeed, if we search the CRDCN bibliography for "replication", "reproduction" or "reproduce", we find a few studies on fertility rates and some on social inequalities, but these research themes are of course very different from research attempting to reproduce or replicate work done in the RDC facilities. In 2022 a replication workshop was hosted by the Canadian Journal of Economics Data Editor (Marie Connolly), with some replications taking place inside the secure facilities. Her takeaway was that the administrative burden required to set up access for participants was not worthwhile when so many other papers at the journal could be replicated without that level of effort (even though Statistics Canada and the RDC program were supportive of the mission).

We are unlikely to see more institutional support for reproducibility in the CRDCN in the short term given that this existing mechanism by which research can be reproduced is largely unused. Whatever reproducibility looks like in the future, it will almost certainly need to be led by academics and scholarly societies. However, should the demand to reproduce or replicate research arise, satisfying it will require a major shift in the way the program is operated. With well over 150 peer-reviewed articles produced by the Research Data Centres every year (not to mention policy reports, theses, and other research outputs), operational realities preclude reproduction or replication of any more than a small fraction of the body of research.

That said, influential work conducted in the RDC is replicated and extended in the normal course of advancing knowledge. The most high-profile example is the replication and extension of Riddell and Riddell's (2014) Journal of Public Economics research on work requirements in the Self-Sufficiency Project, published in 2020 in the Journal of Labor Economics. The experimental design of the policy was theoretically well-suited to learning about the labour market effects of welfare program incentives, but the replication included statistical controls to account for changes to the post-intervention policy environment, which overturned many of the original findings. The data management and universality of access make this process easier for academics with access to the RDCs. Simply put, the rest of the world is in many ways catching up to where CRDCN researchers have always been. Where data deposits have only relatively recently become routine and/or required elsewhere, researchers using the CRDCN facilities have always been able to request the raw data used by another project.

Conclusion

As a network, CRDCN's ambition is to encourage our researchers to move their work as far along the spectrum of open science as possible, ensuring that the data and research tools that can be made available are made available. At the same time, we advocate that Statistics Canada, as data provider, move as far along the spectrum of FAIR data as possible. We will continue to build capacity within our research community, particularly for early-career researchers, in ways that will let them maximize the vision of "as open as possible". To accomplish this, we provide training and guidance on reproducibility in secure environments, and our Replicability and Reproducibility working group continues to look for ways to enable a research culture at CRDCN that prioritizes open science, including partnerships with journals and administrative support for reproducibility initiatives.

Bibliography

Wilkinson, Mark D, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 1–9.