2026-06-11

labordynamicsinstitute.github.io/reproducibility-confidential/presentation (PDF)
Journals require that you share your code and data in a replication package at the end of your research project.
Following some best practices from day 1 can not only help you prepare this package later, but also make you more productive researchers.
Following some best practices before releasing a package can avoid costly revisions.
When typing “Following some best practices before releasing a package can avoid costly revisions.”, my coding AI suggested that I add “and embarrassing retractions”…


You are your first replicator…
Good habits from day 1 pay off long before the journal asks:
merge_v3_final_FINAL.do quit academia
The replication package layout is your working layout; adopt it from day 1:
- Separating confidential/ from public/ makes disclosure easy
- Separating code/fsrdc/ from code/public/ shows what runs where
- run.sh documents the order of operations
If you can tell, from the path alone, whether a file can be released, you have organized well.
… and your FSRDC disclosure officer will be happy!
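The "path alone" rule can be checked mechanically. The following is a minimal Python sketch, assuming that any path component named `confidential` marks a file as not releasable; the file names in the usage note are hypothetical, and code under `code/fsrdc/` is treated as releasable because all code is meant to be part of the package.

```python
from pathlib import PurePosixPath

# Assumption: a folder literally named "confidential" anywhere in the
# path marks a file as not releasable. Adjust to your own layout.
CONFIDENTIAL_PARTS = {"confidential"}


def is_releasable(path: str) -> bool:
    """Return True if no component of the path marks it as confidential."""
    parts = PurePosixPath(path).parts
    return not any(part in CONFIDENTIAL_PARTS for part in parts)


def split_for_disclosure(paths):
    """Split a list of paths into (releasable, withheld) lists."""
    releasable = [p for p in paths if is_releasable(p)]
    withheld = [p for p in paths if not is_releasable(p)]
    return releasable, withheld
```

For example, `is_releasable("code/fsrdc/02-analysis.do")` is true, while `is_releasable("data/confidential/extract.dta")` is false. Such a check does not replace disclosure review; it only tells you which files are candidates.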
The final replication package
Contents of a package (context: FSRDC)
All code, whether used in RDC or not
All public data, whether used in RDC or not
NONE of the confidential data present in the RDC
git
- Use git (or any VCS) for code from the very first line
git!
git convert?
Some RDCs have no git. You can still version:
- Never allow mycode-final_FINAL.R to be the code you are working on!
- If 01-prepare-data.R is your code, then back up to 01-prepare-data-2024-06-01.R.
- Zip your code into code-2024-06-01.zip and copy the zip files into an archive folder
Please do learn git!
Go do the Carpentries’ Git lesson
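Where git is not available, the dated-backup habit can be scripted so that it costs nothing. This is a minimal Python sketch using only the standard library; the function names are made up for illustration, and the archive folder should live outside the code folder being zipped.

```python
import shutil
from datetime import date
from pathlib import Path


def dated_copy(source: Path, today: date) -> Path:
    """Copy e.g. 01-prepare-data.R to 01-prepare-data-2024-06-01.R."""
    target = source.with_name(f"{source.stem}-{today.isoformat()}{source.suffix}")
    shutil.copy2(source, target)
    return target


def dated_zip(code_dir: Path, archive_dir: Path, today: date) -> Path:
    """Zip the contents of code_dir as <name>-YYYY-MM-DD.zip inside archive_dir.

    archive_dir is assumed to be outside code_dir.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    base = archive_dir / f"{code_dir.name}-{today.isoformat()}"
    return Path(shutil.make_archive(str(base), "zip", root_dir=code_dir))
```

The file you edit keeps its plain name; only the backups carry dates.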
Your README is not paperwork you write at the end — it is the memory of the project, kept as you go.
Full description as per the (template) README
Deep dive: README presentation
“But I did not choose my computer environment! They forced me!”
You still need to describe it.
Include the qsub files! (Or if you used qstata or such, describe that).
Contents of a package (context: FSRDC)
The package must run from raw inputs to every number in the paper — no manual steps, no hand edits.
- Use estout, graph export, regsave: never copy-paste from the console
- A main script (run.sh / main.do) that runs it all, top to bottom
All code, whether used in RDC or not, that was used to manipulate the raw data.
Or maybe there is more code…
We will get back to that!
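To make the "no manual steps" rule concrete, here is a minimal Python sketch of a main script that runs every step in order and writes its table to a file rather than to the console. The step names, the output file name `table1.csv`, and the two rows of numbers are placeholders invented for illustration, not results from any project.

```python
import csv
from pathlib import Path


def prepare_data():
    """Stand-in for a data-preparation step; the values are placeholders."""
    return [{"group": "a", "n": 12}, {"group": "b", "n": 30}]


def write_table(rows, path: Path) -> None:
    """Write results to a file programmatically, never by copy-paste."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["group", "n"])
        writer.writeheader()
        writer.writerows(rows)


def main(output_dir: Path) -> Path:
    """Run every step in order, top to bottom."""
    rows = prepare_data()
    table = output_dir / "table1.csv"
    write_table(rows, table)
    return table
```

Calling `main(Path("output"))` regenerates the table from scratch, which is what a replicator will do.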
All code, whether used in RDC or not
Create all final figures and tables outside the RDC!
Why?
A license (licence) is an official permission or permit to do, use, or own something (as well as the document of that permission or permit).
Preserving raw survey data early in research lifecycle (ethically!)
How did you get the data — and how can others?
In order to describe data availability, split into two:
Examples include
- All the results in the paper use confidential microdata from the U.S. Census Bureau. To gain access to the Census microdata, follow the directions here on how to write a proposal for access to the data via a Federal Statistical Research Data Center: https://www.census.gov/ces/rdcresearch/howtoapply.html.
- You must request the following datasets in your proposal:
- Longitudinal Business Database (LBD), 2002 and 2007
- Foreign Trade Database – Import (IMP), 2002 and 2007
- Annual Survey of Manufactures (ASM), including the Computer Network Use Supplement (CNUS), 1999
- […]
- Annual Survey of Magical Inputs (ASMI), 2002 and 2007
- Reference
- “Technology and Production Fragmentation: Domestic versus Foreign Sourcing” by Teresa Fort, project number br1179 in the proposal. This will give you access to the programs and input datasets required to reproduce the results. Requesting a search of archives with the articles DOI (“10.1093/restud/rdw057”) should yield the same results.
NOTE: Project-related files are available for 10 years as of 2015.
The information used in the analysis combines several Danish administrative registers (as described in the paper). The data use is subject to the European Union’s General Data Protection Regulation (GDPR) per new Danish regulations from May 2018. The data are physically stored on computers at Statistics Denmark and, due to security considerations, the data may not be transferred to computers outside Statistics Denmark.
Researchers interested in obtaining access to the register data employed in this paper are required to submit a written application to gain approval from Statistics Denmark. The application must include a detailed description of the proposed project, its purpose, and its social contribution, as well as a description of the required datasets, variables, and analysis population.
Applications can be submitted by researchers who are affiliated with Danish institutions accepted by Statistics Denmark, or by researchers outside of Denmark who collaborate with researchers affiliated with these institutions.
(Example taken from Fadlon and Nielsen, AEJ:Applied 2021).
Also grant permission to your project files:
I grant any researchers with appropriate Census-approved project permission to use my exact research files provided that those files were among the ones that they requested when the approval was obtained (a Census Bureau requirement). These files can be found by searching for the DOI of [this archive/ this article] amongst backups/archives made in [month of archive].
Bureau of the Census. (release year). American Community Survey-Master Address File Crosswalk YYYY-YYZZ [Data File]. Federal Statistical Research Data Center [distributor].
Graf, Tobias; Grießemer, Stephan; Köhler, Markus; Lehnert, Claudia; Moczall, Andreas; Oertel, Martina; Schmucker, Alexandra; Schneider, Andreas; Seth, Stefan; Thomsen, Ulrich; vom Berge, Philipp (2023): “Weakly anonymous Version of the Sample of Integrated Labour Market Biographies (SIAB) – Version 7521 v1”. Research Data Centre of the Federal Employment Agency (BA) at the Institute for Employment Research (IAB). https://doi.org/10.5164/IAB.SIAB7521.de.en.v1

Disclosure review is not a last-minute chore — design for it:
- The confidential/ folder is never disclosed
Store secrets in environment variables or files that are not published.
GitHub secret scanning
Typed interactively (here for Linux and Mac)
(this is not recommended)
Same syntax used for contents of “dot-env” or “Renviron” files, and in fact bash or zsh startup files (.bash_profile, .zshrc)
Edit .Renviron (note the dot!) files:
Use the variables defined in .Renviron:
Loading regular environment variables:
Loading with dotenv
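As a minimal Python sketch of both loading styles: regular environment variables are read with `os.environ`, and a dot-env style file of `KEY=value` lines can be loaded into the environment. In practice the `python-dotenv` package (`from dotenv import load_dotenv`) is the usual tool for the latter; the simplified parser below is a stand-in so that the example needs only the standard library, and `MY_API_KEY` is a hypothetical variable name.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=value lines; skip blanks and # comments. Simplified on purpose."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path, environ=os.environ):
    """Copy variables from a dot-env style file into the environment,
    without overriding variables that are already set."""
    with open(path, encoding="utf-8") as handle:
        for key, value in parse_env_lines(handle).items():
            environ.setdefault(key, value)


# Regular environment variables: read the secret, never hard-code it.
# MY_API_KEY is a hypothetical name; the result is None if it is not set.
api_key = os.environ.get("MY_API_KEY")
```

The file holding the secrets stays out of the published package; only the variable names appear in the code.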
Yes, this also works in Stata
and via (what else) a user-written package for loading from files:
git, there are some other acceptable solutions
//============ non-confidential parameters =========
include "config.do"
//============ confidential parameters =============
capture confirm file "$code/confidential/confparms.do"
if _rc == 0 {
    // file exists
    include "$code/confidential/confparms.do"
} else {
    di in red "No confidential parameters found"
}
//============ end confidential parameters =========
(e.g., replace anoncounty=1 if county=="Tompkins, NY").
A really bad idea, but yes, you probably want to hide that.
So whether reasonable or not, this is an issue. How do you do that, without messing up the code, or spending hours redacting your code?
Assume that q2f and q3e are considered confidential by some rule, and that the minimum cell size 10 is also confidential.
Only one line does not contain “confidential” information.
A bad approach, because it literally makes more work for you and for future replicators, is to manually redact the confidential information with text that is not legitimate code:
The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.
Simply replacing the confidential values with valid placeholders in the programming language of your choice is already better. Here’s the confidential version of the file:
//============ confidential parameters =============
global confseed 12345
global confpath "/data/economic/cmf2012"
global confprofit q2f
global confemploy q3e
global confmincell 10
//============ end confidential parameters =========
set seed $confseed
// load both confidential variables, since both are used below
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit
and this could be the released file, part of the replication package:
//============ confidential parameters =============
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
set seed $confseed
// load both confidential variables, since both are used below
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit
While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values, e.g., q2f, anywhere in the code.
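Once every confidential value lives in a `global` definition, the released placeholder version can be produced mechanically rather than by hand. The following Python sketch is one way to do that, under stated assumptions: it is meant for a file that contains only confidential parameters, it replaces the value of every `global` line (and drops any trailing comment on that line), and its output still needs a human look before disclosure review.

```python
import re

# Matches lines such as:  global confseed 12345
GLOBAL_LINE = re.compile(r"^(\s*global\s+\w+\s+)(.*)$")


def redact_line(line: str) -> str:
    """Replace the value of a Stata `global` definition with a placeholder."""
    match = GLOBAL_LINE.match(line)
    if not match:
        return line  # comments and other lines pass through unchanged
    prefix, value = match.groups()
    # Keep quoted values quoted, so the placeholder stays valid Stata syntax.
    placeholder = '"XXXX"' if value.strip().startswith('"') else "XXXX"
    return prefix + placeholder


def redact_parameters(text: str) -> str:
    """Redact every `global` definition in a parameter file."""
    return "\n".join(redact_line(line) for line in text.splitlines())
```

Because the placeholders sit in exactly the same place as the real values, a replicator with legitimate access only needs to fill them back in.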
Main file main.do:
//============ confidential parameters =============
capture confirm file "$code/confidential/confparms.do"
if _rc == 0 {
// file exists
include "$code/confidential/confparms.do"
} else {
di in red "No confidential parameters found"
}
//============ end confidential parameters =========
//============ non-confidential parameters =========
global safepath "$rootdir/releasable"
cap mkdir "$safepath"
//============ end parameters ======================
Main file main.do (continued)
// :::: Process only if confidential data is present
capture confirm file "${confpath}/extract.dta"
if _rc == 0 {
set seed $confseed
// load both confidential variables, since both are used below
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
save "${safepath}/figure1.dta", replace
} else {
di in red "Skipping processing of confidential data"
}
//============ at this point, the data is releasable ======
// :::: Process always
use "${safepath}/figure1.dta", clear
graph twoway scatter n logprofit
graph export "${safepath}/figure1.pdf", replace
Auxiliary file $code/confidential/confparms.do (not released)
Auxiliary file $code/include/confparms_template.do (this is released)
//============ confidential parameters =============
// Copy this file to $code/confidential/confparms.do and edit
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
Thus, the replication package would have:
data in your code
We often see code that “fixes” problems in the data by hard-coding a mapping:
The information in columns name or county might be confidential.
By coding this information as part of your programs, you have made the code confidential!
As before, you might move this code into a separate file:
If you realize that the mapping is actually data, then treating it as any other data (much of which might also be confidential) is both
while being secure.
library(dplyr)
library(readr)
# apply the mapping only when the confidential file is available
if (file.exists("data/confidential/names_mapping.csv")) {
  names_confidential %>%
    left_join(read_csv("data/confidential/names_mapping.csv"), by = "name") %>%
    # replace name with name_alt if the latter is not NA
    mutate(name = if_else(!is.na(name_alt), name_alt, name)) %>%
    # drop the name_alt column
    select(-name_alt) -> names_clean
}
- Software and versions (Stata 17, R 4.5.3, Python 3.12.5)
- Stata: creturn list, which <pkgname>
- R: sessionInfo(), installed.packages()
Project-local package environments keep one (sub-)project from breaking another, and travel with your code.
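The version-recording commands for Stata and R have a Python analogue. This is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the function name and the idea of saving the report next to the README are assumptions for illustration.

```python
import platform
import sys
from importlib import metadata


def environment_report() -> str:
    """List the Python version and installed packages, one per line."""
    lines = [f"Python {platform.python_version()} on {sys.platform}"]
    packages = sorted(
        {
            (dist.metadata["Name"] or "", str(dist.version))
            for dist in metadata.distributions()
        },
        key=lambda item: item[0].lower(),
    )
    # Skip distributions whose metadata has no usable name.
    lines.extend(f"{name}=={version}" for name, version in packages if name)
    return "\n".join(lines)
```

Writing `environment_report()` to a text file at the end of each full run gives the README a record of what was actually used.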
Public replication packages are preserved by journals or trusted archives.
What to do with the confidential data?
Treat the full replication package as a puzzle:
- A) is one piece, containing code, public data, and documentation
- B) is the missing piece that completes the picture, and stays in the RDC
B.zip must be included in the public README.
Run it all again, top to bottom!
Now you wait for the replicators to show up!

For each of the datasets, answer the following questions:

Use caching of downloaded data.
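In Python, caching a download might look like the sketch below: the file is fetched only if it is not already on disk. The function name is made up for illustration, and the `download` parameter is an assumption added so that the function can be exercised without network access.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_cached(url: str, target: Path, download=urlretrieve) -> bool:
    """Download url to target only if target is missing.

    Returns True if a download happened, False if the cached copy was used.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    download(url, str(target))
    return True
```

The Stata code that follows applies the same check with `capture confirm file`.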
// :::: Process only if data are present
capture confirm file "${datapath}/dist_cepii.dta"
if _rc == 0 {
di in green "Data file is present, processing data"
} else {
di in red "Downloading data"
copy "$URL" "${datapath}/dist_cepii.dta", replace
}
//============ at this point, the data is available ======
// :::: Process always
use "${datapath}/dist_cepii.dta", clear
// do stuff....
On to confidential data!
In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands.
Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path.
sysdir directories
The default set of directories that can be searched, in a freshly installed Stata, can be queried with the sysdir command, and will look something like this:
adopath search order
The search path where Stata looks for commands is queried with adopath, and looks similar, but now has an order assigned to each entry:
When we install a package (net install, ssc install)3, only one of the (sysdir) paths is relevant: PLUS.
But the (PLUS) directory can be manipulated
* Set the root directory
global rootdir : pwd
* Define a location where we will hold all packages in THIS project (the "environment")
global adodir "$rootdir/ado"
* make sure it exists, if not create it.
cap mkdir "$adodir"
* Now let's simplify the adopath
* - remove the OLDPLACE and PERSONAL paths
* - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen!
adopath - OLDPLACE
adopath - PERSONAL
* modify the PLUS path to point to our new location, and move it up in the order
sysdir set PLUS "$adodir"
adopath ++ PLUS
* verify the path
adopath
So it is no longer found. Why? Because we have removed the previous location (the old PLUS path) from the search sequence. It’s as if it didn’t exist.
When we now install reghdfe again:
We now see it in the project-specific directory, which we can distribute with the whole project.
Let’s imagine we need an older version of reghdfe.
Most package repositories are versioned:
Stata does not (as of 2024). But see the full site for one approach.
From the earlier desiderata of environments:
Some additional guidance can be found on the website of the Social Science Data Editors (URLs subject to change):
net install reference. Strictly speaking, the location where ado packages are installed can be changed via the net set ado command, but this is rarely done in practice, and we won’t do it here.
