Confidential code#
What is confidential code?
In the United States, some variables on IRS databases are considered super-top-secret. So you can’t name that-variable-that-you-filled-out-on-your-Form-1040 in your analysis code of same data. (They are often referred to in jargon as “Title 26 variables”).
Your code contains the random seed you used to anonymize the sensitive identifiers. This might allow someone to reverse-engineer the anonymization, and is not a good idea to publish.
You used a look-up table hard-coded in your Stata code to anonymize the sensitive identifiers (replace
anoncounty=1 if county="Tompkins, NY"
). A really bad idea, but yes, you probably want to hide that.Your IT specialist or disclosure officer thinks publishing the exact path to your copy of the confidential 2010 Census data, e.g., “/data/census/2010”, is a security risk and refuses to let that code through.
You have adhered to disclosure rules, but for some reason, the precise minimum cell size is a confidential parameter.
So whether reasonable or not, this is an issue. How do you do that, without messing up the code, or spending hours redacting your code?
Example#
This will serve as an example. None of this is specific to Stata, and the solutions for R, Python, Julia, Matlab, etc. are all quite similar.
Assume that variables q2f
and q3e
are considered confidential by some rule, and that the minimum cell size 10
is also confidential.
set seed 12345
use q2f q3e county using "/data/economic/cmf2012/extract.dta", clear
gen logprofit = log(q2f)
by county: collapse (count) n=q3e (mean) logprofit
drop if n<10
graph twoway n logprofit
The final line is the only line of code that does not contain “confidential” information.
Bad#
A bad example, because it is literally making more work for you and for future replicators, is to manually redact the confidential information with text that is not legitimate code (do not do this!):
set seed NNNNN
use <removed vars> county using "<removed path>", clear
gen logprofit = log(XXXX)
by county: collapse (count) n=XXXX (mean) logprofit
drop if n<XXXX
graph twoway n logprofit
The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.
Better#
Simply replace the confidential data with replacement that are valid placeholders in the programming language of your choice is already better. Here’s the confidential version of the file:
//============ confidential parameters =============
global confseed 12345
global confpath "/data/economic/cmf2012"
global confprofit q2f
global confemploy q3e
global confmincell 10
//============ end confidential parameters =========
set seed $confseed
use $confprofit county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
by county: collapse (count) n=$confemploy (mean) logprofit
drop if n<$confmincell
graph twoway n logprofit
And the following could be the released file, part of the replication package:
//============ confidential parameters =============
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
set seed $confseed
use $confprofit county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
by county: collapse (count) n=$confemploy (mean) logprofit
drop if n<$confmincell
graph twoway n logprofit
While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values, e.g., q2f
, anywhere in the code.
Best#
Main file
Conditional processing
Separate file for confidential parameters which can simply be excluded from disclosure request
Main file main.do
:
//============ confidential parameters =============
capture confirm file "$code/confidential/confparms.do"
if _rc == 0 {
// file exists
include "$code/confidential/confparms.do""
} else {
di in red "No confidential parameters found"
}
//============ end confidential parameters =========
//============ non-confidential parameters =========
global safepath "$rootdir/releasable"
cap mkdir "$safepath"
//============ end parameters ======================
// :::: Process only if confidential data is present
capture confirm file "${confpath}/extract.dta"
if _rc == 0 {
set seed $confseed
use $confprofit county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
by county: collapse (count) n=$confemploy (mean) logprofit
drop if n<$confmincell
save "${safepath}/figure1.dta", replace
} else { di in red "Skipping processing of confidential data" }
//============ at this point, the data is releasable ======
// :::: Process always
use "${safepath}/figure1.dta", clear
graph twoway n logprofit
graph export "${safepath}/figure1.pdf", replace
Auxiliary file $code/confidential/confparms.do"
(not released):
//============ confidential parameters =============
global confseed 12345
global confpath "/data/economic/cmf2012"
global confprofit q2f
global confemploy q3e
global confmincell 10
//============ end confidential parameters =========
Auxiliary file $code/include/confparms_template.do
(this is released):
//============ confidential parameters =============
// Copy this file to $code/confidential/confparms.do and edit
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
Thus, the best replication package would have:
...
code/main.do
README.md
include/confparms_template.do
releasable/figure1.dta
releasable/figure1.pdf