Also see https://labordynamicsinstitute.github.io/SyntheticLEAP.
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about a business than about any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to give researchers wider access to such datasets.
In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (LEAP) and Germany (BHP). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.
@article{alamdostiedrechslervilhuber2020,
title = {Applying Data Synthesis for Longitudinal Business Data across Three Countries},
author = {Alam, M. Jahangir and Dostie, Benoit and Drechsler, J{\"o}rg and Vilhuber, Lars},
year = {2020},
volume = {21},
pages = {212--236},
doi = {10.21307/stattrans-2020-039},
journal = {Statistics in Transition New Series},
language = {en},
number = {4},
}
The key data sources used in the article are described and cited in the article. All source data are confidential and are available on restricted-access servers.
A 50% sample of the BHP is accessible to external researchers through the FDZ (Research Data Center) of the IAB. Applicants must fill out a form, subject to approval, and can access the data via the various access mechanisms of the FDZ, including physical locations in Germany, elsewhere in Europe, and North America.
Synthetic data derived from both confidential datasets were never released to the public, and are accessible only via the same access mechanisms as above. A related synthetic LEAP was made available through the Canadian Research Data Center system, as part of a pilot program, to prepare access to the confidential data. We are not aware of whether it is currently accessible. The outcomes of the pilot program have not been made public yet.
As a small part of the post-processing, we count the (theoretical) number of Canadian NAICS industry groups (Statistics Canada, 2012). The file can be downloaded from https://www.statcan.gc.ca/eng/subjects/standard/naics/2012/index.
## Parsed with column specification:
## cols(
## Level = col_double(),
## `Hierarchical structure` = col_character(),
## Code = col_character(),
## `Class title` = col_character(),
## Superscript = col_character(),
## `Class definition` = col_character()
## )
| Level | Hierarchical structure | Code | Class title | Superscript | Class definition |
|---|---|---|---|---|---|
| Min. :1.000 | Length:2078 | Length:2078 | Length:2078 | Length:2078 | Length:2078 |
| 1st Qu.:4.000 | Class :character | Class :character | Class :character | Class :character | Class :character |
| Median :4.000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character |
| Mean :4.161 | NA | NA | NA | NA | NA |
| 3rd Qu.:5.000 | NA | NA | NA | NA | NA |
| Max. :5.000 | NA | NA | NA | NA | NA |
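The counting step can be sketched outside R as well. The following is a minimal illustration, not the post-processing code itself. It assumes that `Level` is the first column of the structure file (as in the column specification above) and that industry groups sit at level 3 of the five-level hierarchy (4-digit codes); both should be checked against the downloaded file. The function name `count_level` and the small sample file are hypothetical.

```shell
#!/bin/sh
# Count the rows at a given hierarchy level in a NAICS structure CSV.
# Assumptions: Level is the first column and the first line is a header.
# Only the first field is read, so quoted commas in later columns do no harm,
# but a definition containing embedded newlines could still confuse the count.
count_level() {
    # $1 = path to the CSV file, $2 = hierarchy level to count
    awk -F',' -v lvl="$2" 'NR > 1 && $1 == lvl { n++ } END { print n + 0 }' "$1"
}

# Tiny illustrative sample, not the real Statistics Canada file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Level,Hierarchical structure,Code,Class title,Superscript,Class definition
1,Sector,11,"Agriculture, forestry, fishing and hunting",,
2,Subsector,111,Crop production,,
3,Industry group,1111,Oilseed and grain farming,,
3,Industry group,1112,Vegetable and melon farming,,
4,Industry,11111,Soybean farming,,
EOF

# Number of level-3 rows (assumed to be industry groups) in the sample.
count_level "$sample" 3
```

Run against the real file, the same call would report the theoretical number of industry groups, provided the level assumption holds.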
The analytic outcomes described in the article were released through the respective disclosure avoidance mechanisms, subject to disclosure avoidance procedures of each statistical institution. These outcomes, as figures, CSV files, and others, are available in this repository. Some were extracted from figures or released tables by the programs in this directory.
The data directory contains materials released from Statistics Canada and the IAB. It is mostly highly aggregated synthetic data, as well as regression coefficients. All data releases were authorized by the respective statistical agencies.
The graphs directory contains mostly pre-rendered figures released as part of the agency data releases. The programs to generate these figures can be found in programs/Canada, and were run on the confidential data.
GPH (Stata) format files, the source for the PDFs in the graphs directory, are also provided. The programs to generate these figures can be found in programs/Canada, and were run on the confidential data.
Programs for analysis (programs/Canada, used for both Canada and Germany), and post-processing (programs/Post) are provided.
Graphs generated through post-processing (programs/Post) are available in r-graphs.
Tables generated both by tabulation of confidential data (programs/Canada, used for both Canada and Germany), and post-processing (programs/Post) can be found in the tables directory.
The software used to generate the synthetic data is described in Kinney et al. (2011b). A copy of the code can be obtained on request.
The raw synthetic and confidential data served as input to the various analyses described in the paper. These analyses occurred within the secure environments of the respective agencies. The code for the analysis is common to both countries (with minor adjustments to account for different variable names). The code used in the Canadian context is provided as a single Stata file in the [programs/Canada](programs/Canada) directory.
The following programs are used to post-process the analytic results:
Numbered programs should be executed in their natural (numeric) order. Other programs define locations and/or subroutines, and should not be executed on their own. A convenience bash script, run_all.sh, is provided.
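The execution order described above can be sketched as a small wrapper. This is a hypothetical illustration and is not the repository's run_all.sh, which may work differently. The function name `run_numbered`, the `RUNNER` variable, and the use of `Rscript` as the interpreter are all assumptions; by default the sketch only prints what it would run.

```shell
#!/bin/sh
# Hypothetical sketch of a run-all wrapper; the actual run_all.sh may differ.
# RUNNER is the command used to execute each program. Setting it to "Rscript"
# is an assumption; the default "echo" is a dry run that only prints the paths.
RUNNER="${RUNNER:-echo}"

run_numbered() {
    # $1 = directory holding the programs
    dir="$1"
    # Keep only files whose names start with a digit, so that unnumbered
    # helper files (locations, subroutines) are skipped, then sort by the
    # leading number so that 2 comes before 10.
    for f in "$dir"/[0-9]*; do
        if [ -f "$f" ]; then
            basename "$f"
        fi
    done | sort -n | while IFS= read -r name; do
        # Stop at the first program that fails.
        $RUNNER "$dir/$name" || exit 1
    done
}

# Example call (dry run unless RUNNER is set):
# run_numbered programs/Post
```

Sorting on the leading number rather than relying on glob order keeps the sequence correct even when file names are not zero-padded.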
Vilhuber acknowledges funding through NSF Grants SES-1131848 and SES-1042181, and a grant from the Alfred P. Sloan Foundation (G-2015-13903). Alam and Dostie acknowledge funding through SSHRC Partnership Grant “Productivity, Firms and Incomes”. The creation of the Synthetic LBD was funded by NSF Grant SES-0427889.
These data are licensed under a Creative Commons Attribution-NonCommercial 4.0 International license. See citation for attribution.