3 Formal Privacy

The literature on formal privacy is vast. Here, we focus on papers that orient the reader to the key ideas and tools of differential privacy, particularly those relevant to the economic problem of determining optimal privacy protection when publishing population statistics. For the purpose of this reading course and associated bibliography, we identify formal privacy with a literature that emerged in computer science out of cryptography. In the section on “Statistical Disclosure Limitation”, we recommend additional readings from the SDL literature, which has a distinct origin in statistics.

Dwork, McSherry, Nissim, & Smith (2006) is generally regarded as the first paper to formally introduce differential privacy. Its development was due, in part, to the database reconstruction theorem of Dinur & Nissim (2003), which showed that answering too many queries too accurately allows an attacker to reconstruct nearly all of the underlying confidential database, rendering most data publication mechanisms “blatantly non-private” in a well-defined sense. The database reconstruction theorem has recently been shown to have very practical consequences for the statistical disclosure protections in place at the Census Bureau. The concepts of k-anonymity (Sweeney, 2002) and l-diversity (Machanavajjhala, Kifer, Gehrke, & Venkitasubramaniam, 2007) are important antecedents that bridge the formal privacy and SDL literatures.
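To fix ideas: a randomized mechanism M is ε-differentially private if, for every pair of databases D and D′ differing in one individual’s record and every set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. The canonical construction in Dwork, McSherry, Nissim, & Smith (2006) adds Laplace noise calibrated to a query’s sensitivity, i.e., the most any one record can change the answer. The Python sketch below illustrates the idea for a counting query; the function and variable names are ours, not from the paper.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return an epsilon-differentially private answer to a numeric query.

    Noise is drawn from Laplace(0, sensitivity / epsilon), following the
    calibration in Dwork, McSherry, Nissim, & Smith (2006).
    """
    rng = rng if rng is not None else np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query ("how many records satisfy P?") has sensitivity 1:
# adding or removing one person changes the count by at most 1.
rng = np.random.default_rng(seed=0)
print(laplace_mechanism(true_answer=1234, sensitivity=1.0, epsilon=0.5, rng=rng))
```

Smaller values of ε force larger noise; the choice of ε is precisely the accuracy-privacy tradeoff at the heart of the economic problem motivating this section.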

In assessing formal privacy as a framework for modeling data publication, it is natural to ask how much data accuracy can be provided while still maintaining privacy. Gupta, Roth, & Ullman (2012) establish a mechanism that is universally optimal for a broad class of data users, suggesting that technical efficiency could be guaranteed in private data publication without the need to learn about the preferences or side information of data consumers. However, Brenner & Nissim (2014) showed that this kind of universal optimality does not extend to broader types of data publication. Nissim, Orlandi, & Smorodinsky (2012) provide a clear and instructive illustration of how individual preferences for privacy can be modeled in mechanism design problems.

Several papers are more directly relevant to understanding how privacy affects the work of statistical agencies. Cummings, Echenique, & Wierman (2014) argue that privacy concerns can affect the way people report data, and show how properly designed privacy protection may mitigate misreporting. While there is a vast number of papers on the implementation of differentially private publication algorithms, a few are particularly relevant to how statistical agencies operate. Hardt, Ligett, & McSherry (2012) and Hardt & Rothblum (2010) provide the methods and theory for publication through online query systems. One problem with these methods is that their implementation depends on the underlying data. By contrast, Li, Miklau, Hay, McGregor, & Rastogi (2015) introduce the Matrix Mechanism, which is data-independent yet can still provide high accuracy at reasonable levels of privacy (we sketch the idea below). This is one of the methods under development for use with the 2020 Decennial Census. Other formal privacy systems currently in use at the Census Bureau are documented in Machanavajjhala, Kifer, Abowd, Gehrke, & Vilhuber (2008) and Haney et al. (2017). For practitioners, Machanavajjhala & Kifer (2015) provide an accessible and very useful overview of how formal privacy methods might be applied to real-world data publishing tasks.
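As a rough illustration of the Matrix Mechanism: a workload of linear counting queries, represented as a matrix W over a histogram vector x, is answered by posing a strategy matrix A with Laplace noise and then reconstructing the workload answers by least squares as W·A⁺·(Ax + noise). The sketch below is minimal; the prefix-sum workload and identity strategy are our illustrative choices, and the real contribution of Li et al. (2015) is choosing a strategy A that minimizes error for a given workload.

```python
import numpy as np

def matrix_mechanism(x, W, A, epsilon, rng=None):
    """Answer a workload W of linear counting queries over histogram x.

    Noisy answers to the strategy queries A are obtained with Laplace
    noise scaled to A's L1 sensitivity; the workload answers are then
    reconstructed as W @ pinv(A) @ (A @ x + noise), as in Li, Miklau,
    Hay, McGregor, & Rastogi (2015).
    """
    rng = rng if rng is not None else np.random.default_rng()
    sens_A = np.abs(A).sum(axis=0).max()  # max column L1 norm of A
    noisy_strategy = A @ x + rng.laplace(scale=sens_A / epsilon, size=A.shape[0])
    return W @ np.linalg.pinv(A) @ noisy_strategy

# Toy example: x holds counts for 4 cells; the workload asks all prefix
# sums, while the (illustrative) strategy asks each cell count directly.
x = np.array([10.0, 20.0, 30.0, 40.0])
W = np.tril(np.ones((4, 4)))   # prefix-sum workload
A = np.eye(4)                  # identity strategy
rng = np.random.default_rng(seed=0)
print(matrix_mechanism(x, W, A, epsilon=1.0, rng=rng))
```

Because the strategy A and the reconstruction W·A⁺ are fixed before the data are seen, the mechanism is data-independent in the sense noted above.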

Finally, so-called “privacy semantics” are mappings between mathematical definitions of privacy and plain language. These are important for practitioners because there is usually a gap between how laypeople think about privacy and how it is defined in the computer science literature. By way of introduction, we recommend He, Machanavajjhala, & Ding (2014) and Jorgensen, Yu, & Cormode (2015).

References

Brenner, H., & Nissim, K. (2014). Impossibility of differentially private universally optimal mechanisms. SIAM Journal on Computing, 43(5), 1513–1540. https://doi.org/10.1137/110846671

Cummings, R., Echenique, F., & Wierman, A. (2014). The empirical implications of privacy-aware choice. CoRR, abs/1401.0336. Retrieved from http://arxiv.org/abs/1401.0336

Dinur, I., & Nissim, K. (2003). Revealing information while preserving privacy. Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 202–210. https://doi.org/10.1145/773153.773173

Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. Proceedings of the Third Conference on Theory of Cryptography, 265–284. https://doi.org/10.1007/11681878_14

Gupta, A., Roth, A., & Ullman, J. (2012). Iterative constructions and private data release. Proceedings of the 9th International Conference on Theory of Cryptography, 339–356. https://doi.org/10.1007/978-3-642-28914-9_19

Haney, S., Machanavajjhala, A., Abowd, J. M., Graham, M., Kutzbach, M., & Vilhuber, L. (2017). Utility cost of formal privacy for releasing national employer-employee statistics. Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). https://doi.org/10.1145/3035918.3035940

Hardt, M., Ligett, K., & McSherry, F. (2012). A simple and practical algorithm for differentially private data release. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25 (pp. 2339–2347). Retrieved from http://papers.nips.cc/paper/4548-a-simple-and-practical-algorithm-for-differentially-private-data-release.pdf

Hardt, M., & Rothblum, G. N. (2010). A multiplicative weights mechanism for privacy-preserving data analysis. 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 61–70. https://doi.org/10.1109/FOCS.2010.85

He, X., Machanavajjhala, A., & Ding, B. (2014). Blowfish privacy: Tuning privacy-utility trade-offs using policies. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1447–1458. https://doi.org/10.1145/2588555.2588581

Jorgensen, Z., Yu, T., & Cormode, G. (2015). Conservative or liberal? Personalized differential privacy. 2015 IEEE 31st International Conference on Data Engineering, 1023–1034. https://doi.org/10.1109/ICDE.2015.7113353

Li, C., Miklau, G., Hay, M., McGregor, A., & Rastogi, V. (2015). The matrix mechanism: Optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6), 757–781. https://doi.org/10.1007/s00778-015-0398-x

Machanavajjhala, A., & Kifer, D. (2015). Designing statistical privacy for your data. Communications of the ACM, 58(3), 58–67. https://doi.org/10.1145/2660766

Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., & Vilhuber, L. (2008). Privacy: Theory meets practice on the map. Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, 277–286. https://doi.org/10.1109/ICDE.2008.4497436

Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. (2007). L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1). https://doi.org/10.1145/1217299.1217302

Nissim, K., Orlandi, C., & Smorodinsky, R. (2012). Privacy-aware mechanism design. Proceedings of the 13th ACM Conference on Electronic Commerce, 774–789. https://doi.org/10.1145/2229012.2229073

Sweeney, L. (2002). Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 571–588. https://doi.org/10.1142/s021848850200165x