Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R

Abstract

Synthetic data methods were designed to reconcile two conflicting demands placed on data holders: unlocking the research and policy potential of microdata while preserving the confidentiality of individuals. Recently, these methods have become more widely recognized in the UK, and the provision of bespoke synthetic data has been approved to expand the use of one of the UK Longitudinal Studies. Producing useful synthetic data, however, involves a substantial investment of research time, as it always requires some customising for the characteristics of an individual data set. At the same time, a substantial part of the process can be automated, which is essential when it has to be conducted rapidly and on a regular basis. This paper describes the application of synthetic data to the UK Longitudinal Studies, details the implementation process for the Scottish Longitudinal Study and presents the methods used in synthpop, an R package developed to facilitate the production of non-disclosive, entirely synthetic data. A reproducible example using open data illustrates the synthesising procedure and provides insights into the quality of synthetic data generated using different automated approaches.
