Affiliations: Department of Statistical Science, Duke University, Durham, NC, USA
Correspondence: David McClure, Department of Statistical Science, Duke University, Durham, NC 27708, USA. E-mail: [email protected]
Abstract: Several statistical agencies release synthetic microdata, i.e.,
data with all confidential values replaced with draws from statistical models,
in order to protect data subjects' confidentiality.
While fully synthetic data are safe from record linkage attacks, intruders might
be able to use the released synthetic values to estimate confidential values
for individuals in the collected data. We demonstrate and investigate this potential risk
using two simple but informative scenarios: a single continuous variable
possibly with outliers, and a three-way contingency table possibly with
small counts in some cells. Beginning with the case in which the intruder knows
all but one value in the confidential data, we examine the effect on risk of
decreasing the number of observations the intruder knows beforehand.
We generally find that releasing synthetic data (1) poses little risk to
records in the middle of the distribution, and (2) can pose some risks to
extreme outliers, although arguably these risks are mild. We also find
that the effect of removing observations from an intruder's background
knowledge depends heavily on how well that intruder can fill in those
missing observations: the risk remains fairly constant if the intruder can fill them in well,
and drops quickly otherwise.